fritzsedlazeck / SURVIVOR_ant

A framework to annotate SVs with previously known SVs (VCF file) and/or with genomic features (GFF and/or BED files)
MIT License

SV type aware annotations? #6

Open gnarzisi opened 6 years ago

gnarzisi commented 6 years ago

Is SURVIVOR_ant aware of the SV type (ins, del, dup, tra, etc) when annotating variants?

In our experiment, we provided a VCF file with different SV types (merged from multiple samples with SURVIVOR) and a BED file containing only inversions from the 1000 Genomes Project. All variants that overlapped the inversions were annotated, regardless of their type. The entries in the BED file look like this one:

1 10296138 10296749 INV:CINV_delly_INV00000621:CINV_delly:0.00039936

We are thinking of splitting our VCF by SV type, annotating each part separately, and then merging the results, but we were hoping SURVIVOR_ant could take care of it!

fritzsedlazeck commented 6 years ago

For INV and TRA it should report only genes that overlap with the individual breakpoints. For DEL, DUP and INS it reports genes that overlap with the breakpoints or lie within the interval.

However, be aware that there is currently a bug that sometimes causes an interval to be missed. I have opened an issue for it but cannot resolve it right now.
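The rule described above can be sketched in Python as follows. This is only an illustration of the stated behavior, not SURVIVOR_ant's actual code: the `SV` and `Feature` classes, the `feature_reported` function, and the inclusive coordinate comparison are all assumptions made for the example.

```python
from dataclasses import dataclass

# Types for which only the two breakpoints are compared (per the comment above).
BREAKPOINT_ONLY = {"INV", "TRA"}


@dataclass
class SV:
    sv_type: str
    chrom: str
    start: int
    end: int
    chrom2: str = ""  # chromosome of the second breakpoint; empty means same as chrom


@dataclass
class Feature:
    chrom: str
    start: int
    end: int


def feature_reported(sv: SV, feat: Feature) -> bool:
    """Return True if `feat` would be reported for `sv` under the described rule."""
    chrom2 = sv.chrom2 or sv.chrom
    # Does either breakpoint fall inside the feature (inclusive bounds, an assumption)?
    on_breakpoint = any(
        chrom == feat.chrom and feat.start <= pos <= feat.end
        for chrom, pos in ((sv.chrom, sv.start), (chrom2, sv.end))
    )
    if sv.sv_type.upper() in BREAKPOINT_ONLY:
        return on_breakpoint
    # DEL, DUP, INS: also report features lying entirely within the SV interval.
    within = (
        sv.chrom == feat.chrom
        and sv.start <= feat.start
        and feat.end <= sv.end
    )
    return on_breakpoint or within
```

Under this sketch, a gene at 1:5000-6000 is reported for a DEL spanning 1:1000-10000 but not for an INV with the same coordinates, because neither inversion breakpoint touches the gene.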

gnarzisi commented 6 years ago

Thank you for the info and for the heads up on the bug.

So, the overlap strategy differs slightly according to the SV type, but nothing prevents, for example, a deletion in the input VCF from being annotated with an inversion listed in the BED file (--bed parameter).

Is the exact same strategy applied when comparing the annotations/variants extracted from a VCF file (--vcf parameter) to the variants in the input VCF (--svvcf parameter)? Or does SURVIVOR_ant compare only variants of the same type in that case (e.g., DELs with DELs, INVs with INVs, etc.)?

fritzsedlazeck commented 6 years ago

Sorry, I was thinking about gene annotations (GFF files)... The same rule applies to BED files. For VCF file overlaps, it handles them similarly to SURVIVOR, but reports the overlap in the INFO field.

So you have some SV call sets from Delly, but in a BED file, and want to annotate a sample VCF with these calls?

gnarzisi commented 6 years ago

We want to annotate a list of variants from multiple samples (merged with SURVIVOR) with a list of known variants (from the 1000 Genomes Project, DGV, etc.).

We have the known variants grouped into different BED files according to their type, and we have been providing these files to SURVIVOR_ant using the --bed option. Do you recommend a different approach?

PS: we tried the --vcf parameter but without success, even when providing as input the dbVar VCF from your GitHub repo:

https://github.com/NCBI-Hackathons/svcompare/blob/master/test_data/dbvar_estd219.uniq.vcf.gz

The code fails with "segmentation fault" after outputting a long list of "Unknown type!"

fritzsedlazeck commented 6 years ago

Oh yeah, sorry for that. SURVIVOR and Sniffles had priority... The "Unknown type!" problem is probably related to MEI and CNV entries, which the VCF parser does not recognize, as it works on INS, DEL, DUP, INV, TRA and BND (for TRA and INV events).

What I usually do is merge the 1000 Genomes call set with the VCF file (e.g., from your multiple samples) using SURVIVOR, but that might be a bit stringent.

For the BED file version you should be able to identify largely overlapping SVs, but you might need to filter the output a bit. Let me know how it goes. I am happy to work on this again, but it will take me a while... Sorry, Fritz
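One way the filtering mentioned for the BED route could be made type-aware is to compare each variant's SV type with the type prefix in the BED name column, as in the entry quoted earlier in this thread (`INV:CINV_delly_INV00000621:...`). The Python sketch below assumes the name column always follows that `TYPE:...` convention. The helper names `bed_entry_type` and `same_type` are hypothetical, and how the matched BED entry is read back from SURVIVOR_ant's output is not covered here.

```python
from typing import Optional


def bed_entry_type(bed_line: str) -> Optional[str]:
    """Return the SV type prefix of the BED name column (4th field), e.g. 'INV'.

    Assumes names of the form 'TYPE:rest', as in the example entry in this thread.
    Returns None when the line has no name column.
    """
    fields = bed_line.split()
    if len(fields) < 4:
        return None
    return fields[3].split(":", 1)[0].upper()


def same_type(sv_type: str, bed_line: str) -> bool:
    """True if the variant's type matches the type encoded in the BED entry."""
    return bed_entry_type(bed_line) == sv_type.upper()
```

With the entry from the first comment, `same_type("INV", line)` holds while `same_type("DEL", line)` does not, so a deletion that merely overlaps a known inversion could be dropped in post-processing.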

gnarzisi commented 6 years ago

OK, Thanks.

Which one of these two options would you recommend then:

  1. Merge (with SURVIVOR) the multi-sample VCF with the 1000 Genomes call set VCF file (this can be made type-aware by setting the appropriate option in SURVIVOR merge).

  2. (i) Split the multi-sample VCF by SV type; (ii) annotate (with SURVIVOR_ant) each of the VCFs separately by type using the corresponding BED file; (iii) merge together the annotated VCF files.
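Step (i) of option 2 could look roughly like the Python sketch below. It assumes every record carries an `SVTYPE=` entry in the INFO column and that the input is uncompressed, tab-separated VCF text. The function names are made up for the example, and the annotation and re-merging steps (ii) and (iii) are not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def sv_type(record_line: str) -> str:
    """Extract SVTYPE from the INFO column (8th field) of a VCF record line."""
    info = record_line.rstrip("\n").split("\t")[7]
    for entry in info.split(";"):
        if entry.startswith("SVTYPE="):
            return entry[len("SVTYPE="):]
    return "UNKNOWN"  # records without SVTYPE are grouped separately


def split_by_type(vcf_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group VCF lines by SV type; each group starts with the full header."""
    header: List[str] = []
    by_type: Dict[str, List[str]] = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            by_type[sv_type(line)].append(line)
    return {t: header + records for t, records in by_type.items()}
```

Each value of the returned dictionary can be written to its own file (for example one per type) and passed to the annotation step together with the BED file for that type.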

fritzsedlazeck commented 6 years ago

Option 1 is the one I would recommend. You can try different thresholds on the distance. That should be a cleaner solution.