fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
354 stars 47 forks source link

SURVIVOR merge takes cross chromosomes to merge #63

Closed unique379r closed 5 years ago

unique379r commented 5 years ago

Hi I am trying to merge vcf from breakdancer, delly and lumpy but it seems SURVIVOR taking it wrong while merging. In breakdancer, the call is DEL in chr3 whereas lumpy has chr6 with DUP with more or less same position and SURVIVOR merges this entry as DUP. I think, its bug ??

Can you also explain? how strand information is used while merging the SVtype from VCF?

`## command ls *.vcf > samples.txt

brekdancer.vcf delly.vcf lumpy.vcf

SURVIVOR_v1.0.5

SURVIVOR merge samples.txt 1000 2 1 1 0 50 mergedSV.vcf

Original entry

lumpy

grep '80455101' lumpySVOut/*sort.vcf chr6 80455101 40289 N . . SVTYPE=DUP;STRANDS=-+:9;SVLEN=6442;END=80461543;CIPOS=0,0;CIEND=0,0;CIPOS95=-1219,2;CIEND95=-2,1327;IMPRECISE;SU=9;PE=9;SR=0 GT:SU:PE:SR ./.:9:9:0

Breakdancer

chr3 80455996 . . . . PASS PROGRAM=breakdancer;SVTYPE=DEL;SVLEN=5547;SVEND=80461543 GT:DP 1/.:0

Results

Problemetic merged Entry:

it comes from BreakDancer and Lumpy

chr3 80455101 DUP004590SUR N . PASS SUPP=2;SUPP_VEC=101;AVGLEN=-5994.5;SVTYPE=DUP;SVMETHOD=SURVIVORv2;CHR2=chr3;END=80461543;CIPOS=0,895;CIEND=0,0;STRANDS=-+ GT:PSV:LN:DR:ST:TY:CO 1/.:NA:5547:0,0:+-:DEL:chr3_80455996-chr3_80461543 ./.:NaN:0:0,0:--:NaN:NaN ./.:NA:6442:0,9:-+:DUP:chr3_80455101-chr3_80461543'

fritzsedlazeck commented 5 years ago

Hi, thanks for reaching out. I think I fixed a related bug several weeks ago. Can you clone and recompile the code from github and check? Thanks Fritz

unique379r commented 5 years ago

Hi thank you for your quick reply...i think its lucky day, since i was stuck here from many days. Yeah its working now. But I have to check with some other samples too. Meanwhile can you please explain how STRANS information is used while mergeing vcfs? The reason i am asking because Breakdancer does not have the STRANDS information while lumpy does. And in case of delay its has STRAND but tag name "CT=5to3/3to5" etc.

Thank you once again for your kind help.

fritzsedlazeck commented 5 years ago

Sure sorry I forgot to comment. So strands are read in from the individual VCF. Most of the time its like +/- or 3/5 (delly). These are matched and compared. There is some risk in there since not every caller is reporting it in the same meaning so I usually dont use this as a filter criteria.

If a caller does not have a strand information then SURVIVOR is making a guess. I know this is not optimal, but not much I can do there. Thanks Fritz

unique379r commented 5 years ago

What is definition of SVTYPE=INS by SURVIVOR? I found start and end has big difference in breakdancer? Which should not be as per my understanding from different tool representation such as Delly. "START and END should have only "1" nucleotide difference. OR some other tool such as WHAMG represents "Insertions start and end at the same position".

WHAMG : Insertions start and end at the same position
chr1 1442939 . T <ins> . . SVTYPE=INS;END=1442939;

Delly represents start and start+1=END
chr11 58325539 INS00059630 G GCACACACATGTACACA . PASS PRECISE;SVTYPE=INS;SVMETHOD=EMBL.DELLYv0.7.8;CHR2=chr11;END=58325540;PE=0;MAPQ=60;CT=NtoN;CIPOS=-10,10;CIEND=-10,10;INSLEN=16

but BREAKDancer output has some different way of representation which I really don't understand.
chr1 24662379 . . . . PASS PROGRAM=breakdancer;SVTYPE=INS;SVLEN=-3686;SVEND=24721723

Original output from breakdancer
chr 1 24662379 59+59- chr 1 24721723 59+59- INS -3686 99 37 realigned.bam|37 -nan NA

My command (SURVIVOR_v1.0.6 new commit)
SURVIVOR merge samples.txt 1000 2 1 1 0 50 mergedSV.vcf

Can you please give some advice to overcome this problem? As, while using SURVIVOR merge its not working due to conflict of END within the breakdancer vcf.Note: i tried to make brekadancer svtype as Delly i.e. start END=start+1 and left SVLEN as it is but eventhough SURVIVOR is not taking to integrate them, not even single hit whthin 60 samples.

fritzsedlazeck commented 5 years ago

I see from the example here that for the breakdancer you specify SVEND (which is not recognized by SURVIVOR). Change it into END like Delly and other methods have it. SVLEN is fine, but should be positive (only for deletion it is negative).

For INS: that are insertions. You are right that it is defined as you said the start, end gives the position of the insertion or the region were the insertion occurs, but not its length.

SURVIVOR reads it a bit differently as it takes the start + end + length into account. I hope that helps. Fritz

unique379r commented 5 years ago

Hi again, Thank you so ,much for your gentle reply. I will work on this, f its work then fine otherwise will take just start position from all different tool and compare it since the start, end gives the position of the insertion or the region were the insertion occurs. I feel really don't need SURVIVOR to do this job.

By the can you guide me, how to calculate Allele Frequency along with homo and heterozygous from Merge vcf ? I was thinking to take DR (supporting reference,variant reads in that order) and calculate the Allele freq. all freq = alt read/ref reads+alt reads.

Is that fine? Since, delly have only heterozygous and homozygous information, so i guess i have to stick with it.

output from survivor

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PSV,Number=1,Type=String,Description="Previous support vector">
##FORMAT=<ID=LN,Number=1,Type=Integer,Description="predicted length">
##FORMAT=<ID=DR,Number=2,Type=Integer,Description="# supporting reference,variant reads in that order">
##FORMAT=<ID=ST,Number=1,Type=String,Description="Strand of SVs">
##FORMAT=<ID=QV,Number=1,Type=String,Description="Quality values: if not defined a . otherwise the reported value.">
##FORMAT=<ID=TY,Number=1,Type=String,Description="Types">
##FORMAT=<ID=ID,Number=1,Type=String,Description="Variant ID from input.">
##FORMAT=<ID=RAL,Number=1,Type=String,Description="Reference allele sequence reported from input.">
##FORMAT=<ID=AAL,Number=1,Type=String,Description="Alternative allele sequence reported from input.">
##FORMAT=<ID=CO,Number=1,Type=String,Description="Coordinates">

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  default delly   lumpy
chr17   39848828    130992  N   <DEL>   .   PASS    SUPP=2;SUPP_VEC=011;SVLEN=-31114809;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.6;CHR2=chr17;END=70963595;CIPOS=0,2;CIEND=0,85;STRANDS=+-   
GT:PSV:LN:**DR**:ST:QV:TY:ID:RAL:AAL:CO ./.:NaN:0:0,0:--:NaN:NaN:NaN:NAN:NAN:NAN     0/1:NA:31114850:97,0:+-:.:DEL:DEL00093894:NA:NA:chr17_39848830-chr17_70963680  ./.:NA:31114767:0,30:+-:.:DEL:130992:NA:NA:chr17_39848828-chr17_70963595
fritzsedlazeck commented 5 years ago

Just to be clear SURVIVOR also compares the size of the insertion and does the whole thing for every type of variant. If you are just focused on locations of ins you are right.

So allele count (across samples) can be taken from SUPP=. However, this does not take the 0/1 or 1/1 into account. If you want to take this into account you need to write your own script. The allele frequency within the sample is exactly what you suggest.

Cheers Fritz

unique379r commented 5 years ago

Thank you very much Fritz for your help and support.