fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
354 stars 47 forks source link

Loss of variants after merging to union and changed depth #100

Closed weishwu closed 5 years ago

weishwu commented 5 years ago

Hi Fritz,

I'm using SURVIVOR to merge Manta and Delly calling results and my command line is: SURVIVOR merge svcalls.txt 1000 1 1 0 0 0 merged_SV.vcf

I noticed that the Manta calls in the output are much fewer than in the input, and I found out that some Manta calls were merged into one call in the output. For example, I have these two separate entries in the Manta input:

chr1 1649043 MantaDEL:317885:0:0:0:0:1 CGCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTCGGTCTGTGACACACGCAT C 224 PASS END=1649094;SVTYPE=DEL;SVLEN=-51;CIGAR=1M51D;CIPOS=0,50;HOMLEN=50;HOMSEQ=GCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTCGGTCTGTGACACACGCA GT:FT:GQ:PL:PR:SR 0/1:PASS:138:210,0,135:4,0:13,17

chr1 1649118 MantaDEL:77829:1:1:2:0:0 TCTGGTTTTCGGTCTGTGACACACGCACACTTTCAGCTGGAGTATCCTCTCTA TG 928 PASS END=1649170;SVTYPE=DEL;SVLEN=-52;CIGAR=1M1I52D GT:FT:GQ:PL:PR:SR 0/1:PASS:301:523,0,298:7,0:21,17

They became one entry in the output:

chr1 1649043 MantaDEL:77829:1:1:2:0:0 TCTGGTTTTCGGTCTGTGACACACGCACACTTTCAGCTGGAGTATCCTCTCTA TG 928 PASS SUPP=1;SUPP_VEC=01;SVLEN=-52;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.6;CHR2=chr1;END=1649094;CIPOS=0,75;CIEND=0,76;STRANDS=+- GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO ./.:NaN:0:0,0:--:NaN:NaN:NaN:NAN:NAN:NAN 0/1:NA:52:1089,704:+-:224,928:DEL,DEL:MantaDEL:77829:1:1:2:0:0:TCTGGTTTTCGGTCTGTGACACACGCACACTTTCAGCTGGAGTATCCTCTCTA:TG:chr1_1649043-chr1_1649094,chr1_1649118-chr1_1649170

I understand that these two variants look like the same SV that was called differently by Manta, but I wonder how SURVIVOR handles this exactly. It seems that SURVIVOR picked up the Manta ID, alleles, and some of the INFO fields from the second entry in the input, but picked up coordinates and some other INFO fields from the first entry. So how does SURVIVOR decide where to pick up the data and what is the criteria to merge entries?

And it seems the allelic-depth was changed dramatically. In the input entries, the PR:SR numbers are 4,0:13,17 and 7,0:21,17 for the two entries separately. However the DR numbers in the output are 1089,704. How does SURVIVOR calculate DR and why is it so different from the input data?

Thanks.

fritzsedlazeck commented 5 years ago

Hi, yes the criteria for a merge giving the parameters you listed above are: Two SV are beeing merged if: +-1kbp on start and stop breakpoints are the same and the type of the SV are the same.

I will take a look at the DR numbers... these are always hard because almost every SV caller are reporting them slightly different. I hope that helps Fritz

weishwu commented 5 years ago

Thanks for your prompt reply! Could I also ask how SURVIVOR picks up what to report when it merges two entries? For my example, It seems that SURVIVOR picked up the Manta ID, alleles, and some of the INFO fields from the second entry in the input, but picked up coordinates and some other INFO fields from the first entry.

fritzsedlazeck commented 5 years ago

Sorry I forgot that.. Sure, SURVIVOR sorts the coordinates over the start position of all overlapping SV and reports the median position. Then it takes that information from that entry that reflects this position. The information per SV is stored in the genotype fields to have everything in the resulting vcf file.

weishwu commented 5 years ago

Again thanks for prompt reply.

In my example, SURVIVOR reports 1649043 and 1649094 as the coordinates, which are the same as the 1st entry in the input. However SURVIVOR takes the Manta ID "MantaDEL:77829:1:1:2:0:0", allele sequences, and SVLEN from the 2nd entry. CIPOS and CIEND are different from 1st entry too. This seems to be against what you said ("Then it takes that information from that entry that reflects this position.")

fritzsedlazeck commented 5 years ago

Ok can I ask if you cloned the latest or used the version / conda to install it? I need to release a new version and update conda. I know in the past there was an inconsistency.

CPOS and CEND are recomputed from the breakpoints reported in the entries.

Happy to look at it. Thanks Fritz

weishwu commented 5 years ago

I did install it with conda yesterday. It's Version: 1.0.6.

fritzsedlazeck commented 5 years ago

ah. Please clone from the current github. The installation is very easy. sorry I haven't updated it in a while because I didn't officially release a new version.

weishwu commented 5 years ago

I see. I'll try the latest version. Thanks a lot!

fritzsedlazeck commented 5 years ago

thanks and sorry for the confusion. Keep me posted. Fritz