fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
354 stars 47 forks source link

merging svaba #139

Open anoronh4 opened 3 years ago

anoronh4 commented 3 years ago

i was unable to merge SvABA calls using SURVIVOR and I don't see any issues regarding SvABA, so this might be a new topic for this tool. My command is: SURVIVOR merge sample_files 100 1 0 0 0 50 sample_merged.vcf i am allowing a minimum of one caller but SURVIVOR seems to simply skip svaba ones, probably not recognizing them as SVs. and i have generated a simple comparison matrix (the third vcf is svaba):

$ cat sample_merged.mat.txt 
152 122 0   125 
122 545 0   406 
0   0   0   0   
125 406 0   770 

a few sample lines from the svaba vcf include:

$ grep -v ^# svaba.somatic.sv.vcf | head
1   15142819    9184:1  T   [1:15145551[T   33  PASS    DISC_MAPQ=60;EVDNC=ASDIS;HOMSEQ=TTG;MAPQ=60;MATEID=9184:2;MATENM=0;NM=0;NUMPARTS=2;SCTG=c_1_15141001_15166001_51C;SPAN=2732;SVTYPE=BND  GT:AD:DP:GQ:PL:SR:DR:LR:LO  0/0:0:40:10.8:0,10.8,118.8:0:0:10.93:0  0/1:13:36:33.2:33.2,0,53:6:10:-33.08:33.38
1   15143384    9186:1  C   C]1:15144465]   22  PASS    DISC_MAPQ=60;EVDNC=ASDIS;HOMSEQ=GT;MAPQ=60;MATEID=9186:2;MATENM=0;NM=0;NUMPARTS=2;SCTG=c_1_15141001_15166001_132C;SPAN=1081;SVTYPE=BND  GT:AD:DP:GQ:PL:SR:DR:LR:LO  0/0:0:42:11.4:0,11.4,125.4:0:0:11.47:0  0/1:9:26:22.7:22.7,0,39.2:3:8:-22.61:22.9
1   15144465    9186:2  T   T]1:15143384]   22  PASS    DISC_MAPQ=57;EVDNC=ASDIS;HOMSEQ=GT;MAPQ=60;MATEID=9186:1;MATENM=0;NM=0;NUMPARTS=2;SCTG=c_1_15141001_15166001_132C;SPAN=1081;SVTYPE=BND  GT:AD:DP:GQ:PL:SR:DR:LR:LO  0/0:0:42:11.4:0,11.4,125.4:0:0:11.47:0  0/1:9:26:22.7:22.7,0,39.2:3:8:-22.61:22.9

Svaba seems to only return BND, not sure if that's an issue. my merge command allows a minimum of one caller and ignores conflicting types.

anoronh4 commented 3 years ago

here's a related example. we have two vcfs that are both called by SvABA on the same sample but vcf A was run by us and vcf B was called by another organization. I was able to find two calls that are likely referring to the same SV but vcf B was included but vcf A was excluded. calls from vcf A:

1   15142819    9184:1  T   [1:15145551[T   33  PASS    DISC_MAPQ=60;EVDNC=ASDIS;HOMSEQ=TTG;MAPQ=60;MATEID=9184:2;MATENM=0;NM=0;NUMPARTS=2;SCTG=c_1_15141001_15166001_51C;SPAN=2732;SVTYPE=BND  GT:AD:DP:GQ:PL:SR:DR:LR:LO  0/0:0:40:10.8:0,10.8,118.8:0:0:10.93:0  0/1:13:36:33.2:33.2,0,53:6:10:-33.08:33.38
1   15145551    9184:2  C   [1:15142819[C   33  PASS    DISC_MAPQ=60;EVDNC=ASDIS;HOMSEQ=TTG;MAPQ=60;MATEID=9184:1;MATENM=0;NM=0;NUMPARTS=2;SCTG=c_1_15141001_15166001_51C;SPAN=2732;SVTYPE=BND  GT:AD:DP:GQ:PL:SR:DR:LR:LO  0/0:0:40:10.8:0,10.8,118.8:0:0:10.93:0  0/1:13:36:33.2:33.2,0,53:6:10:-33.08:33.38

calls from vcf B:

1   15142891    3476:no_id:1    G   [1:15145551[G   .   PASS    BLACKLIST=0;EVDNC=ASDIS;HOMSEQ=CAA;MAPQ=60;MATEID=3476:no_id:2;MATEMAPQ=60;NDISC=0;NSPLIT=0;NUMPARTS=2;SCTG=c_1_15142501_15148501_16;SPAN=2660;SVTYPE=BND;TDISC=8;TSPLIT=4  NALT:NALT_RP:NALT_SR:READ_ID    0:0:0:. 11:8:4:HS7_5723-2-46-16007-41348,HS14_5615-1-63-4915-9134,IL34_5198-4-118-3337-18138,IL34_5198-7-27-17628-16584,IL34_5198-8-6-16510-14867,HS17_5841-7-1201-14741-147696,IL34_5198-2-106-13813-18791,IL34_5198-3-107-18322-17933,HS14_5615-2-3-9163-127602,HS14_5615-2-4-15789-93334,IL34_5198-2-72-5044-15786
1   15145551    3476:no_id:2    C   [1:15142891[C   .   PASS    BLACKLIST=0;EVDNC=ASDIS;HOMSEQ=CAA;MAPQ=60;MATEID=3476:no_id:1;MATEMAPQ=60;NDISC=0;NSPLIT=0;NUMPARTS=2;SCTG=c_1_15142501_15148501_16;SPAN=2660;SVTYPE=BND;TDISC=8;TSPLIT=4  NALT:NALT_RP:NALT_SR:READ_ID    0:0:0:. 11:8:4:HS7_5723-2-46-16007-41348,HS14_5615-1-63-4915-9134,IL34_5198-4-118-3337-18138,IL34_5198-7-27-17628-16584,IL34_5198-8-6-16510-14867,HS17_5841-7-1201-14741-147696,IL34_5198-2-106-13813-18791,IL34_5198-3-107-18322-17933,HS14_5615-2-3-9163-127602,HS14_5615-2-4-15789-93334,IL34_5198-2-72-5044-15786

Thinking that my minimum length parameter was excluding some of the calls, changed minimum length to 0. This did not solve the issue -- 100% of entries had SUPP_VEC=10.

anoronh4 commented 3 years ago

well i think i figured it out. SURVIVOR does not like the GT field, but this is (somewhat) remedied by specifying minimum caller to be 0 instead of 1. Then all the calls accepted from vcf A say SUPP_VEC=00. not ideal, so instead I'll just have to clean out GT from the vcf before merging.

I won't close this issue myself because I think the function could be improved by adding options for handling GT.