fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License

behavior of SURVIVOR merge on multiple samples from multiple callers #135

Open danrlu opened 3 years ago

danrlu commented 3 years ago

Thank you for making such a useful tool!!

We have multiple samples, and each sample has multiple vcfs generated by different SV callers. Based on the discussion in #95, we should do the following:

STEP 1: merge all vcfs for the same sample into 1 vcf per sample with SURVIVOR merge

The result is a multi-column vcf, with headers copied from the individual vcfs:

Resulting vcf for Sample1:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1_manta   Sample1_delly   Sample1_smoove
I   20105   101 N   <DEL>   180 PASS    SUPP=2;SUPP_VEC=101;SVLEN=95;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.7;CHR2=I;END=20163;CIPOS=0,819;CIEND=0,892;STRANDS=+-  GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:131:10,0:--:180:INV:INV00000001:NA:NA:I_20924-I_21055    ./.:NaN:0:0,0:--:NaN:NaN:NaN:NAN:NAN:NAN    0/0:NA:58:0,7:+-:0:DEL:101:NA:NA:I_20105-I_20163

For Sample2:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample2_manta   Sample2_delly   Sample2_smoove
I   20105   DEL000SUR   AAATTTTTTTTCCGCAAAATCAGGAAAAATTCAGAAAAAGACAGTCAAAAAATTGTAGA ATC 601 PASS    SUPP=3;SUPP_VEC=111;SVLEN=82;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.7;CHR2=I;END=20163;CIPOS=0,819;CIEND=0,892;STRANDS=+-  GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/0:NA:131:11,0:--:180:INV:INV00000001:NA:NA:I_20924-I_21055    1/1:NA:58:0,18:+-:601:DEL:I_20105_20163_-58:AAATTTTTTTTCCGCAAAATCAGGAAAAATTCAGAAAAAGACAGTCAAAAAATTGTAGA:ATC:I_20105-I_20163 1/1:NA:58:0,7:+-:354:DEL:101:NA:NA:I_20105-I_20163
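
To check which callers support a merged record, SUPP_VEC can be read as one digit per input column, in column order. The following is a minimal sketch under that assumption; the helper name `supporting_columns` is hypothetical and not part of SURVIVOR, and the INFO string is copied from the Sample1 record above.

```python
def supporting_columns(info: str, columns: list[str]) -> list[str]:
    """Return the sample columns flagged with '1' in SUPP_VEC.

    Assumes SUPP_VEC carries one digit per sample column, in column order.
    """
    # INFO is a ';'-separated list of KEY=VALUE pairs (flags have no '=').
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    supp_vec = fields["SUPP_VEC"]
    if len(supp_vec) != len(columns):
        raise ValueError("SUPP_VEC length does not match the number of sample columns")
    return [name for name, bit in zip(columns, supp_vec) if bit == "1"]


# INFO field of the Sample1 record from STEP 1.
info = (
    "SUPP=2;SUPP_VEC=101;SVLEN=95;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.7;"
    "CHR2=I;END=20163;CIPOS=0,819;CIEND=0,892;STRANDS=+-"
)
columns = ["Sample1_manta", "Sample1_delly", "Sample1_smoove"]
print(supporting_columns(info, columns))
```

Read this way, SUPP_VEC=101 matches the Sample1 record, where the delly column is all missing values (./. and NaN).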


STEP 2: combine the 1 vcf per sample for all samples with SURVIVOR merge

The header uses the first sample column of each input vcf above, but the fields seem to be re-computed by combining the columns within each input vcf.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1_manta   Sample2_manta
I   20105   DEL000SUR   AAATTTTTTTTCCGCAAAATCAGGAAAAATTCAGAAAAAGACAGTCAAAAAATTGTAGA ATC 601 PASS    SUPP=2;SUPP_VEC=11;SVLEN=-89;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.7;CHR2=I;END=20163;CIPOS=0,0;CIEND=0,0;STRANDS=+-  GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:101:95:0,0:+-:180:DEL:101:NA:NA:I_20105-I_20163 1/1:111:82:0,0:+-:601:DEL:DEL000SUR:AAATTTTTTTTCCGCAAAATCAGGAAAAATTCAGAAAAAGACAGTCAAAAAATTGTAGA:ATC:I_20105-I_20163
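
To compare the per-sample fields between STEP 1 and STEP 2, it helps to pair the FORMAT keys with a sample column's values. This is a minimal sketch; `parse_sample` is a hypothetical helper, and the strings are copied from the Sample1_manta column of the STEP 2 record above.

```python
def parse_sample(format_field: str, sample_field: str) -> dict[str, str]:
    """Pair each ':'-separated FORMAT key with the matching sample value."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample column have different field counts")
    return dict(zip(keys, values))


format_field = "GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO"
sample1 = "0/1:101:95:0,0:+-:180:DEL:101:NA:NA:I_20105-I_20163"

fields = parse_sample(format_field, sample1)
print(fields["GT"], fields["PSV"], fields["LN"])
```

In this record the PSV and LN values (101 and 95) equal the SUPP_VEC and SVLEN of the Sample1 record from STEP 1. That is an observation from these two records only, not a documented guarantee.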


I was a bit confused by this statement in the discussion in #127:

"No, SURVIVOR does not take the GT into account, as many tools often don't report the GT."

This was referring to STEP 1, right? In that case the GT field was simply copied over from the input vcfs. In STEP 2, however, it looks like the GT most different from REF was kept when merging the calls for the same sample (1/1 > 0/1 > 0/0)?
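
The precedence guessed at above can be written down as a small rule and checked against the two records. This is only a sketch of the hypothesis, not SURVIVOR's confirmed logic; `alt_allele_count` and `most_non_ref` are hypothetical names, and the genotype lists are the per-caller GTs from the STEP 1 records.

```python
def alt_allele_count(gt: str) -> int:
    """Count non-reference alleles in a GT string; missing alleles ('.') count as 0."""
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def most_non_ref(gts: list[str]) -> str:
    """Hypothesized rule: keep the genotype with the most non-reference alleles.

    On ties the first genotype in the list is returned.
    """
    return max(gts, key=alt_allele_count)


# Per-caller GTs (manta, delly, smoove) from the STEP 1 records.
sample1_gts = ["0/1", "./.", "0/0"]
sample2_gts = ["0/0", "1/1", "1/1"]

print(most_non_ref(sample1_gts))  # STEP 2 reports 0/1 for Sample1
print(most_non_ref(sample2_gts))  # STEP 2 reports 1/1 for Sample2
```

Both results agree with the GTs in the STEP 2 record, so the two examples are consistent with the hypothesis but do not prove it. The sketch also does not settle how ./. and 0/0 would be ranked against each other.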

The options were SURVIVOR merge ... 1000 1 0 0 1 30, and the version is 1.0.7 from bioconda.
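
For readers unfamiliar with the positional arguments, the six numbers can be labelled as below. The labels are descriptive names chosen here from a reading of the SURVIVOR merge usage text, so they are assumptions; the usage printed by the installed version is the reference. The elided part of the command (input list and output file) is left out.

```python
# Descriptive labels (assumed, not official option names) for the six
# numeric positional arguments of SURVIVOR merge, in order.
MERGE_PARAM_LABELS = [
    "max_breakpoint_distance",
    "min_supporting_inputs",
    "require_same_type",
    "require_same_strand",
    "estimate_distance_from_sv_size",
    "min_sv_size",
]


def label_merge_params(values: list[str]) -> dict[str, int]:
    """Map the numeric arguments to the labels above."""
    if len(values) != len(MERGE_PARAM_LABELS):
        raise ValueError("expected exactly six numeric arguments")
    return {label: int(value) for label, value in zip(MERGE_PARAM_LABELS, values)}


params = label_merge_params("1000 1 0 0 1 30".split())
for label, value in params.items():
    print(f"{label} = {value}")
```

Under this reading, type and strand agreement were both switched off (0) and a single supporting input was enough for a record to be kept.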

Thanks! Dan