bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

Are multi-nucleotide and complex variants ignored? #129

Open rebber opened 5 months ago

rebber commented 5 months ago

Hi,

We use somaticseq only to merge variants from Mutect2 and HMF Tools SAGE (the latter as "arbitrary" VCFs); the classification module is not currently used. However, we were missing some multi-nucleotide variants (MNVs) in the somaticseq output, so I looked into the somaticseq code to see how they are handled. It seems that any variant in the input VCFs whose REF and ALT are both longer than one base is ignored.

I see the following division into SNVs and indels, both in modify_ssMuTect2.py and in splitVcf.py (used to prepare arbitrary VCFs):

if len(vcf_i.refbase) == 1 and len(vcf_i.altbase) == 1:
    snv_out.write( new_line + '\n' )
elif len(vcf_i.refbase) == 1 or len(vcf_i.altbase) == 1:
    indel_out.write( new_line + '\n' )

Any other variant, i.e., one with len(vcf_i.refbase) > 1 and len(vcf_i.altbase) > 1, is skipped.
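To make the fall-through concrete, here is a minimal sketch of the same length checks applied to plain REF/ALT strings. The function name classify_variant and the example alleles are hypothetical and not part of somaticseq; the sketch only mirrors the branching quoted above.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Mirror the quoted length checks (hypothetical helper, not somaticseq code)."""
    if len(ref) == 1 and len(alt) == 1:
        return "snv"
    elif len(ref) == 1 or len(alt) == 1:
        return "indel"
    # Both REF and ALT are longer than one base: the quoted code has no
    # branch for this case, so such records are never written out.
    return "skipped"


# Made-up alleles: an SNV, an insertion, a deletion, an MNV, a complex variant.
for ref, alt in [("A", "T"), ("A", "ATG"), ("ATG", "A"), ("AC", "GT"), ("ACG", "TT")]:
    print(ref, alt, classify_variant(ref, alt))
```

With these inputs, the last two records (an MNV and a complex variant) are the ones that end up in neither output file.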

Is it correct that MNVs and complex variants are ignored? What was the reasoning behind setting it up like this? Is there any way to work around it?

We do not want to miss these types of variants, and will have to look into other tools if we can't avoid this behaviour with somaticseq.

Best regards, Rebecka

litaifang commented 5 months ago

Yeah, MNVs and complex variants are limitations because they just haven't been our focus. For VarDict, MNVs are parsed into multiple SNVs. I can do similar things for MuTect2 and other outputs, i.e., parse MNVs into multiple SNVs, and parse complex variants into SNVs and indels. If you can give me examples of complex variants and MNVs from those outputs, I can incorporate them.
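For illustration, parsing an MNV into multiple SNVs could look roughly like the sketch below. This is an assumption about the general idea, not the code somaticseq uses for VarDict; the function name split_mnv and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def split_mnv(chrom: str, pos: int, ref: str, alt: str) -> List[Tuple[str, int, str, str]]:
    """Split an equal-length REF/ALT substitution into per-base SNVs.

    Hypothetical helper: returns (chrom, pos, ref_base, alt_base) tuples,
    one for each position where REF and ALT differ.
    """
    if len(ref) != len(alt):
        raise ValueError("not an MNV: REF and ALT differ in length")
    return [
        (chrom, pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base  # matching bases carry no variant
    ]


# Made-up record: ACG>TCA at chr1:100 differs at the first and third base.
print(split_mnv("chr1", 100, "ACG", "TCA"))
# → [('chr1', 100, 'A', 'T'), ('chr1', 102, 'G', 'A')]
```

Complex variants, where REF and ALT differ in length, would need an alignment step to separate the substitutions from the indel and are not covered by this sketch.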

rebber commented 5 months ago

Thanks for the quick reply! Primarily, we want to keep MNVs and complex variants together in order to get proper annotation of them by VEP. We will therefore look into another solution for merging variants from different callers.