bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

The format of Truth VCF files #42

Closed mingyi-wang closed 7 years ago

mingyi-wang commented 7 years ago

I tried to run SomaticSeq to train a model from truth VCF files. However, after running the SSeq_merged.vcf2tsv.py step, I found that the last column (TrueVariant_or_False) of Ensemble.sSNV.tsv is incorrect: some true variants are tagged as "0". I traced it back and found that the truth VCF file I used orders chromosomes in the default (lexicographic) ordering (1, 10, 11, ..., 2, MT, X) rather than natural ordering (1, 2, ..., 10, 11, ..., MT, X). Is this what caused the errors in the last column? If so, what is the ordering requirement for a truth VCF? Thanks,
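For readers unfamiliar with the two orderings mentioned above, here is a minimal Python sketch of the difference. The contig list and the `natural_key` helper are made up for illustration and are not part of SomaticSeq.

```python
# Illustrative contig names only; a real file would take these from its header or records.
contigs = ["1", "2", "10", "11", "MT", "X"]

# Plain string sorting compares character by character, so "10" sorts before "2".
lexicographic = sorted(contigs)


def natural_key(name):
    """Hypothetical sort key: numeric contigs by value first, then the rest alphabetically."""
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


natural = sorted(contigs, key=natural_key)

print(lexicographic)  # ['1', '10', '11', '2', 'MT', 'X']
print(natural)        # ['1', '2', '10', '11', 'MT', 'X']
```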

litaifang commented 7 years ago

Yes, all the input BAM and VCF files are assumed to be ordered in the same order as the input reference file. So if some files are ordered 1, 10, 11, ..., and others are ordered 1, 2, 3, ..., they will not be annotated correctly. I plan to introduce an ordering check in the future that produces a warning if they aren't ordered correctly.
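Until such a check exists, one way to verify a file yourself is sketched below. This is not SomaticSeq code: the function names are hypothetical, and it assumes the reference contig order is taken from the first column of a tab-separated FASTA index (`.fai`) and that the VCF is read as plain-text lines. It only checks contig order, not position order within a contig.

```python
def reference_contigs(fai_lines):
    """Return contig names in reference order (first column of each .fai line)."""
    return [line.split("\t")[0].strip() for line in fai_lines if line.strip()]


def vcf_contig_runs(vcf_lines):
    """Return the contigs of the VCF records in order of appearance, collapsing repeats."""
    runs = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom = line.split("\t", 1)[0]
        if not runs or runs[-1] != chrom:
            runs.append(chrom)
    return runs


def follows_reference_order(vcf_lines, ref_contigs):
    """True if every contig is in the reference and contigs appear in reference order."""
    rank = {name: i for i, name in enumerate(ref_contigs)}
    previous = -1
    for chrom in vcf_contig_runs(vcf_lines):
        if chrom not in rank or rank[chrom] <= previous:
            return False
        previous = rank[chrom]
    return True


# Made-up in-memory example; with real files, pass open file handles instead.
fai = ["1\t1000\t3\t60\t61", "2\t900\t1020\t60\t61", "10\t800\t1940\t60\t61"]
good_vcf = ["##fileformat=VCFv4.1", "1\t100\t.\tA\tT", "2\t50\t.\tG\tC", "10\t7\t.\tC\tA"]
bad_vcf = ["##fileformat=VCFv4.1", "1\t100\t.\tA\tT", "10\t7\t.\tC\tA", "2\t50\t.\tG\tC"]

ref = reference_contigs(fai)
print(follows_reference_order(good_vcf, ref))  # True
print(follows_reference_order(bad_vcf, ref))   # False
```

If the check fails for the truth VCF, re-sorting it to match the reference before running SSeq_merged.vcf2tsv.py should avoid the mislabeling described above.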