Interpreting prediction output files

bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq

http://bioinform.github.io/somaticseq/

BSD 2-Clause "Simplified" License


wjianga closed 3 years ago

wjianga commented 3 years ago

Hi,

I got the output files from a prediction-mode run, but I do not know how to interpret them.

Which file should I look at for true variants: SSeq.Classified.sSNV.tsv or SSeq.Classified.sSNV.vcf?
I checked the "TrueVariant_or_False" column in SSeq.Classified.sSNV.tsv, and it contains only "nan" values. How can I fix this?
The number of variants in SSeq.Classified.sSNV.vcf is exactly the same as in consensus.vcf. Why is that? I assumed SomaticSeq would filter out the false positives.

Thanks!

litaifang commented 3 years ago

Records in SSeq.Classified.sSNV.tsv with SCORE > 0.7 (even better if > 0.9), or records in SSeq.Classified.sSNV.vcf with the PASS label, are considered high-confidence variant calls.
TrueVariant_or_False is usually nan unless the input is training data with a truth file attached, so nothing is wrong there.
By default, every record in the .tsv file is written into the .vcf file for completeness, which is why the counts match. Filter for the "PASS" label to get the high-confidence calls.
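
For anyone who wants to script the filtering described above, here is a minimal Python sketch. It is not part of SomaticSeq. It assumes the .vcf follows the standard layout, where FILTER is the 7th tab-separated column, and that the .tsv has a header row containing a column named SCORE. The function names `pass_records` and `high_score_rows` are made up for illustration.

```python
import csv


def pass_records(vcf_lines):
    """Yield header lines plus the data lines whose FILTER column is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Keep meta-information and the column header so the output is still a VCF.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Standard VCF column order: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


def high_score_rows(tsv_lines, threshold=0.7, score_column="SCORE"):
    """Yield rows (as dicts) whose score is strictly above the threshold.

    Assumption: the TSV has a header row with a column named `score_column`.
    Rows whose score is empty or not numeric are skipped; "nan" parses as a
    float but never compares greater than the threshold, so it is dropped too.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    for row in reader:
        try:
            score = float(row[score_column])
        except (TypeError, ValueError):
            continue
        if score > threshold:
            yield row
```

Both functions accept any iterable of lines, so an open file handle works: for example, open SSeq.Classified.sSNV.vcf for reading and write each line from `pass_records(handle)` to a new file. Pass `threshold=0.9` to `high_score_rows` for the stricter cutoff mentioned above.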

wjianga commented 3 years ago

Thanks! I will try filtering the variants based on that. I will update you on how it goes.

wjianga commented 3 years ago

I have successfully filtered the variants based on the "PASS" label. Thanks!