bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

Filter truth vcf files for dbsnp? #110

Closed: iserf closed this issue 2 years ago

iserf commented 2 years ago

Hi,

I am training the classifier with samples that come with a truth VCF file containing around 370 SNVs/indels. Unfortunately, some of these true variants would normally be filtered out by their frequency in dbSNP. Do I have to exclude them from the truth set to improve the accuracy of the classifier?

Best wishes,

Flo

litaifang commented 2 years ago

Every variant in the truth VCF file is treated by the model as a real somatic mutation. If you remove those variants from the truth VCF file, they will be treated as false positives whenever they are called. If you want to exclude those variants from the model altogether, you can create a BED file excluding those positions (keep in mind that BED files are 0-based, so each line is CHR\tPOS-1\tPOS). If you think the variants in the truth VCF file are real somatic mutations, then I would leave them there.
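
For illustration, the 1-based VCF position to 0-based BED interval conversion mentioned above could be sketched roughly as follows. This is only a minimal sketch, not part of SomaticSeq: the function name `vcf_to_bed_lines` and the example records are hypothetical, and it assumes an uncompressed, tab-separated VCF. It covers only the base at POS, as in the CHR\tPOS-1\tPOS rule, so longer indels may need a wider interval. How the resulting BED file is then handed to SomaticSeq is not shown here.

```python
def vcf_to_bed_lines(vcf_lines):
    """Yield one BED line (CHR, POS-1, POS) per VCF data record.

    VCF positions are 1-based; BED starts are 0-based and ends are exclusive,
    so a single base at POS becomes the interval [POS-1, POS).
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines and VCF header lines (## metadata and #CHROM).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        yield f"{chrom}\t{pos - 1}\t{pos}"


# Hypothetical records, made up purely for this example.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t.\tPASS\t.",
    "chr2\t500\t.\tCT\tC\t.\tPASS\t.",
]

bed_lines = list(vcf_to_bed_lines(example_vcf))
print("\n".join(bed_lines))
```

With a real file, the same function could be fed an open file handle, for example `vcf_to_bed_lines(open("variants_to_exclude.vcf"))`, where the file name is again just a placeholder.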