Open DarioS opened 5 years ago
Dear DarioS, Thank you for the observation and suggestion - such examples are definitely of concern. However, we do already have an EVS feature ("SiteHomopolymerLength") for this - it is calculated slightly differently from the indel version, hence the different name. It turns out that the classifier does not make heavy use of this feature, suggesting that homopolymers (on their own) are not strongly predictive of false positives in our training data. That is, the model found that penalizing these situations heavily would, on average, filter out more (known) good than (known) bad calls. This may be a situation where better truth sets are needed.
You mention SiteHomopolymerLength, but the Supplementary Information explains that it's only used for germline SNVs, not for somatic SNVs. The example I showed above is a somatic SNV; one sample is normal and one is tumour.
Ah, thanks for pointing that out - you are correct, our somatic feature set is less well explored and could well benefit from adding such a feature.
Looking at each somatic SNV with a FILTER value of PASS, I notice that many of them are near the end of a homopolymer sequence in the reference genome. Illumina sequencing is known to have a problem in such repetitive contexts. Can this information be incorporated into the filtering or the EVS calculation? I notice in Supplementary Note 1 of the journal article introducing Strelka2 that homopolymers are not considered for SNVs but only for indels. For example,
I would not think that the SNP identified should pass filtering or have a high EVS score. The adjacent indel is filtered out because of Low EVS, as expected. I use version 2.9.9 of Strelka2.
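In the meantime, calls like this could be flagged after the fact by measuring the homopolymer run that flanks each PASS site in the reference. The Python sketch below is only an illustration of that idea. The function name `adjacent_homopolymer_length`, the 0-based coordinate convention, and the cutoff of 6 are all assumptions made for this example. It is not how Strelka2 computes SiteHomopolymerLength or any other EVS feature.

```python
def adjacent_homopolymer_length(ref_seq: str, pos: int) -> int:
    """Length of the longest homopolymer run directly flanking a site.

    Hypothetical helper, not part of Strelka2. `pos` is a 0-based index
    into `ref_seq`; the base at `pos` itself is not counted.
    """
    seq = ref_seq.upper()
    if not 0 <= pos < len(seq):
        raise IndexError("pos is outside the reference sequence")

    def run_length(start: int, step: int) -> int:
        # Walk away from the site in one direction, counting identical bases.
        if not 0 <= start < len(seq):
            return 0
        base = seq[start]
        count = 0
        i = start
        while 0 <= i < len(seq) and seq[i] == base:
            count += 1
            i += step
        return count

    upstream = run_length(pos - 1, -1)
    downstream = run_length(pos + 1, +1)
    return max(upstream, downstream)


# Toy reference: a run of eight T bases followed by the candidate SNV site.
reference = "ACGTTTTTTTTGCA"
snv_index = 11  # the G immediately after the T run (0-based)

HOMOPOLYMER_CUTOFF = 6  # arbitrary threshold chosen for illustration only
run = adjacent_homopolymer_length(reference, snv_index)
flagged = run >= HOMOPOLYMER_CUTOFF
print(run, flagged)
```

A real check would read the flanking bases from the indexed reference FASTA and convert the 1-based VCF POS to a 0-based index before calling the helper. Any cutoff would need to be tuned against a truth set, since a hard threshold could remove true calls as well as false ones.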