Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

[User Story] Artefact databases for SNVs and InDels #1377

Open mathiasbio opened 8 months ago

mathiasbio commented 8 months ago

Need

As a geneticist, I want to see true variants and not false-positive calls. Currently we have databases for annotating variants that are commonly observed as somatic in highly filtered T+N cases, as well as two databases for detected germline variants, one from BALSAMIC and the other from MIP. What is lacking is a database that collects artefacts, which otherwise unnecessarily increase the workload for a geneticist, increase TAT, and in the worst case lead to false reports.

Suggested approach

Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below:

Considered alternatives

No response

Deviation

No response

System requirements assessed

Requirements affected by this story

No response

Risk assessment needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

https://github.com/Clinical-Genomics/BALSAMIC/issues/1376

Anything else?

No response

mathiasbio commented 5 months ago

If we switch to using TNscope for the panel data as well, this type of database would probably be useful for all panels too!

mathiasbio commented 1 week ago

I have started looking into this issue already in release 16, as a potential way to address the increased number of variants in tumor-only TGA analyses since the addition of TNscope. I have done the following so far: (see sheet: https://docs.google.com/spreadsheets/d/1MjHLPSWD78rMaEP4wvJO4HAEWIx-U2eN9c27cBWohu0/edit?gid=0#gid=0)

mathiasbio commented 1 week ago

I have also tested filtering with the above database, after 20 groups were added to LoqusDB, on a clinical.filtered.pass VCF from this PR: https://github.com/Clinical-Genomics/BALSAMIC/pull/1475, specifically a myeloid case where the number of variants had almost tripled since adding TNscope.

However, even when filtering out variants that occurred as little as once in the 20 groups of merged normal BAM files, only a small subset of variants was filtered out: about 100 of the total 2000 variants.
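A minimal sketch of the filtering step described above, to make the threshold explicit. It assumes each variant has already been annotated with the number of artefact-database groups it was observed in; the field name `artefact_obs`, the function name, and the example records are all hypothetical and are not taken from BALSAMIC or LoqusDB.

```python
from typing import Dict, List


def filter_by_artefact_observations(
    variants: List[Dict], min_obs: int = 1
) -> List[Dict]:
    """Keep only variants seen fewer than `min_obs` times in the artefact database.

    `artefact_obs` is a hypothetical annotation holding the number of
    merged-normal groups in which the variant was observed.
    """
    return [v for v in variants if v.get("artefact_obs", 0) < min_obs]


# Illustrative records only; not real data from the issue.
variants = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "T", "artefact_obs": 0},
    {"chrom": "1", "pos": 200, "ref": "CT", "alt": "C", "artefact_obs": 1},
    {"chrom": "2", "pos": 300, "ref": "G", "alt": "GA", "artefact_obs": 5},
]

# min_obs=1 is the strictest setting: a single observation is enough to filter.
kept = filter_by_artefact_observations(variants, min_obs=1)
removed = len(variants) - len(kept)
```

With `min_obs=1`, any variant seen at least once in the database is dropped, which corresponds to the most aggressive filtering tested in this comment.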

Most of the variants that were added in this sample (and probably this applies to other tumor only cases too) were InDels:

[image: bar plot of variants unique to v15 and v16]

In the bar plot above, v15 corresponds to variants unique to v15 of BALSAMIC, and v16 corresponds to variants unique to the above PR, where TNscope is added and merged with the VarDict results.

And a lot of them are InDels added in homopolymer regions:

[image: InDels in homopolymer regions, with repeat-units and AF]

Here, repeat-units comes from TNscope and counts the number of repeated elements (for example, if a T is deleted, it counts how many T's are in a row), and AF is the allele frequency.

Many of these variants are likely not interesting and are probably filtered out in the matched tumor + normal analysis, which is why we don't see the increase in variants there.

As can be seen, however, the frequency of these variants is quite low, so they are unlikely to be captured even in the 7 merged WGS normal samples, which should have on average 210X coverage. And as expected, a significant number of the variants that could be filtered out by the WGS normal artefact database were of this homopolymer InDel type:
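To illustrate the kind of count described above, here is a small sketch that measures the length of the homopolymer run around a reference position. This is not TNscope's implementation; the function name and the example sequence are made up for illustration.

```python
def homopolymer_run_length(reference: str, pos: int) -> int:
    """Length of the run of identical bases containing reference[pos] (0-based)."""
    base = reference[pos]

    # Walk left while the previous base matches.
    start = pos
    while start > 0 and reference[start - 1] == base:
        start -= 1

    # Walk right while the next base matches.
    end = pos
    while end + 1 < len(reference) and reference[end + 1] == base:
        end += 1

    return end - start + 1


# Hypothetical reference snippet with six T's in a row (indices 3 to 8).
snippet = "ACGTTTTTTGA"
run = homopolymer_run_length(snippet, 4)
```

A deletion of one T inside that stretch would then be reported with a repeat count of 6 under this simple definition.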

[image: InDels with more than 5 repeat-units and their frequency in the WGS normal database]

Above, only InDels with more than 5 repeat-units are shown, along with their frequency in the WGS normal database. My conclusion is that the normal coverage after merging 7 normals at 30X each is probably not high enough to capture many artefacts, which is why I now plan to test another approach.
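The coverage argument can be sketched with a simple binomial model. The assumptions here are strong and are not from the issue: the artefact is taken to appear in the merged normal at the same per-read rate as its AF in the tumor sample, reads are treated as independent, and the caller is assumed to need at least 3 supporting reads. The AF values and that read threshold are illustrative only.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, af: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(depth, af)."""
    p_fewer = sum(
        comb(depth, i) * af**i * (1.0 - af) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer


n_normals = 7
depth_per_normal = 30
merged_depth = n_normals * depth_per_normal  # 7 x 30X = 210X, as quoted above

# Illustrative allele frequencies and a hypothetical 3-read threshold.
for af in (0.005, 0.01, 0.05):
    expected_alt = merged_depth * af
    p = prob_at_least_k_alt_reads(merged_depth, af, k=3)
    print(f"AF={af:.3f}: expected alt reads={expected_alt:.1f}, P(>=3 reads)={p:.2f}")
```

Under this model the expected number of supporting reads is simply depth times AF, so at 210X a very low-frequency artefact yields only a handful of reads, consistent with the observation that few such variants are captured.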

mathiasbio commented 1 week ago

New approach:

Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below: