Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

[User Story] Reduce number of false positive SVs by constructing an artefact database #1316

Open mathiasbio opened 1 year ago

mathiasbio commented 1 year ago

Need

The number of SVs uploaded to Scout for WGS cases in BALSAMIC is much higher than is common and should be reduced. As an example, here are the SV counts from the most recent WGS T/N cases in BALSAMIC: X: 3446, Y: 3752, Z: 3689, A: 3575, B: 3008, C: 3086. Case names: https://docs.google.com/document/d/1ZhiK9u7Ep5oHdAjpm0j85yTZey3Kzq2OtW8clig-I0A/edit

During the latest ILC (https://github.com/Clinical-Genomics/External-comparison/issues/37) there was an early report on the comparisons between the different participants, and we really stood out on this, as can be seen in this plot that I screenshotted during the organizers' presentation. These were also WGS T/N cases:

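[Image: ILCSVS] https://user-images.githubusercontent.com/14308912/279948613-a738c4b6-945d-4315-b7b1-0b2270cce735.png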

Suggested approach

The safest approach would probably be to build a database of artefact SVs and, ideally, to build it by calling on normal samples, which reduces the risk of filtering out real variants when the database is used.

However, if the database is built on normal samples, there is a chance that artefact SVs that for whatever reason have a low AF will be missing from the database but present in the tumor samples, due to the 3X difference in coverage depth between these sample types.
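To put rough numbers on the coverage concern, here is a minimal sketch that treats the number of reads supporting an artefact as binomially distributed. The depths (30X normal, 90X tumor), the artefact AF, and the minimum number of supporting reads are all assumed values for illustration, not BALSAMIC settings, and real SV callers use more evidence than a simple read count.

```python
from math import comb


def prob_at_least(depth: int, af: float, min_reads: int) -> float:
    """P(X >= min_reads) for X ~ Binomial(depth, af)."""
    below = sum(
        comb(depth, k) * af**k * (1 - af) ** (depth - k) for k in range(min_reads)
    )
    return 1.0 - below


# All values below are assumptions for illustration only.
NORMAL_DEPTH = 30  # assumed normal coverage
TUMOR_DEPTH = 90   # assumed tumor coverage (3X the normal)
ARTEFACT_AF = 0.03  # assumed low-AF artefact
MIN_READS = 3       # assumed minimum supporting reads for a call

p_normal = prob_at_least(NORMAL_DEPTH, ARTEFACT_AF, MIN_READS)
p_tumor = prob_at_least(TUMOR_DEPTH, ARTEFACT_AF, MIN_READS)
print(f"P(called in normal) = {p_normal:.3f}")
print(f"P(called in tumor)  = {p_tumor:.3f}")
```

Under these assumptions the same artefact is far more likely to reach the calling threshold in the tumor than in the normal, which is the gap a normal-only database could miss.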

Perhaps the database should be built on tumor samples after all, but we would then need to include enough samples, from different cancer types, to reduce the risk of common cancer variants reaching a high AF in the database that makes them indistinguishable from artefacts.
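As a rough illustration of the database idea, the sketch below clusters SV calls from a panel of samples, records the fraction of samples carrying each cluster, and removes case calls that match a frequently seen entry. The `SV` class, the 100 bp breakpoint tolerance, and the 0.5 frequency cutoff are hypothetical choices for this example, not existing BALSAMIC code or settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    svtype: str


def same_event(a: SV, b: SV, tol: int = 100) -> bool:
    """Two calls match if type and chromosome agree and both breakpoints are within tol bp."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= tol
        and abs(a.end - b.end) <= tol
    )


def build_artefact_db(samples: list[list[SV]], tol: int = 100) -> list[tuple[SV, float]]:
    """Return (representative SV, fraction of panel samples carrying it)."""
    clusters: list[tuple[SV, set[int]]] = []
    for idx, calls in enumerate(samples):
        for sv in calls:
            for rep, seen in clusters:
                if same_event(rep, sv, tol):
                    seen.add(idx)
                    break
            else:
                clusters.append((sv, {idx}))
    return [(rep, len(seen) / len(samples)) for rep, seen in clusters]


def filter_calls(
    calls: list[SV], db: list[tuple[SV, float]], max_frequency: float = 0.5, tol: int = 100
) -> list[SV]:
    """Drop calls matching a database entry seen in more than max_frequency of samples."""
    recurrent = [rep for rep, freq in db if freq > max_frequency]
    return [sv for sv in calls if not any(same_event(sv, rep, tol) for rep in recurrent)]


# Toy panel: one deletion recurs in every sample, one duplication is seen once.
panel = [
    [SV("1", 1000, 2000, "DEL"), SV("2", 500, 900, "DUP")],
    [SV("1", 1010, 1990, "DEL")],
    [SV("1", 995, 2005, "DEL")],
]
db = build_artefact_db(panel)

case_calls = [
    SV("1", 1003, 1998, "DEL"),       # matches the recurrent panel deletion
    SV("7", 140000, 150000, "INV"),   # not in the panel
    SV("2", 505, 905, "DUP"),         # in the panel, but below the cutoff
]
kept = filter_calls(case_calls, db)
print(kept)
```

The frequency cutoff is where the tumor-versus-normal trade-off shows up: with a tumor-based panel, a common cancer variant could exceed the cutoff and be filtered just like an artefact.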

Considered alternatives

Requiring that variants be found by 2 or more variant callers would possibly reduce the number of variants greatly. However, this could increase the risk of filtering out true variants that a particular caller is uniquely good at calling. Perhaps some mix of this and the database could be used, however!
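A minimal sketch of the caller-consensus alternative, assuming calls from each caller are already parsed into simple records. The caller names, the `SV` tuple, and the 100 bp matching tolerance are hypothetical and only serve the example.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    svtype: str


def matches(a: SV, b: SV, tol: int = 100) -> bool:
    """Same chromosome and type, with both breakpoints within tol bp."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= tol
        and abs(a.end - b.end) <= tol
    )


def consensus_calls(
    calls_by_caller: dict[str, list[SV]], min_callers: int = 2, tol: int = 100
) -> list[tuple[SV, set[str]]]:
    """Merge matching calls across callers and keep those seen by >= min_callers."""
    merged: list[tuple[SV, set[str]]] = []
    for caller, calls in calls_by_caller.items():
        for sv in calls:
            for rep, callers in merged:
                if matches(rep, sv, tol):
                    callers.add(caller)
                    break
            else:
                merged.append((sv, {caller}))
    return [(rep, callers) for rep, callers in merged if len(callers) >= min_callers]


# Hypothetical caller names and calls.
calls = {
    "caller_a": [SV("1", 1000, 2000, "DEL"), SV("3", 50, 80, "DUP")],
    "caller_b": [SV("1", 1005, 1995, "DEL")],
    "caller_c": [SV("9", 10, 20, "INV")],
}
supported = consensus_calls(calls)
for rep, callers in supported:
    print(rep, sorted(callers))
```

One way to mix the two ideas would be to pass consensus calls through untouched and apply the artefact database only to single-caller calls, so a caller's unique true positives are not dropped outright.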

Deviation

No response

Risk assessment

Risk assessment link

No response

System requirements assessed

Requirements affected by this story

No response

Can be closed when

No response

Blockers

https://github.com/Clinical-Genomics/BALSAMIC/issues/1376

Anything else?

No response

vwirta commented 1 year ago

Thanks Mathias, I agree that we really need to understand why we call so many more variants than the other labs in the ILC.

Do you know if we have a detailed description of their SV calling and filtering approaches? With this large difference, we should be able to pinpoint at least some differences.

On 2 Nov 2023, at 10:44, Mathias Johansson wrote: [quoted issue description, as above]