Open mathiasbio opened 1 year ago
Thanks, Mathias. I agree that we really need to understand why we call so many more variants than the other labs in the ILC.
Do you know if we have a detailed description of their SV calling and filtering approaches? With such a large difference, we should be able to pinpoint at least some of the differences in approach.
On 2 Nov 2023, at 10:44, Mathias Johansson wrote:
Issue: https://github.com/Clinical-Genomics/BALSAMIC/issues/1316
Need
The number of SVs uploaded to Scout for WGS cases in BALSAMIC is much higher than is common, and would really benefit from being reduced. As an example, here are the SV counts from the most recent WGS T/N cases in BALSAMIC: X: 3446, Y: 3752, Z: 3689, A: 3575, B: 3008, C: 3086. Case names: https://docs.google.com/document/d/1ZhiK9u7Ep5oHdAjpm0j85yTZey3Kzq2OtW8clig-I0A/edit
During the latest ILC (https://github.com/Clinical-Genomics/External-comparison/issues/37), the organizers presented an early report comparing the different participants, and we really stood out on this, as can be seen in this plot that I took a screenshot of during their presentation. These were also WGS T/N cases: https://user-images.githubusercontent.com/14308912/279948613-a738c4b6-945d-4315-b7b1-0b2270cce735.png
Suggested approach
The safest approach would probably be to build a database of artefact SVs. Ideally, to reduce the risk of filtering out real variants when using this database, it would be built from calls on normal samples.
However, if the database is built on normal samples, there is a chance that artefact SVs that for whatever reason have a low AF will be missing from the database but still present in the tumor samples, because of the 3X difference in coverage depth between the two sample types.
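To make the coverage concern concrete, here is a rough back-of-the-envelope sketch. It assumes supporting reads follow a binomial distribution, that the tumor is the deeper sample, and it uses made-up values (30x normal, 90x tumor, an artefact at 5% AF, a caller needing at least 4 supporting reads). None of these numbers come from BALSAMIC or its callers; only the 3X ratio is from the issue.

```python
from math import comb


def prob_detected(depth: int, af: float, min_support: int) -> float:
    """P(at least `min_support` supporting reads), reads ~ Binomial(depth, af)."""
    p_below = sum(
        comb(depth, k) * af**k * (1 - af) ** (depth - k)
        for k in range(min_support)
    )
    return 1.0 - p_below


# All values below are illustrative assumptions, not pipeline settings.
NORMAL_DEPTH = 30    # assumed normal coverage
TUMOR_DEPTH = 90     # assumed tumor coverage (3X the normal)
ARTEFACT_AF = 0.05   # assumed allele fraction of a recurrent artefact
MIN_SUPPORT = 4      # assumed minimum supporting reads for a call

p_normal = prob_detected(NORMAL_DEPTH, ARTEFACT_AF, MIN_SUPPORT)
p_tumor = prob_detected(TUMOR_DEPTH, ARTEFACT_AF, MIN_SUPPORT)

print(f"P(called in normal) = {p_normal:.3f}")
print(f"P(called in tumor)  = {p_tumor:.3f}")
```

Under these assumed values the artefact would be called in fewer than 10% of normals but in more than half of tumors, so a normal-only database could easily under-represent it.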
Perhaps the database should be built on tumor samples after all, but then we would need to make sure that we include enough samples, from different cancer types, to reduce the risk of common cancer variants reaching such a high AF in the database that they become indistinguishable from artefacts.
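As a minimal sketch of the database idea, the snippet below counts in how many samples each SV recurs and drops calls that match a frequent entry. The `SV` record, the 100 bp breakpoint tolerance, and the frequency threshold are all hypothetical choices for illustration. A real implementation would more likely rely on an existing SV database tool than on this naive pairwise matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    svtype: str


def same_event(a: SV, b: SV, tol: int = 100) -> bool:
    """Treat two SVs as the same event if type, chromosome and both breakpoints agree within `tol` bp."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= tol
        and abs(a.end - b.end) <= tol
    )


def build_artefact_db(samples: list[list[SV]], tol: int = 100) -> list[tuple[SV, float]]:
    """Return (representative SV, fraction of samples carrying it) pairs."""
    reps: list[SV] = []
    counts: list[int] = []
    for calls in samples:
        seen: set[int] = set()  # entries already counted for this sample
        for sv in calls:
            for i, rep in enumerate(reps):
                if same_event(sv, rep, tol):
                    idx = i
                    break
            else:  # no existing entry matched: start a new one
                reps.append(sv)
                counts.append(0)
                idx = len(reps) - 1
            if idx not in seen:
                counts[idx] += 1
                seen.add(idx)
    return [(rep, count / len(samples)) for rep, count in zip(reps, counts)]


def filter_calls(calls: list[SV], db: list[tuple[SV, float]],
                 min_artefact_freq: float = 0.5, tol: int = 100) -> list[SV]:
    """Drop calls that match a database entry seen in >= `min_artefact_freq` of samples."""
    return [
        sv for sv in calls
        if not any(freq >= min_artefact_freq and same_event(sv, rep, tol)
                   for rep, freq in db)
    ]


# Toy data: one recurrent artefact in all three samples, one private event.
artefact = SV("1", 10_000, 12_000, "DEL")
private = SV("2", 50_000, 60_000, "DUP")
db_samples = [[artefact], [SV("1", 10_020, 11_990, "DEL"), private], [artefact]]
db = build_artefact_db(db_samples)

tumor_calls = [SV("1", 10_010, 12_005, "DEL"), SV("7", 1_000, 9_000, "INV"), private]
kept = filter_calls(tumor_calls, db)
print(kept)
```

The threshold is where the normal-versus-tumor trade-off shows up: with a tumor-built database, a recurrent cancer variant would accumulate frequency exactly like an artefact does here.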
Considered alternatives
Requiring that a variant be found by 2 or more variant callers would probably reduce the number of variants greatly. However, this could increase the risk of filtering out true variants that a particular caller is uniquely good at calling. Perhaps some mix of this and the database approach could be used!
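A rough sketch of the consensus alternative could look like the following. The caller names, the `SV` record, and the 100 bp matching tolerance are placeholders for illustration and do not reflect how BALSAMIC merges its callers' output.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    svtype: str


def same_event(a: SV, b: SV, tol: int = 100) -> bool:
    """Same type and chromosome, both breakpoints within `tol` bp."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= tol
        and abs(a.end - b.end) <= tol
    )


def consensus_filter(calls_by_caller: dict[str, list[SV]],
                     min_callers: int = 2,
                     tol: int = 100) -> list[tuple[SV, set[str]]]:
    """Merge calls across callers and keep events reported by >= `min_callers` callers."""
    merged: list[tuple[SV, set[str]]] = []
    for caller, calls in calls_by_caller.items():
        for sv in calls:
            for rep, callers in merged:
                if same_event(sv, rep, tol):
                    callers.add(caller)  # another caller supports this event
                    break
            else:
                merged.append((sv, {caller}))
    return [(rep, callers) for rep, callers in merged if len(callers) >= min_callers]


# Hypothetical caller names and toy calls.
calls_by_caller = {
    "caller_a": [SV("1", 10_000, 12_000, "DEL"), SV("3", 5_000, 8_000, "DUP")],
    "caller_b": [SV("1", 10_020, 11_980, "DEL")],
    "caller_c": [SV("7", 1_000, 9_000, "INV")],
}
consensus = consensus_filter(calls_by_caller)
print(consensus)
```

In this toy run the single-caller DUP and INV are dropped, which is exactly the risk described above. One possible mix would be to check single-caller events against an artefact database instead of discarding them outright.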
Deviation
No response
Risk assessment
Risk assessment link
No response
System requirements assessed
Requirements affected by this story
No response
Can be closed when
No response
Blockers
https://github.com/Clinical-Genomics/BALSAMIC/issues/1376
Anything else?
No response