Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

Increased number of SVs in versions after 9.0.1 #1118

Closed mathiasbio closed 1 year ago

mathiasbio commented 1 year ago

Is your feature request related to a problem? Please describe.

I'm not sure whether this is a relevant issue, but I thought I would bring it up for discussion.

Context: In a GMS-BT meeting it was noted that for case chiefgull (run with 11.2.0; a re-analysis of masterflea, which was run with 9.0.1) the number of PASS variants in the final SV VCF uploaded to Scout had increased from 197 to 8404.

This raised the question of why the numbers had increased so much. I learned that 8032 of the variants unique to this re-analysis came from TIDDIT, which was added to the WGS workflow in version 10.0.0 (https://github.com/Clinical-Genomics/BALSAMIC/pull/947).

To see whether this was just an outlier, I checked a few other cases from before and after the addition of TIDDIT. The table below summarises the number of variants in the final SV VCF with filter PASS (the PASS column) and the number of those PASS variants that come from TIDDIT (the PASS + TIDDIT column), for a few cases in versions 9.0.1, 10.0.5 and 11.2.0 (the current latest version).

In summary, TIDDIT seems to add a large number of SVs in many cases.

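| Version | Case | PASS | PASS + TIDDIT |
|---|---|---|---|
| 9.0.1 | fleetearwig | 616 | 0 |
| 9.0.1 | betterbeagle | 662 | 0 |
| 9.0.1 | exactmole | 1059 | 0 |
| 9.0.1 | fairant | 781 | 0 |
| 9.0.1 | likedguinea | 222 | 0 |
| 9.0.1 | notedstork | 1871 | 0 |
| 9.0.1 | uphornet | 137 | 0 |
| 10.0.5 | firmraptor | 16832 | 13883 |
| 10.0.5 | frankmagpie | 14916 | 13497 |
| 10.0.5 | dearboa | 16385 | 15499 |
| 10.0.5 | jointmako | 14847 | 14597 |
| 10.0.5 | crackbaboon | 14473 | 14242 |
| 10.0.5 | quickgoat | 15489 | 15098 |
| 10.0.5 | novelbream | 19669 | 15212 |
| 11.2.0 (clinical SV VCF) | expertsatyr | 25508 | 1410 |
| 11.2.0 (clinical SV VCF) | amplewasp | 31941 | 1474 |
| 11.2.0 (clinical SV VCF) | ableant | 7153 | 7011 |
| 11.2.0 (clinical SV VCF) | topsdonkey | 8106 | 7959 |
| 11.2.0 (clinical SV VCF) | suiteddrake | 10958 | 8292 |
| 11.2.0 (clinical SV VCF) | hardyweevil | 8101 | 6739 |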
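
To make the TIDDIT contribution easier to compare across cases, the share of PASS variants that come from TIDDIT can be computed directly from the table. The snippet below is only a small sketch using a few rows copied from the table; the function name `tiddit_share` is made up for illustration.

```python
# (case, PASS, PASS + TIDDIT) rows copied from the table in this issue.
COUNTS = {
    "10.0.5": [("firmraptor", 16832, 13883), ("jointmako", 14847, 14597)],
    "11.2.0": [("expertsatyr", 25508, 1410), ("ableant", 7153, 7011)],
}


def tiddit_share(pass_total, pass_tiddit):
    """Fraction of PASS SVs in the final VCF that were called by TIDDIT."""
    if pass_total == 0:
        return 0.0
    return pass_tiddit / pass_total


for version, rows in COUNTS.items():
    for case, pass_total, pass_tiddit in rows:
        share = tiddit_share(pass_total, pass_tiddit)
        print(f"{version} {case}: {share:.1%} of PASS SVs come from TIDDIT")
```

The share varies a lot between cases (compare expertsatyr and ableant in 11.2.0), so any filtering threshold would probably need to be checked per case rather than assumed from an average.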

In the VCF there is a per-variant value indicating how many files the variant was observed in, probably taken from the SVDB merge step. However, this value is not available for filtering in Scout, nor is any other quality-based metric that could reduce the number of variants to a manageable amount for interpretation.
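
As a rough illustration of what filtering on that per-variant value could look like outside Scout, here is a minimal Python sketch. It assumes the count is stored in an INFO key called `FOUNDBY`; that key name is an assumption and should be checked against the header of the merged VCF. The function names and the example records are made up for illustration.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict (flags get an empty value)."""
    parsed = {}
    for item in info.split(";"):
        if item:
            key, _, value = item.partition("=")
            parsed[key] = value
    return parsed


def pass_records(vcf_lines, info_key="FOUNDBY", min_files=2):
    """Yield PASS records seen in at least `min_files` input files.

    `info_key` is an assumed INFO key name; records lacking the key
    are treated as having been seen in a single file.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[6] != "PASS":
            continue
        info = parse_info(fields[7])
        if int(info.get(info_key, 1)) >= min_files:
            yield fields


# Made-up records, not taken from any real case.
EXAMPLE = [
    "##fileformat=VCFv4.2",
    "1\t100\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;FOUNDBY=2",
    "1\t200\t.\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;FOUNDBY=1",
    "2\t300\t.\tN\t<DEL>\t.\tLowQual\tSVTYPE=DEL;FOUNDBY=3",
]
print(len(list(pass_records(EXAMPLE))), "record(s) kept")
```

Requiring support from more than one file would also drop every TIDDIT-only call, so this is a sketch of the mechanism rather than a recommended threshold.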

Describe the solution you'd like

Either more filtering of the SV variants before upload to Scout, or more options for manual filtration in Scout, in which case we need to identify good parameters to filter by.

SOMATICSCORE, which we're planning to introduce to Scout (https://github.com/Clinical-Genomics/BALSAMIC/issues/1107), is only available for variants called with Manta and would not enable us to filter TIDDIT variants.

Describe alternatives you've considered

Is TIDDIT necessary? Why was it introduced?

Current BALSAMIC version

11.2.0 (from balsamic --version)

mathiasbio commented 1 year ago

I spoke to Jesper about TIDDIT and there were two main conclusions, each with a fairly simple implementation that would probably reduce the number of variants significantly:

  1. Apparently we are calling SVs on both the normal and the tumor, but we are not filtering out SVs that are present in the normal sample. In essence we are just adding the normal variants to the tumor variants, when the point is to use the normal variants to filter down to the somatic ones.
  2. For BNDs, TIDDIT calls two variants for each mutation, roughly the forward and the reverse version of the same variant. This means we could keep one variant per mutation and probably remove a couple of thousand additional variants before upload to Scout.
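
The two ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how BALSAMIC implements it: each record is a plain dict, the `origins` field (which samples the call was seen in) is hypothetical, and the two BND records of a pair are assumed to point at each other's exact positions in the ALT field.

```python
import re

# Matches the mate position inside a VCF breakend ALT such as N[chr5:500[
BND_ALT = re.compile(r"[\[\]](.+):(\d+)[\[\]]")


def breakend_key(chrom, pos, alt):
    """Return an orientation-independent key for a BND, or None if not a BND."""
    match = BND_ALT.search(alt)
    if match is None:
        return None
    mate = (match.group(1), int(match.group(2)))
    # Sorting the two ends gives the forward and reverse record the same key.
    return tuple(sorted([(chrom, pos), mate]))


def filter_somatic_svs(records, normal_label="normal"):
    """Drop calls also seen in the normal and keep one record per BND pair."""
    seen_breakends = set()
    kept = []
    for rec in records:
        if normal_label in rec["origins"]:  # hypothetical field
            continue
        key = breakend_key(rec["chrom"], rec["pos"], rec["alt"])
        if key is not None:
            if key in seen_breakends:
                continue
            seen_breakends.add(key)
        kept.append(rec)
    return kept


# Made-up records for illustration only.
records = [
    {"chrom": "chr1", "pos": 100, "alt": "N[chr5:500[", "origins": {"tumor"}},
    {"chrom": "chr5", "pos": 500, "alt": "]chr1:100]N", "origins": {"tumor"}},
    {"chrom": "chr2", "pos": 300, "alt": "<DEL>", "origins": {"tumor", "normal"}},
    {"chrom": "chr3", "pos": 700, "alt": "<DUP>", "origins": {"tumor"}},
]
print(len(filter_somatic_svs(records)), "of", len(records), "records kept")
```

In real data the two ends of a pair may not match to the base, so a position tolerance or the mate ID written by the caller would probably be a safer way to pair them.
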
fevac commented 1 year ago

Nice find! 🕵️

pbiology commented 1 year ago

Fixed with #1120