Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

SV variants "too long to annotate" from VEP are filtered out #1123

Closed mathiasbio closed 1 year ago

mathiasbio commented 1 year ago

Describe the bug

Looking at WGS tumor / normal case "notedshark", SV calls are disappearing during the "vep_somatic_sv" rule. The number of variants in the VCF files before and after is different:

vcf/SV.somatic.notedshark.svdb.research.vcf.gz variants: 16152
vep/SV.somatic.notedshark.svdb.research.vcf.gz variants: 15844

This is a difference of 308 variants, which exactly corresponds to the number of warnings in the vep_somatic_sv stderr output from balsamic:

grep WARNING BALSAMIC.notedshark.vep_somatic_sv.5.sh_4554097.err | wc -l = 308

Looking for an individual variant from the list of warnings, such as "MantaDUP:TANDEM:335100:0:2:0:0:0:manta|DUP00079774:dellysv", you can find it in the VCF before, but not after, the vep_somatic_sv rule.
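The before/after comparison described above can be sketched in Python. This is only an illustration, not part of BALSAMIC: the function names are hypothetical, and it assumes that variant IDs (the third VCF column) are unique and that the annotation step leaves the ID column unchanged.

```python
import gzip


def read_variant_ids(vcf_path):
    """Collect the ID column (3rd field) of every record in a VCF.

    Handles plain-text and gzip/bgzip-compressed files, chosen by extension.
    """
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    ids = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip header lines and empty lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            ids.add(fields[2])
    return ids


def missing_after_annotation(before_vcf, after_vcf):
    """Return the sorted IDs present before annotation but absent after it."""
    return sorted(read_variant_ids(before_vcf) - read_variant_ids(after_vcf))
```

Called with the vcf/ and vep/ files named above, the length of the returned list could be compared against the number of WARNING lines in the stderr log.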

I can only conclude that variants which trigger this warning in VEP are, in effect, filtered out. This should not be the intended behaviour, and seems very urgent to fix!

Version: balsamic --version 11.2.0

Additional context: Perhaps this fix can be included in the new release of balsamic: https://github.com/orgs/Clinical-Genomics/projects/46. It could perhaps be added as an extra PR in this issue: https://github.com/Clinical-Genomics/BALSAMIC/issues/1119

mathiasbio commented 1 year ago

It seems that this issue has been discussed in https://github.com/Ensembl/ensembl-vep/issues/600 and fixed in the release >109, so that variants that are too long to annotate are still written to the output instead of being filtered out.

Perhaps a simple first step, then, is just to update VEP, which could solve this issue, hopefully without the update causing any other issues.

fevac commented 1 year ago

+1 VEP update

mathiasbio commented 1 year ago

Another option, perhaps a better one, would be to simply raise --max_sv_size to the size of chromosome 1, 248956422, as they have done in MIP. Then we'd not only preserve the variants in the VCF, but also have them annotated as intended!

There must have been a reason why max_sv_size was included as an option in the first place, and I think I read it had something to do with memory requirements for longer SVs as mentioned here https://github.com/Ensembl/ensembl-vep/pull/463.

It still seems like a good idea to increase max_sv_size; if it works for MIP, it probably works for us too. I think I will begin by testing whether increasing max_sv_size solves the issue without causing memory problems.
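To get a feel for which calls would be affected by the limit, here is a minimal Python sketch. It is an assumption-laden illustration, not VEP's own logic: it takes the span to be END - POS + 1 from the INFO field, uses the 10 Mb limit mentioned in this thread as the current value, and the helper names are hypothetical.

```python
import re

# Limit currently in effect, as discussed in this thread (assumed: 10 Mb).
CURRENT_MAX_SV_SIZE = 10_000_000
# Value proposed in this thread: the size of chromosome 1, as used in MIP.
PROPOSED_MAX_SV_SIZE = 248_956_422


def sv_span(pos, info):
    """Return END - POS + 1 for a record, or None if INFO carries no END key."""
    match = re.search(r"(?:^|;)END=(\d+)", info)
    if match is None:
        return None
    return int(match.group(1)) - pos + 1


def exceeds_limit(pos, info, max_sv_size):
    """True if the record has an END and its span is larger than max_sv_size."""
    span = sv_span(pos, info)
    return span is not None and span > max_sv_size
```

For example, a duplication at position 500 with END=90000000 exceeds CURRENT_MAX_SV_SIZE but not PROPOSED_MAX_SV_SIZE, so under these assumptions it would no longer trigger the warning with the larger limit.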

mathiasbio commented 1 year ago

We need to discuss whether it is preferable to annotate these long variants or to keep them unannotated. As far as I understand it:

  1. Against annotation: For interpreting a specific case there is little use for knowing exactly which genes these enormous variants overlap.
  2. Against annotation: @khurrammaqbool mentioned something about the reports being messy in Scout if you include one such variant in the delivery report. (I may have misunderstood this)
  3. For annotation: If an in silico gene panel is used for filtering SVs in Scout, unannotated variants such as these, which may overlap genes in the panel, will be filtered out.

Is there a clear choice in this matter?

mathiasbio commented 1 year ago

If we want to solve this issue in https://github.com/Clinical-Genomics/BALSAMIC/issues/1119, it seems that the fastest solution is simply to annotate variants even when they are larger than 10 Mb. Updating VEP may require too much work to verify that everything still works as it should.

I have spoken to Chiara and Daniel Nilsson about uploading variants which have been annotated up to the size of Chrom 1, and while it doesn't seem very useful to have these annotations, they did not consider the annotation itself problematic.

I will test uploading the SVs for a WGS T / N case, including annotations even for the variants larger than 10 Mb.

mathiasbio commented 1 year ago

After discussions it has been agreed to include this fix in the next release. The PR has been merged into develop, so I think I can close this issue now.