Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License
44 stars 16 forks source link

Remove MAX AF >= 1 filter from bcftools SNV and InDel filters in all workflows #1180

Closed mathiasbio closed 9 months ago

mathiasbio commented 1 year ago

Need

Background to this feature can be found here: https://github.com/Clinical-Genomics/BALSAMIC/issues/1166

In short the need is to remove this filter bcftools filter--include FORMAT/AF[0] < 1 --soft-filter balsamic_af_one --mode + which occurs in all workflows for filtering of SNVs and InDels, used here: in sentieon_quality_filter.rule

This should be removed because it's a risk, especially in very pure and somewhat lower-coverage tumor-samples, that a true somatic variant may reach 1 AF and be filtered out.

Suggested approach

The suggested approach is simply to remove it from the rule. However, we want to make sure first that the filter is not terribly important for filtering out false positives so that we can if necessary make a more sophisticated solution.

Considered alternatives

Were there alternative approaches which have been rejected?

Requests/suggestions/bugs solved by the feature

Can be closed when

Link the issues needed to be closed for this to be implemented

Blockers

Anything preventing this from happening?

mathiasbio commented 1 year ago

On raw VCFs generated from vardict and TNscope with the most recent version of balsamic 12.0.2 I have run bcftools the same way we do in the TGA and WGS workflow.

For TGA: singularity exec varcall_py3.sifbcftools filter --include 'INFO/AF < 1' --soft-filter 'balsamic_af_one' --mode '+' -o ${1}_balsamic_af_one.vcf $1

For WGS: singularity exec varcall_py3.sif bcftools filter --include 'FORMAT/AF[0] < 1' --soft-filter 'balsamic_af_one' --mode '+' -o ${1}_balsamic_af_one.vcf $1

analysis_type | filename | total variants | balsamic_af_one | percent balsamic_af_one -- | -- | -- | -- | -- WES-T | SNV.somatic.case.vardict.vcf.gz | 1416117 | 9435 | 0,67% WES-T | SNV.somatic.case.vardict.vcf.gz | 125365 | 22618 | 18,04% WES-T | SNV.somatic.case.vardict.vcf.gz | 193787 | 19590 | 10,11% TGA-T | SNV.somatic.case.vardict.vcf.gz | 39088 | 103 | 0,26% TGA-T | SNV.somatic.case.vardict.vcf.gz | 50362 | 94 | 0,19% TGA-T | SNV.somatic.case.vardict.vcf.gz | 38119 | 109 | 0,29% TGA-T | SNV.somatic.case.vardict.vcf.gz | 31539 | 146 | 0,46% TGA-TN | SNV.somatic.case.vardict.vcf.gz | 185420 | 112 | 0,06% WGS-T | SNV.somatic.case.tnscope.vcf.gz | 5751184 | 1735967 | 30,18% WGS-T | SNV.somatic.case.tnscope.vcf.gz | 5701450 | 1673594 | 29,35% WGS-TN | SNV.somatic.case.tnscope.vcf.gz | 186947 | 161 | 0,09% WGS-TN | SNV.somatic.case.tnscope.vcf.gz | 209264 | 189 | 0,09% WGS-TN | SNV.somatic.case.tnscope.vcf.gz | 170847 | 175 | 0,10% WGS-TN | SNV.somatic.case.tnscope.vcf.gz | 180961 | 165 | 0,09% Conclusion from this testing is that it seems that the main purpose of this filter has been to remove germline-variants from tumor-only analyses with low to moderately low coverage (WGS and WES). For tumor only TGA the coverage is probably so high that purely by chance it is unlikely to have exactly 100% of let's say 1000X reads in a position to have the same base. It would be nice to see what happens with these variants with AF = 1 downtream in the filtering in the tumor only analyses to see if they are filtered out by the population and LoqusDB databases. I see two scenarios here: 1. The majority of these variants in tumor only analyses are filtered out in downstream filters, at which point I would recommend removing this filter for all types of analyses! Tumor only and tumor + normal. 2. There's a lot of AF=1 variants left despite the downstream filters in the tumor only analyses, at which point we might need to consider implementing some additional filters before removing this in tumor only, but I still really don't like keeping this filter...but maybe it's necessary...but at least for the TN analyses this can be removed! In the original issue: https://github.com/Clinical-Genomics/BALSAMIC/issues/1166 I wrote that if the original idea with the filter was to remove variants with an illegal AF above 1 then the filter could be changed to `--exclude 'FORMAT/AF[0] > 1.0' --soft-filter balsamic_af_one --mode +` I also tested changing the original filter to this, and confirmed that no variants with an AF of 1 was filtered out, I modified manually a VCF to have an AF above 1, at which point it was filtered out by this filter. So if we want to keep the filter for the purpose of filtering out variants with an illegal AF changing the filter in this way would work, however, I did not see any real variant in either Vardict or TNscope with an AF larger than 1. So I think the question should be focused on either saving or keep removing the variants with AF = 1.
mathiasbio commented 1 year ago

I'd also like to note that in the last NIQAS3 the clinically relevant variant was a SNV with roughly 0.996 AF, which means it was very close to being filtered out with this filter.

vwirta commented 1 year ago

I'd also like to note that in the last NIQAS3 the clinically relevant variant was a SNV with roughly 0.996 AF, which means it was very close to being filtered out with this filter.

Thanks @mathiasbio for great summary. I agree in that this filter is something we really need to address. We should have a solid reasoning for why we include a filter.

Do you know anything about the NIQAS3 sample? For example, what was the tumour cell fraction in the sample? The reason I am asking is that I have really hard time to understand in what biological context a somatic variant could have VAF=1. It could of course be a measuring artefact (random sampling bias), but this should be very rare. A high VAF could also be explained by the presence of a CNA in the T sample for the region of the variant, but in this case the VAF should approach 1 (but never reach it).

mathiasbio commented 1 year ago

@vwirta I don't remember exactly! But I remember something about overlapping CNVs in the region, I could ask Fulya. I also think it's very rare that a somatic variant would have an AF of 1, but Kalle brought my attention to this problem to begin with based on some ILC (https://github.com/Clinical-Genomics/External-comparison/issues/22 I think) where we missed some 100% VAF. Probably happens very rarely, so we don't need to get too anxious, but it would be nice to be sure : )

vwirta commented 1 year ago

I've asked Fulya regarding the NIQAS3 case. Good point with the ILC. This was the one organised by Gustav Roussey, and I think it would be good to follow up with them as well regarding the missed variants. @AnnaLeinfelt Do you have the final report from the ILC? I can't find it in my email box.

AnnaLeinfelt commented 1 year ago

I've asked Fulya regarding the NIQAS3 case. Good point with the ILC. This was the one organised by Gustav Roussey, and I think it would be good to follow up with them as well regarding the missed variants. @AnnaLeinfelt Do you have the final report from the ILC? I can't find it in my email box.

No, I don't. I was in contact earlier and then the results was about to be submitted. I'll contact them again. This slipped my mind.

mathiasbio commented 9 months ago

Also wrote this in the PR: https://github.com/Clinical-Genomics/BALSAMIC/pull/1338

sequencing type T/TN case # PASS variants release 13 # PASS variants this PR # additional PASS variants % extra PASS variants
WGS T civilsole 33051 33051 0 0
WGS T firstviper 54072 54072 0 0
WGS TN fleetjay 7189 7190 1 0,01%
TGA T setamoeba 2415 2416 1 0,04%
TGA TN unitedbeagle 1573 1573 0 0
TGA UMI T (TNscope) equalbug 158 158 0 0
TGA UMI T (VarDict) equalbug 70 70 0 0
TGA UMI TN (TNscope) uphippo 124 124 0 0
TGA UMI TN (VarDict) uphippo 105 105 0 0

Note that likely many variants would have been added in the T-only WGS cases if at the same time the T-only WGS specific filter bcftools filter --threads {threads} --include 'FORMAT/ALT_F1R2 > {params.strand_reads[0]} && (FORMAT/ALT_F1R2 > 0 && FORMAT/ALT_F2R1 > {params.strand_reads[0]} && FORMAT/REF_F1R2 > {params.strand_reads[0]} && FORMAT/REF_F2R1 > {params.strand_reads[0]})' --soft-filter '{params.strand_reads[1]}' --mode '+' was also removed. But as this filter also requires that some reads support a reference variant, then removing the MAX_AF 1 filter in this analysis type has no real effect.

I think we can conclude with these stats that it is safe to remove the MAX AF filter, and we can postpone extending this fix to the WGS Tumor only analysis for later.

vwirta commented 9 months ago

Also wrote this in the PR: #1338

sequencing type T/TN case # PASS variants release 13 # PASS variants this PR # additional PASS variants % extra PASS variants WGS T civilsole 33051 33051 0 0 WGS T firstviper 54072 54072 0 0 WGS TN fleetjay 7189 7190 1 0,01% TGA T setamoeba 2415 2416 1 0,04% TGA TN unitedbeagle 1573 1573 0 0 TGA UMI T (TNscope) equalbug 158 158 0 0 TGA UMI T (VarDict) equalbug 70 70 0 0 TGA UMI TN (TNscope) uphippo 124 124 0 0 TGA UMI TN (VarDict) uphippo 105 105 0 0 Note that likely many variants would have been added in the T-only WGS cases if at the same time the T-only WGS specific filter bcftools filter --threads {threads} --include 'FORMAT/ALT_F1R2 > {params.strand_reads[0]} && (FORMAT/ALT_F1R2 > 0 && FORMAT/ALT_F2R1 > {params.strand_reads[0]} && FORMAT/REF_F1R2 > {params.strand_reads[0]} && FORMAT/REF_F2R1 > {params.strand_reads[0]})' --soft-filter '{params.strand_reads[1]}' --mode '+' was also removed. But as this filter also requires that some reads support a reference variant, then removing the MAX_AF 1 filter in this analysis type has no real effect.

I think we can conclude with these stats that it is safe to remove the MAX AF filter, and we can postpone extending this fix to the WGS Tumor only analysis for later.

Thanks @mathiasbio for a great summary again! I agree with your conclusion.

mathiasbio commented 9 months ago

merged into https://github.com/Clinical-Genomics/BALSAMIC/pull/1320 🥳 closing issue!