TMB calculation module for Tumor-Only-Sequencing

cio-abcd / variantinterpretation

Collaborative Interpretation-Pipeline workflow based on nf-core pipeline structure

MIT License


TMB calculation module for Tumor-Only-Sequencing #9

Open biolancer opened 1 year ago

biolancer commented 1 year ago

I have a proposal for a TMB calculation module. It assumes tumor-only sequencing, requires only a VCF and a BED file as input, and works as follows:

1) The annotated VCF is checked for mutations with a population-specific allele count above 10 in the gnomAD database; these are marked for filtering.
2) Strict hard filters are applied: AF >= 5 %, at least 50x coverage, and location within the targeted assay region.
3) Afterwards, mutations with comparable allele frequency across each genome are binned, and a ratio of filtered to unfiltered mutations is generated for each bin.
4) If the ratio for a bin favors filtering and the bin has at least 5 mutations marked for filtering, the other eligible mutations in that bin are also filtered.
5) In a last step, mutations above 90 % AF are also filtered.

The final TMB score is then eligible variants / effective panel size (in mutations per Mbp). The whole procedure follows the current implementation of the TSO500 RunManager app for TMB calculation and sounds reasonable to me.
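The score itself can be sketched as follows. This is a minimal illustration, assuming the effective panel size is the summed length of the BED intervals (0-based, half-open) after merging overlaps; the thread does not specify how the panel size is derived, and the function names `panel_size_bp` and `tmb_score` are hypothetical.

```python
from collections import defaultdict


def panel_size_bp(bed_lines):
    """Sum covered bases of BED intervals, merging overlaps per chromosome."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlapping or adjacent: extend current interval
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


def tmb_score(eligible_variants, panel_bp):
    """TMB in mutations per Mbp (assumed: eligible variants / panel size)."""
    if panel_bp <= 0:
        raise ValueError("panel size must be positive")
    return eligible_variants * 1_000_000 / panel_bp


# Toy example: two overlapping chr1 intervals and one chr2 interval.
bed = ["chr1\t100\t600", "chr1\t500\t1100", "chr2\t0\t1000"]
size = panel_size_bp(bed)
print(size, tmb_score(4, size))
```

Merging overlaps before summing avoids counting a base twice when target regions overlap, which would otherwise deflate the score.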

Originally posted by @biolancer in https://github.com/cio-abcd/variantinterpretation/issues/5#issuecomment-1466003720

sci-kai commented 1 year ago

Sounds like a good plan to start with. I have some recommendations:

1) I recommend not hard-coding any filter criteria, but setting them as defaults in nextflow.config. It should be possible to enable or disable each filter criterion through configuration parameters to allow adjustments if necessary. For example, some labs use a lower AF threshold of >= 3 % or >= 1 %.
2) I recommend adding detailed reasoning for each default filter criterion (beyond it being the Illumina standard procedure). The defaults sound reasonable, but users need to know why a particular filter is applied and what reasonable adjustments would be. For example, the last step of filtering mutations with > 90 % AF is unclear to me, especially why 90 % was chosen and not, e.g., 80 %.
3) The module should produce a report with statistics on how many variants are filtered by which criterion. Ideally, it would include a visualization that allows evaluation of every cut-off, similar to the TMB calculation module of Johannes Köster's Varlociraptor workflow. This could be, e.g., a histogram of AF for all variants.
4) Also, if we orient ourselves on Illumina's documentation, are there any conflicts or concerns with Illumina's license if we reimplement the same algorithm?
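The first and third recommendations could be sketched roughly like this. It is only an illustration: the parameter names (`min_af`, `max_af`, `min_depth`, `max_gnomad_ac`) and the variant dictionary keys are hypothetical, the defaults are the thresholds proposed earlier in this thread, and setting a threshold to `None` is assumed to disable that filter.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilterConfig:
    # Hypothetical parameter names; None disables the corresponding filter.
    min_af: Optional[float] = 0.05
    max_af: Optional[float] = 0.90
    min_depth: Optional[int] = 50
    max_gnomad_ac: Optional[int] = 10


def first_failed_filter(variant, cfg):
    """Return the name of the first failed filter, or None if the variant passes."""
    if cfg.min_af is not None and variant["af"] < cfg.min_af:
        return "min_af"
    if cfg.max_af is not None and variant["af"] > cfg.max_af:
        return "max_af"
    if cfg.min_depth is not None and variant["depth"] < cfg.min_depth:
        return "min_depth"
    if cfg.max_gnomad_ac is not None and variant["gnomad_ac"] > cfg.max_gnomad_ac:
        return "max_gnomad_ac"
    return None


def apply_filters(variants, cfg):
    """Split variants into kept ones and a per-criterion count of removals."""
    kept, reasons = [], Counter()
    for v in variants:
        reason = first_failed_filter(v, cfg)
        if reason is None:
            kept.append(v)
        else:
            reasons[reason] += 1
    return kept, reasons


variants = [
    {"af": 0.30, "depth": 200, "gnomad_ac": 0},
    {"af": 0.03, "depth": 300, "gnomad_ac": 0},
    {"af": 0.95, "depth": 120, "gnomad_ac": 0},
    {"af": 0.45, "depth": 30, "gnomad_ac": 0},
    {"af": 0.50, "depth": 150, "gnomad_ac": 250},
]
kept, reasons = apply_filters(variants, FilterConfig())
print(len(kept), dict(reasons))

# A lab using a 1 % lower bound and no upper AF bound:
relaxed_kept, relaxed_reasons = apply_filters(variants, FilterConfig(min_af=0.01, max_af=None))
```

The `reasons` counter is the kind of per-criterion statistic a report could show; note that this sketch attributes each variant only to the first filter it fails.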

biolancer commented 1 year ago

1 & 2) Good point, the cutoffs should indeed be set in the config file. I could also expose the upper bounds as adjustable parameters to allow more specific tuning of the eligible mutations. The 90 % AF cutoff is based on the following assumption: in the case of a total or partial loss of WT alleles resulting in a minor AF > 50 %, the final AF should still not exceed 90 %, as this would require a sample purity above 95 %, which I would strongly doubt for any bulk sequencing method.
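The 90 % / 95 % relationship can be reproduced with a simple mixing model. This is a sketch under assumptions not stated in the thread: tumor cells carry one mutant copy and have lost the wild-type allele, normal cells contribute two wild-type copies, and reads are sampled in proportion to copy number. The function names are hypothetical.

```python
def expected_af(purity, mut_copies=1, tumor_copies=1, normal_copies=2):
    """Expected variant AF for a somatic mutation in a tumor/normal cell mixture."""
    mutant = purity * mut_copies
    total = purity * tumor_copies + (1 - purity) * normal_copies
    return mutant / total


def min_purity_for_af(af, mut_copies=1, tumor_copies=1, normal_copies=2):
    """Purity needed to reach a given AF (expected_af solved for purity)."""
    return af * normal_copies / (mut_copies - af * tumor_copies + af * normal_copies)


# Loss of the WT allele by deletion: one mutant copy left in tumor cells.
print(round(min_purity_for_af(0.90), 3))  # 0.947

# Copy-neutral LOH (two mutant copies, two total): AF equals purity.
print(round(min_purity_for_af(0.90, mut_copies=2, tumor_copies=2), 3))  # 0.9
```

Under the deletion model, an AF of 90 % corresponds to a purity of roughly 95 %, which matches the figure given above; other copy-number states give different bounds.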

3) Yes, sounds good. I liked the TMB presentation a lot and wanted to recreate something comparable.

4) I am not really sure, to be honest, and wouldn't know how to check whether that is the case. Coverage-, BED- and AF-based filtering routines are regularly implemented in TMB calculations, so is it even possible to claim it's the same algorithm, or would it be something different if I leave out a filtering step? At no point will I reuse proprietary code or anything prewritten. I would only follow the same filtering routines during data wrangling (we would be changing the procedure either way, since we set variables instead of fixed values), but I am open to suggestions for changes to the routine.

sci-kai commented 1 year ago

2) If the 90 % threshold also depends on the sample purity, this should be documented. If this is a parameter one wants to adjust for each sample, it could be derived from the sample purity given as an optional parameter in the samplesheet. However, I think a fixed value for all samples would be a more reasonable default, because the purity is always an estimate, and low sample purities (which occur frequently) may filter too many variants.

4) I think you are right, filtering criteria such as AF, coverage, counts etc. should not be a problem. However, I am not sure about the procedure for checking ratios of filtered/unfiltered variants, as this is not commonly used. This is a QC step and, as suggested, we add our own QC to the database.

> Afterwards, it bins mutations with comparable allele frequency across each genome and generates a ratio of filtered to unfiltered mutations for each bin. If the ratio for a bin favors filtering and has at least 5 mutations marked for filtering, other eligible mutations in said bin will also be filtered.
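For reference, the quoted ratio step could look roughly like the sketch below. Several details are assumptions, since the thread does not define them: bins are fixed-width AF windows (`bin_width`), genomic position is ignored, and a bin "favors filtering" when it holds more flagged than unflagged variants. This is not the TSO500 implementation.

```python
from collections import defaultdict


def bin_filter(variants, bin_width=0.05, min_flagged=5):
    """Return updated flags after propagating filter flags within AF bins."""
    bins = defaultdict(list)
    for idx, v in enumerate(variants):
        # Assumption: variants are "comparable" if they fall in the same AF window.
        bins[int(v["af"] / bin_width)].append(idx)

    flagged = [v["flagged"] for v in variants]
    for members in bins.values():
        n_flagged = sum(variants[i]["flagged"] for i in members)
        n_unflagged = len(members) - n_flagged
        # Assumption: the ratio "favors filtering" if flagged outnumber unflagged.
        if n_flagged >= min_flagged and n_flagged > n_unflagged:
            for i in members:
                flagged[i] = True
    return flagged


# Bin 0.45-0.50: five flagged, two unflagged -> whole bin gets filtered.
# Bin 0.20-0.25: one flagged, three unflagged -> unchanged.
variants = (
    [{"af": af, "flagged": True} for af in (0.46, 0.47, 0.47, 0.48, 0.49)]
    + [{"af": af, "flagged": False} for af in (0.46, 0.48)]
    + [{"af": 0.22, "flagged": True}]
    + [{"af": af, "flagged": False} for af in (0.21, 0.23, 0.24)]
)
result = bin_filter(variants)
print(sum(result), "of", len(result), "variants flagged")
```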

biolancer commented 1 year ago

Alright. I will set up the module with both an upper and a lower bound as "hard-filter" boundaries until we add "tumor purity" as a potential metadata column to the samplesheet. I will leave out the QC-based filter for now, which also keeps the module compatible with later tumor-normal-pair input.

biolancer commented 1 year ago

Since the TMB calculation requires a BED file as input, a structural integrity check for the BED file will be implemented. It will verify compatibility with bcftools filter and the TMB calculation routine.
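A structural check of this kind might look like the following sketch. The specific rules are assumptions based on the general BED format (at least three tab-separated columns, non-negative integer coordinates, end greater than start); the thread does not list which checks the module will actually perform, and `check_bed` is a hypothetical name.

```python
def check_bed(lines):
    """Return a list of human-readable problems found in BED lines."""
    errors = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Blank lines and common header lines are skipped, not reported.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            errors.append(f"line {lineno}: expected at least 3 tab-separated columns")
            continue
        chrom, start, end = fields[:3]
        if not chrom:
            errors.append(f"line {lineno}: empty chromosome name")
        if not (start.isdecimal() and end.isdecimal()):
            errors.append(f"line {lineno}: start and end must be non-negative integers")
            continue
        if int(end) <= int(start):
            errors.append(f"line {lineno}: end must be greater than start")
    return errors


bed = [
    "chr1\t100\t200",   # valid
    "chr1\t300\t250",   # end before start
    "chr2 10 20",       # space-separated instead of tabs
    "chr3\tabc\t50",    # non-numeric start
]
for problem in check_bed(bed):
    print(problem)
```

Collecting all problems instead of stopping at the first one lets the user fix the file in a single pass.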

sci-kai commented 9 months ago

We implemented an initial module for TMB calculation and had several ideas for further features. I will collect them here and leave this issue open for further development. Features:

Additional filters
- Consequence filter: Only consider variants with specific consequence annotations. The panel size must then also be adjusted to the corresponding regions (e.g. only coding regions), or the module must at least check this and warn.
  - Note: Here the TSV-based calculation method needs to be adapted to handle different transcripts and transcript selections.
- Target region filter: Check for and filter variants outside the specified target regions.
Handling of more complex variants
- MNVs: Maybe add an option to atomize MNVs and count them as single SNVs, if reasonable?
- Improve the calculation using input BAM files: account for variants on locally amplified or lost copy-number segments as a further QC of variant allele frequencies, which would influence the filtering thresholds.
Additional checks
- Add sample purity as metadata to the samplesheet and check whether the given AF thresholds are reasonable for this sample purity.
Improving visualization and reporting
- Make the plot interactive.
- Integrate the results of the TMB calculation into the HTML report.
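As an illustration of the MNV idea, atomizing could be sketched as below. It assumes REF and ALT have equal length and simply emits one SNV per differing position; whether the resulting SNVs should each count toward TMB is the open question raised in the list, and `atomize_mnv` is a hypothetical helper, not part of the module.

```python
def atomize_mnv(chrom, pos, ref, alt):
    """Split an equal-length substitution into per-position SNVs."""
    if len(ref) != len(alt):
        raise ValueError("only equal-length substitutions are handled")
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # positions where REF and ALT agree are not variants
    ]


snvs = atomize_mnv("chr7", 100, "ACG", "TCA")
print(snvs)
```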

biolancer commented 7 months ago

As a further enhancement and to improve readability, the reporting of invalid entries in the BED file should follow the same nomenclature and reporting structure as the VCF check.