cio-abcd / variantinterpretation

Collaborative Interpretation-Pipeline workflow based on nf-core pipeline structure
MIT License

TMB calculation module for Tumor-Only-Sequencing #9

Open biolancer opened 1 year ago

biolancer commented 1 year ago

I have a proposal for a TMB calculation module. It assumes tumor-only sequencing, requires only a VCF and a BED file as input, and follows this procedure:

The final TMB score would then be eligible variants / effective panel size (in mutations per Mbp). The whole procedure follows the current implementation of the TSO500 RunManager app for TMB calculation and sounds reasonable to me.
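A minimal sketch of the final division, to make the units explicit. The function names are hypothetical, and the "effective panel size" is assumed here to be the merged length of the BED intervals; any coverage-based trimming of the panel is not modelled.

```python
def panel_size_bp(intervals):
    """Total length of BED-style intervals (0-based start, exclusive end).

    Overlapping intervals on the same chromosome are merged first so that
    no base is counted twice. Assumption: this merged length stands in for
    the "effective panel size".
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


def tmb_per_mbp(eligible_variants, size_bp):
    """TMB = eligible variants / panel size, expressed in mutations per Mbp."""
    if size_bp <= 0:
        raise ValueError("panel size must be positive")
    return eligible_variants / (size_bp / 1_000_000)


if __name__ == "__main__":
    bed = [("chr1", 0, 500_000), ("chr1", 400_000, 1_000_000), ("chr2", 0, 1_000_000)]
    size = panel_size_bp(bed)
    print(size, tmb_per_mbp(20, size))
```

Merging before summing matters for panels whose BED file lists overlapping amplicons; without it the denominator would be inflated and the TMB underestimated.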

Originally posted by @biolancer in https://github.com/cio-abcd/variantinterpretation/issues/5#issuecomment-1466003720

sci-kai commented 1 year ago

Sounds like a good plan to start with. I have some recommendations:

biolancer commented 1 year ago

1 & 2) Good point, the cutoffs should indeed be set in the config file. I could also make the upper boundaries a changeable variable to allow more specific tuning of the eligible mutations. The logic behind the 90 % AF cutoff is based on the assumption that, in the case of a total or partial loss of WT alleles resulting in a minor AF > 50 %, the final AF should not exceed 90 %, as this would require a sample purity above 95 %, which I would strongly dispute for any bulk sequencing method.
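The purity argument can be checked with a small sketch. It assumes one simple model that is not spelled out in the thread: normal cells carry two WT copies, and tumor cells carry a single mutant copy after the WT allele is lost. Under that assumption the expected AF at purity p is p / (2 - p). The function names are hypothetical.

```python
def expected_af(purity):
    """Expected variant AF when tumor cells keep one mutant copy and lose the WT allele.

    Assumed model: tumor fraction `purity` contributes 1 mutant allele per cell,
    normal fraction (1 - purity) contributes 2 WT alleles per cell.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    return purity / (2.0 - purity)


def purity_needed_for_af(af):
    """Invert the model above: purity required to observe a given AF."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must be between 0 and 1")
    return 2.0 * af / (1.0 + af)


if __name__ == "__main__":
    # Under this model, AF = 0.90 needs a purity of roughly 0.95.
    print(round(purity_needed_for_af(0.90), 3))
```

Other loss-of-heterozygosity models (for example copy-neutral LOH) give a different relationship, so the 90 % cutoff should be read as a rule of thumb rather than a derived constant.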

3) Yes, sounds good. I liked the TMB presentation a lot and wanted to recreate something comparable.

4) I am not really sure, to be honest, and wouldn't know how to check whether that is the case. Coverage-, BED- and AF-based filtering routines are regularly implemented in TMB calculations, so is it even possible to claim it is the same algorithm, or would it be something different if I leave out a filtering step? At no point will I be reusing proprietary code or anything prewritten. I would only follow the same filtering routines during data wrangling (we would be changing the procedure either way, since we set variables instead of fixed values), but I am open to suggestions for changes to the routine.

sci-kai commented 1 year ago

2) If the 90 % threshold also depends on the sample purity, maybe this should be documented. Also, if this is a parameter one wants to adjust for each sample, it could depend on the sample purity as an optional parameter in the samplesheet. However, I think a fixed value for each sample would be a more reasonable default, because the purity is always an estimate, and low purities in samples (which frequently occur) may filter too many variants.

4) I think you are right, the filtering criteria such as AF, coverage, counts etc. should not be a problem. However, I am not sure about the procedure for checking ratios of filtered/unfiltered variants, as this is not commonly used. This is a QC step and, as suggested, we add our own QC to the database.

Afterwards, it bins mutations with comparable allele frequency across each genome and generates a ratio of filtered to unfiltered mutations for each bin. If the ratio for a bin favors filtering and has at least 5 mutations marked for filtering, other eligible mutations in said bin will also be filtered.
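For reference, one possible reading of the quoted bin-based step is sketched below. Everything beyond the quote is an assumption: the bin width of 0.05, the interpretation of "favors filtering" as "more filtered than unfiltered mutations in the bin", and the function and field names. It is not the vendor's implementation.

```python
from collections import defaultdict


def bin_ratio_filter(variants, bin_width=0.05, min_filtered=5):
    """Propagate filtering within allele-frequency bins.

    variants: list of dicts with keys 'af' (float) and 'filtered' (bool).
    Returns a list of booleans, one per variant, with the updated filter flag.

    Assumptions: fixed-width AF bins, and a bin "favors filtering" when
    filtered mutations outnumber unfiltered ones.
    """
    bins = defaultdict(list)
    for idx, variant in enumerate(variants):
        bins[int(variant["af"] / bin_width)].append(idx)

    flags = [variant["filtered"] for variant in variants]
    for members in bins.values():
        n_filtered = sum(1 for i in members if variants[i]["filtered"])
        n_unfiltered = len(members) - n_filtered
        if n_filtered >= min_filtered and n_filtered > n_unfiltered:
            # The whole bin is treated as suspicious: filter the rest too.
            for i in members:
                flags[i] = True
    return flags
```

A bin with only two or three filtered mutations is left untouched regardless of its ratio, which is what the "at least 5 mutations" condition in the quote guards against.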

biolancer commented 1 year ago

Alright. I will set up the module with both an upper and a lower bound as "hard-filter" boundaries until we implement "tumor purity" as an optional metadata column in the samplesheet. I will leave out the QC-based filter for now, which also keeps the module compatible with later tumor-normal-pair input.
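The hard-filter boundaries could look roughly like the sketch below. The upper default of 0.90 comes from the discussion above; the lower default of 0.05 is only a placeholder, and the function names and the inclusive comparison are assumptions.

```python
def passes_af_bounds(af, min_af=0.05, max_af=0.90):
    """Return True if the allele frequency lies within the hard-filter bounds.

    min_af=0.05 is a placeholder value, max_af=0.90 follows the thread.
    Both bounds are treated as inclusive here (an assumption).
    """
    if min_af > max_af:
        raise ValueError("min_af must not exceed max_af")
    return min_af <= af <= max_af


def count_eligible(afs, min_af=0.05, max_af=0.90):
    """Count variants whose AF passes both bounds."""
    return sum(1 for af in afs if passes_af_bounds(af, min_af, max_af))


if __name__ == "__main__":
    print(count_eligible([0.03, 0.05, 0.40, 0.90, 0.95]))
```

Keeping both bounds as plain parameters means they can later be filled from the config file or from a per-sample purity column without changing the filter itself.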

biolancer commented 1 year ago

Since the TMB calculation requires a BED file as input, a structural integrity check for the BED file will be implemented. The check will verify compatibility with bcftools filter and the TMB calculation routine.
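A minimal sketch of what such a structural check might cover, assuming the basic three-column BED layout (chromosome, 0-based start, exclusive end, tab-separated). The function name and the error messages are hypothetical, and the actual module may check more or different things.

```python
def check_bed_lines(lines):
    """Validate BED lines and return a list of (line_number, message) problems.

    Checks only the three mandatory columns: a non-empty chromosome name,
    non-negative integer start and end, and start < end.
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Skip blank lines, comments and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append((lineno, "fewer than 3 tab-separated columns"))
            continue
        chrom, start, end = fields[:3]
        if not chrom:
            problems.append((lineno, "empty chromosome name"))
        if not all(value.isascii() and value.isdigit() for value in (start, end)):
            problems.append((lineno, "start/end are not non-negative integers"))
            continue
        if int(start) >= int(end):
            problems.append((lineno, "start is not smaller than end"))
    return problems


if __name__ == "__main__":
    example = ["chr1\t100\t200\n", "chr1\t300\t250\n", "chr2 10 20\n"]
    for lineno, message in check_bed_lines(example):
        print(f"line {lineno}: {message}")
```

Collecting all problems instead of stopping at the first one makes it easier to report them in one pass.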

sci-kai commented 9 months ago

We implemented an initial module for TMB calculation. We had several ideas for additional features for this module. I will collect them here and leave this issue open for further development. Features:

biolancer commented 7 months ago

As a further enhancement and to increase readability, the reporting of invalid entries in the BED file should follow the same nomenclature and reporting structure as the VCF check.