TMB: Investigate how the calculation could be improved

keyvanelhami commented 2 years ago

TMB calculation today in BALSAMIC

The current definition is based on https://github.com/Clinical-Genomics/BALSAMIC/issues/51 and the related Snakemake rule: https://github.com/Clinical-Genomics/BALSAMIC/blob/8244f837388404d236611349d9eac4cb094290d1/BALSAMIC/snakemake_rules/annotation/vep.rule#L60

How to improve the calculation

The attached article compares TMB calculation across labs that differ in lab methods, bioinformatics methods, and panel size. Its conclusions for improving panel-based TMB calculation, with WES data as the reference, are:

Filtering pathogenic variants identified in COSMIC provides a closer approximation to WES TMB.
Panel size affects the positive percent agreement (PPA) of the TMB calculation; larger panels are preferred.
Gene content in the panel is a key factor.
Removing synonymous variants and counting only non-synonymous variants didn't significantly affect the accuracy of the TMB calculation.
Filtering at >0% pMAF (from e.g. gnomAD) provides a closer approximation to WES TMB than filtering at >0.5% pMAF.
Using more than one population database may help reduce biases.
FFPE may impact TMB estimation by generating false positives.

Suggested changes in current calculation for increasing TMB accuracy

To be discussed

TMB_article copy.pdf

Documentation

[ ] For customers: include a description of how TMB is calculated, with related references, in the BALSAMIC readthedocs

ashwini06 commented 2 years ago

TMB calc in hydra https://github.com/hydra-genetics/biomarker/blob/develop/workflow/scripts/tmb.py

keyvanelhami commented 2 years ago

@ashwini06 The TMB calc in hydra looks very messy. Do you have any info on how it differs from our calculation?

From what I can see in rule tmb_calculation, we are doing the following (@ashwini06 correct me if I am wrong):

Calculating the size of the panel with awk and storing it in the variable region_size
Annotating the raw tumor vcf-file with bcftools annotate
Selecting snps,indels only with bcftools view --types
Filtering >5% VAFs with bcftools filter
Filtering variants with filter_vep --filter:
- not Existing_variation
- not COSMIC
- not non_coding_transcript_exon_variant
- not non_coding_transcript_variant
- not feature_truncation
Calculating TMB with awk by dividing the number of variants in the filtered VCF (number of fields = $NF) by region_size
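
For reference, the last step of the list above can be sketched in Python. This is a minimal sketch, not the rule itself: the function names `panel_size_bp` and `tmb_per_mb` are made up for illustration, the input is assumed to be a BED-style panel file (chrom, start, end), and the result is assumed to be expressed as variants per megabase.

```python
from typing import Iterable


def panel_size_bp(bed_lines: Iterable[str]) -> int:
    """Sum interval lengths (end - start) over BED-style lines.

    Assumption: intervals do not overlap. Overlapping targets would be
    counted twice and should be merged before calling this.
    """
    total = 0
    for line in bed_lines:
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end = line.split()[:3]
        total += int(end) - int(start)
    return total


def tmb_per_mb(n_variants: int, region_size_bp: int) -> float:
    """Number of filtered variants divided by the panel size in Mb."""
    if region_size_bp <= 0:
        raise ValueError("region size must be positive")
    return n_variants / (region_size_bp / 1_000_000)


if __name__ == "__main__":
    bed = ["chr1\t100\t600\n", "chr2\t0\t500\n"]
    print(panel_size_bp(bed))
    print(tmb_per_mb(15, 1_500_000))
```

Keeping the panel size and the division as separate functions makes it easy to check each number on its own when the TMB value looks off.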

hassanfa commented 2 years ago

Re: from talking to Keyvan on Slack

I'd recommend against using filter_vep unless you know exactly what you'd like to filter and run time is not an issue:

There are two COSMIC annotations: one done by BALSAMIC and one by VEP. filter_vep always gives priority to the VEP one, because the VEP annotation is in CSQ.
Existing_variation is extremely vague. It will also match any variant that has an rsID or dbSNP membership.
filter_vep is very slow and does not support multithreading. cyvcf2 or an equivalent can be faster.
There are also two gnomAD annotations in BALSAMIC. Lots of pipes are needed to achieve multi-layered filtering that takes the correct value of the two.

Although the hydra-genetics script is messy, at least it gives full control by extracting exactly the information that is needed. I'd suggest using the hydra-genetics code base or a similar one that is well tested and validated.
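
To illustrate what "extracting exactly the information that is needed" could look like, here is a minimal pure-Python sketch that reads named CSQ sub-fields instead of relying on a broad Existing_variation match. It is not taken from BALSAMIC or hydra-genetics. The function names are hypothetical, and the CSQ field names (`Existing_variation`, `gnomAD_AF`), the `COSV` prefix check, and the default frequency cut-off are assumptions that depend on the VEP version and flags used.

```python
from typing import Dict, List


def csq_fields(header_line: str) -> List[str]:
    """Extract CSQ sub-field names from the VEP ##INFO=<ID=CSQ,...> header."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip().rstrip('">').split("|")


def parse_csq(info: str, fields: List[str]) -> List[Dict[str, str]]:
    """Return one dict per CSQ entry (one entry per transcript/allele)."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            return [dict(zip(fields, entry.split("|")))
                    for entry in item[4:].split(",")]
    return []


def keep_for_tmb(info: str, fields: List[str], max_af: float = 0.0) -> bool:
    """Keep a variant unless an explicitly named field excludes it.

    Assumed rules, for illustration only:
    - drop if any population frequency in gnomAD_AF exceeds max_af
    - drop if Existing_variation lists a COSMIC (COSV...) identifier
    An rsID on its own does NOT exclude the variant.
    """
    for ann in parse_csq(info, fields):
        # VEP joins multiple values with "&"; empty means "not annotated".
        freqs = [float(x) for x in ann.get("gnomAD_AF", "").split("&") if x]
        if freqs and max(freqs) > max_af:
            return False
        ids = ann.get("Existing_variation", "").split("&")
        if any(i.startswith("COSV") for i in ids):
            return False
    return True


if __name__ == "__main__":
    header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
              'annotations from Ensembl VEP. Format: '
              'Allele|Consequence|Existing_variation|gnomAD_AF">')
    fields = csq_fields(header)
    print(keep_for_tmb("DP=300;CSQ=T|missense_variant|rs123|", fields))
```

The same per-field logic carries over to cyvcf2 or a similar parser; only the way the INFO string is obtained changes.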

annagellerbring commented 2 years ago

As previously discussed at the cancer meeting, I will place an order for the SeraCare TMB samples that we already have in the lab: gDNA TMB Mix Scores 7, 9, 13, 20, 26. More info is available here: https://atlas.scilifelab.se/production/lab/sample_handling/reference_samples/

What app tag should we use?

vwirta commented 2 years ago

@annagellerbring WGS for these will be expensive, so let's limit it a little bit. WGSPCFS120 for samples TMB 7, 13, and 26.

PANKTTR040 for all five samples. Baitset: GMCKsolid

annagellerbring commented 2 years ago

@keyvanelhami what delivery should I select? "Analysis"?

keyvanelhami commented 2 years ago

@annagellerbring Yes I think analysis will be good

annagellerbring commented 2 years ago

Orders placed:

WGS-order: 807109
TGA-order: 568241

keyvanelhami commented 2 years ago

@ashwini06 and @hassanfa, can you provide some feedback on how easy or hard it would be to implement the hydra-genetics TMB script in BALSAMIC?

hassanfa commented 2 years ago

The hydra-genetics developers are better placed to answer that question. I have not used it beyond checking the code base. I have my own scripts to handle/calculate TMB.

ashwini06 commented 2 years ago

@keyvanelhami It is a Python script. If we want to implement it in BALSAMIC, I think it would be good to first run the script on its own to understand its complexity and behavior. The code looks like it requires some input files (artifact and background files), so maybe Jonas from Uppsala can help with more details about the script.

On the other hand, aren't we sequencing the TMB validation samples soon? It would also be good to run our existing TMB script and check how the calculated values compare to those samples' TMB scores.

khurrammaqbool commented 2 years ago

@keyvanelhami Having looked at the script, here is a summary of its TMB calculation:

Variants are filtered on DP, VD, AF, gnomAD and db1000g using thresholds.
FFPE_SNV_artifacts is a database of recurrent SNVs for the sequencing/sample type; the filter sets the minimum number of observations at which a particular SNV is treated as a recurrent artifact.
The background panel contains the standard deviation and median of the allele frequency, which are used to filter out variants based on a threshold.
Non-synonymous variants are corrected using a correction factor.
Non-synonymous sites include missense, stop-gained and stop-lost variants.
Total TMB is the sum of all filtered synonymous and non-synonymous variants, corrected using a correction factor.

A. We can generate the allele frequency and FFPE artifact databases; the rest is straightforward.
B. We can fine-tune the threshold values according to the sequencing type.

Comment: The statistical part in the script needs correction.

Below are the threshold values:

FFPE SNV observations = 1
DP = 200
VD = 10
AF = 0.05-0.45
gnomAD = 0.0001
db1000g = 0.0001
background sd = 5
Non-synonymous correction factor = 0.78
Non-synonymous and synonymous correction factor = 0.57
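
To make the summary and thresholds above concrete, here is a rough Python sketch of how such a filter chain could look. It is not the hydra-genetics script. The variant dictionary keys, the function names, the direction and inclusiveness of each comparison, the "median + sd x threshold" background rule, and applying the correction factors as plain multipliers on the variant counts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Thresholds:
    # Values taken from the list above; how they are applied is assumed.
    min_dp: int = 200
    min_vd: int = 10
    min_af: float = 0.05
    max_af: float = 0.45
    max_gnomad: float = 0.0001
    max_db1000g: float = 0.0001
    ffpe_min_obs: int = 1
    background_sd: float = 5.0
    nonsyn_factor: float = 0.78
    total_factor: float = 0.57


# Sequence Ontology terms assumed for missense, stop gained and stop lost.
NONSYN = {"missense_variant", "stop_gained", "stop_lost"}


def passes(v: dict, t: Thresholds) -> bool:
    """Apply the depth, frequency, artifact and background filters."""
    if v["dp"] < t.min_dp or v["vd"] < t.min_vd:
        return False
    if not (t.min_af <= v["af"] <= t.max_af):
        return False
    if v.get("gnomad", 0.0) > t.max_gnomad:
        return False
    if v.get("db1000g", 0.0) > t.max_db1000g:
        return False
    # Seen at least ffpe_min_obs times in the artifact database: drop.
    if v.get("ffpe_obs", 0) >= t.ffpe_min_obs:
        return False
    # Optional (median, sd) of background allele frequency at this site.
    background = v.get("background")
    if background is not None:
        median, sd = background
        if v["af"] <= median + t.background_sd * sd:
            return False
    return True


def tmb_scores(variants: Iterable[dict],
               t: Optional[Thresholds] = None) -> Dict[str, float]:
    """Count filtered variants and scale the counts by the factors."""
    t = t or Thresholds()
    kept = [v for v in variants if passes(v, t)]
    n_nonsyn = sum(v["consequence"] in NONSYN for v in kept)
    n_syn = sum(v["consequence"] == "synonymous_variant" for v in kept)
    return {
        "nonsyn": n_nonsyn * t.nonsyn_factor,
        "total": (n_nonsyn + n_syn) * t.total_factor,
    }
```

Holding the thresholds in one dataclass is one way to support point B: a different sequencing type would only need a different `Thresholds` instance.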

annagellerbring commented 2 years ago

Now ticket #807109 is ready for analysis according to Henning.

keyvanelhami commented 2 years ago

amplewasp (TMB-7-WGS): Analysis completed
expertsatyr (TMB-13-WGS): Analysis ongoing
eagerroughy (TMB-26-WGS): In queue for sequencing

Method description from SeraCare for the TMB calculation.

annagellerbring commented 2 years ago

All cases are sequenced and analyzed now according to HO.

keyvanelhami commented 2 years ago

Summary of the TMB analysis in BALSAMIC. The BALSAMIC TMB score is taken from the file *tnscope.balsamic_stat

Case name	TMB score SeraCare	TMB score Balsamic
amplewasp (TMB-7-WGS)	7	36
expertsatyr (TMB-13-WGS)	13	33
eagerroughy (TMB-26-WGS)	26	74
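
A quick way to read the table is the ratio of the BALSAMIC score to the SeraCare reference score for each case. The snippet below is only arithmetic on the numbers above; the helper name `fold_over_expected` is made up for illustration.

```python
# (SeraCare score, BALSAMIC score) per case, copied from the table above.
results = {
    "amplewasp (TMB-7-WGS)": (7, 36),
    "expertsatyr (TMB-13-WGS)": (13, 33),
    "eagerroughy (TMB-26-WGS)": (26, 74),
}


def fold_over_expected(expected: float, observed: float) -> float:
    """Ratio of the observed TMB score to the expected reference score."""
    return observed / expected


for case, (expected, observed) in results.items():
    print(f"{case}: {fold_over_expected(expected, observed):.1f}x")
```

All three cases come out at roughly 2.5 to 5 times the reference score, so the current calculation overestimates TMB on these samples.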

keyvanelhami commented 2 years ago

@khurrammaqbool will take the lead on this issue

pbiology commented 1 year ago

Part of #1108

khurrammaqbool commented 10 months ago

This issue is summarised in https://github.com/Clinical-Genomics/BALSAMIC/issues/1108, so closing it.
