Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

TMB: Investigate how the calculation could be improved #912

Closed keyvanelhami closed 10 months ago

keyvanelhami commented 2 years ago

TMB calculation today in BALSAMIC

The current definition is based on https://github.com/Clinical-Genomics/BALSAMIC/issues/51 and on the related snakemake rule: https://github.com/Clinical-Genomics/BALSAMIC/blob/8244f837388404d236611349d9eac4cb094290d1/BALSAMIC/snakemake_rules/annotation/vep.rule#L60

How to improve the calculation

The attached article compares TMB results between labs that use different lab methods, bioinformatic methods and panel sizes. Its conclusions for improving the TMB calculation, relative to WES data, are:

  1. Filtering pathogenic variants identified in COSMIC provides a closer approximation to WES TMB.
  2. Panel size affects the positive percent agreement (PPA) of the TMB calculation; larger panels are preferred.
  3. Gene content in the panel is a key factor.
  4. Removing synonymous variants and counting only non-synonymous variants did not significantly affect the accuracy of the TMB calculation.
  5. A pMAF cutoff of >0% (from e.g. gnomAD) provides a closer approximation to WES TMB than a cutoff of >0.5%.
  6. Using more than one population database may help reduce biases.
  7. FFPE may impact TMB estimation by generating false positives.
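
To make points 4 and 5 concrete, here is a minimal sketch of how the pMAF cutoff and the synonymous filter change a per-megabase count. It assumes that variants above the cutoff are removed before counting. The `Variant` record, the `tmb` function, the toy variants and the 2.0 Mb panel size are all hypothetical and are not taken from the article or from BALSAMIC.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    consequence: str  # e.g. "missense_variant", "synonymous_variant"
    pmaf: float       # population minor allele frequency, 0.0 if absent from the database


def tmb(variants, panel_size_mb, max_pmaf, include_synonymous=True):
    """Count variants with pmaf <= max_pmaf and normalise to mutations per Mb."""
    kept = [
        v for v in variants
        if v.pmaf <= max_pmaf
        and (include_synonymous or v.consequence != "synonymous_variant")
    ]
    return len(kept) / panel_size_mb


# Toy data, purely illustrative.
variants = [
    Variant("missense_variant", 0.0),
    Variant("synonymous_variant", 0.0),
    Variant("missense_variant", 0.002),
    Variant("stop_gained", 0.0),
    Variant("missense_variant", 0.02),
]
PANEL_MB = 2.0  # hypothetical panel size

strict = tmb(variants, PANEL_MB, max_pmaf=0.0)     # remove anything with pMAF > 0%
lenient = tmb(variants, PANEL_MB, max_pmaf=0.005)  # remove only pMAF > 0.5%
nonsyn_only = tmb(variants, PANEL_MB, max_pmaf=0.0, include_synonymous=False)
```

With real data, the same three numbers could be compared against a WES-derived TMB for the same sample to see which setting tracks it best.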

Suggested changes in current calculation for increasing TMB accuracy

To be discussed

TMB_article copy.pdf

Documentation

[ ] For customers: include a description of how TMB is calculated, with related references, in the balsamic readthedocs

ashwini06 commented 2 years ago

TMB calculation in hydra: https://github.com/hydra-genetics/biomarker/blob/develop/workflow/scripts/tmb.py

keyvanelhami commented 2 years ago

@ashwini06 The TMB calculation in hydra looks very messy. Do you have any info on how it differs from our calculation?

What I can see from rule tmb_calculation: we are performing (@ashwini06 correct me if I am wrong):

hassanfa commented 2 years ago

Re: from talking to Keyvan on Slack

I'd recommend against using filter_vep unless you know exactly what you'd like to filter and run time is not an issue:

Although hydragenetics' script is messy, it at least keeps full control by extracting exactly the information that is needed. I'd suggest using the hydragenetics code base, or a similar one that is well tested and validated.

annagellerbring commented 2 years ago

As previously discussed at the cancer meeting, I will place an order for the Seracare TMB samples that we already have in the lab: gDNA TMB Mix Scores 7, 9, 13, 20, 26. More info is available here: https://atlas.scilifelab.se/production/lab/sample_handling/reference_samples/

What app tag should we use?

vwirta commented 2 years ago

@annagellerbring WGS for these will be expensive, so let's limit it a little bit. WGSPCFS120 for samples TMB 7, 13, and 26.

PANKTTR040 for all five samples. Baitset: GMCKsolid

annagellerbring commented 2 years ago

@keyvanelhami what delivery should I select? "Analysis"?

keyvanelhami commented 2 years ago

@annagellerbring Yes, I think "Analysis" will be good.

annagellerbring commented 2 years ago

Orders placed:

keyvanelhami commented 2 years ago

@ashwini06 and @hassanfa, can you provide some feedback on how easy or hard it would be to implement hydragenetics' TMB script in Balsamic?

hassanfa commented 2 years ago

The hydragenetics developers are better placed to answer that question. I have not used it beyond checking the code base. I have my own scripts to handle and calculate TMB.

ashwini06 commented 2 years ago

@keyvanelhami It is a Python script. If we want to implement it in BALSAMIC, I think it would be good to first run the script on its own to understand its complexity and behavior. The code looks like it requires some input files (artifact and background files), so maybe Jonas from Uppsala can help with more details about the script.

On the other hand, aren't we sequencing the TMB validation samples soon? It would also be good to run our existing TMB script and check how the calculated values compare with those samples' TMB scores.

khurrammaqbool commented 2 years ago

@keyvanelhami Looking at the script, below is a summary of the TMB calculation:

  1. The variants are filtered on DP, VD, AF, gnomAD and db1000g using thresholds.
  2. FFPE_SNV_artifacts is a database of recurrent SNVs for the sequencing/sample type; the filter can be set to the minimum number of observations at which a particular SNV counts as a recurrently observed artifact.
  3. The background panel contains the standard deviation and median of the allele frequency, which are used to filter out variants based on the threshold.
  4. Non-synonymous variants are corrected with the correction factor.
  5. Non-synonymous sites include missense, stop gained and stop loss variants.
  6. Total TMB is the sum of all filtered synonymous and non-synonymous variants, corrected with the correction factor.

A. We can generate the allele frequency and FFPE artifact databases; the rest is straightforward.
B. We can fine-tune the threshold values according to the sequencing type.

Comment: The statistical part in the script needs correction.

Below are the threshold values:

  - FFPE SNV observations = 1
  - DP = 200
  - VD = 10
  - AF = 0.05-0.45
  - gnomAD = 0.0001
  - db1000g = 0.0001
  - background sd = 5
  - Non-synonymous correction factor = 0.78
  - Non-synonymous and synonymous correction factor = 0.57
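
The filtering steps and thresholds above could be sketched roughly as follows. This is not the hydra script. The `Call` record, the function names and the toy calls are hypothetical. The direction of each comparison is an assumption (minimum for DP and VD, an inclusive range for AF, maximum for the population frequencies), and so is the use of the correction factors as plain multipliers on the variant counts. The FFPE artifact and background filters are left out because they need databases that are not described here.

```python
from dataclasses import dataclass

# Threshold values as listed above; how each one is applied is an assumption.
MIN_DP = 200
MIN_VD = 10
AF_RANGE = (0.05, 0.45)
MAX_GNOMAD = 0.0001
MAX_DB1000G = 0.0001
NONSYN_FACTOR = 0.78  # non-synonymous correction factor
ALL_FACTOR = 0.57     # non-synonymous and synonymous correction factor

NONSYN = {"missense_variant", "stop_gained", "stop_lost"}


@dataclass
class Call:
    dp: int          # total read depth
    vd: int          # variant-supporting depth
    af: float        # allele frequency in the sample
    gnomad: float    # population frequency in gnomAD
    db1000g: float   # population frequency in 1000 Genomes
    consequence: str


def passes(c):
    """Apply the depth, allele-frequency and population-frequency thresholds."""
    return (
        c.dp >= MIN_DP
        and c.vd >= MIN_VD
        and AF_RANGE[0] <= c.af <= AF_RANGE[1]
        and c.gnomad <= MAX_GNOMAD
        and c.db1000g <= MAX_DB1000G
    )


def tmb_estimates(calls):
    """Return (non-synonymous estimate, total estimate) from the filtered calls."""
    kept = [c for c in calls if passes(c)]
    nonsyn = sum(c.consequence in NONSYN for c in kept)
    syn = sum(c.consequence == "synonymous_variant" for c in kept)
    return nonsyn * NONSYN_FACTOR, (nonsyn + syn) * ALL_FACTOR


# Toy calls, purely illustrative.
calls = [
    Call(500, 50, 0.10, 0.0, 0.0, "missense_variant"),     # kept
    Call(500, 100, 0.20, 0.0, 0.0, "synonymous_variant"),  # kept
    Call(150, 30, 0.20, 0.0, 0.0, "missense_variant"),     # fails DP
    Call(500, 250, 0.50, 0.0, 0.0, "stop_gained"),         # fails AF range
    Call(400, 40, 0.10, 0.01, 0.0, "missense_variant"),    # fails gnomAD
    Call(300, 60, 0.20, 0.0, 0.0, "stop_gained"),          # kept
]
nonsyn_tmb, total_tmb = tmb_estimates(calls)
```

Keeping the thresholds as named constants makes it easy to fine-tune them per sequencing type, as suggested in point B.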

annagellerbring commented 2 years ago

Now ticket #807109 is ready for analysis according to Henning.

keyvanelhami commented 2 years ago

amplewasp (TMB-7-WGS): Analysis completed
expertsatyr (TMB-13-WGS): Analysis ongoing
eagerroughy (TMB-26-WGS): In queue for sequencing

Method description from SeraCare for the TMB calculation.

annagellerbring commented 2 years ago

All cases are sequenced and analyzed now according to HO.

keyvanelhami commented 2 years ago

Summary of TMB analysis in Balsamic. The TMB score from Balsamic is taken from file *tnscope.balsamic_stat

Case name                   TMB score SeraCare   TMB score Balsamic
amplewasp (TMB-7-WGS)       7                    36
expertsatyr (TMB-13-WGS)    13                   33
eagerroughy (TMB-26-WGS)    26                   74

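
For a quick view of how far off the Balsamic scores are, the ratio of Balsamic to SeraCare score can be computed per case. The snippet below only uses the six values from the table; the variable names are arbitrary.

```python
# (SeraCare score, Balsamic score) per case, copied from the table above.
scores = {
    "amplewasp (TMB-7-WGS)": (7, 36),
    "expertsatyr (TMB-13-WGS)": (13, 33),
    "eagerroughy (TMB-26-WGS)": (26, 74),
}

# Ratio of the Balsamic score to the expected SeraCare score.
ratios = {
    case: balsamic / seracare
    for case, (seracare, balsamic) in scores.items()
}

for case, ratio in ratios.items():
    print(f"{case}: Balsamic/SeraCare = {ratio:.2f}")
```

The ratios come out at roughly 5.1, 2.5 and 2.8, so Balsamic overestimates all three samples, and not by a constant factor.
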
keyvanelhami commented 2 years ago

@khurrammaqbool will take the lead on this issue.

pbiology commented 1 year ago

Part of #1108

khurrammaqbool commented 10 months ago

This issue is summarised in https://github.com/Clinical-Genomics/BALSAMIC/issues/1108, so closing it.