AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Updated analysis: investigate putative functionality of TP53 SNV and CNV calls #837

Closed jharenza closed 3 years ago

jharenza commented 3 years ago

What analysis module should be updated and why?

tp53_nf1_score was never fully completed because of bandwidth.

Since we would like to assess whether or not TP53 alterations are likely functional within #807, it seems like a better place to put this would be within the tp53_nf1_score module.

For reference of where this was left, see this PR comment by @jaclyn-taroni and this comment by @sjspielman and #720

What changes need to be made? Please provide enough detail for another participant to make the update.

Referencing #807, we should assess the likely functionality of TP53 alterations before calling them "altered" or "wildtype" for AUROC assessment of the classifier.
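
As a point of reference for the AUROC assessment mentioned above, here is a minimal sketch of a rank-based AUROC. It is not the module's implementation; the function name, the toy scores, and the 1/0 label encoding (1 = altered, 0 = wildtype) are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Rank-based AUROC: the fraction of (altered, wildtype) pairs in which
    the altered sample has the higher score; ties count as one half."""
    altered = [s for s, y in zip(scores, labels) if y == 1]
    wildtype = [s for s, y in zip(scores, labels) if y == 0]
    if not altered or not wildtype:
        raise ValueError("need at least one altered and one wildtype sample")
    wins = 0.0
    for a in altered:
        for w in wildtype:
            if a > w:
                wins += 1.0
            elif a == w:
                wins += 0.5
    return wins / (len(altered) * len(wildtype))


# Made-up classifier scores and labels, for illustration only.
toy_scores = [0.91, 0.78, 0.35, 0.62, 0.12, 0.40]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auroc(toy_scores, toy_labels))
```

Because the result depends directly on which samples are labelled altered, moving a sample between the two groups (for example, after a functionality assessment) changes the AUROC even when the scores stay the same.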

First

QC: Perform a correlation between RNA-Seq expression values and TP53 classifier scores. Are these inversely correlated as we would expect?
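
A minimal sketch of this QC step, assuming one TP53 expression value and one classifier score per RNA-Seq sample. The helper names and the toy values are hypothetical; Spearman correlation is used here only because it does not assume a linear relationship, and the issue does not specify which correlation to use.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up TP53 TPM values and classifier scores, for illustration only.
toy_tpm = [55.0, 40.2, 31.5, 12.3, 8.1]
toy_score = [0.10, 0.22, 0.35, 0.80, 0.93]
rho = spearman(toy_tpm, toy_score)
print(f"Spearman rho = {rho:.2f}")
```

A clearly negative rho would be consistent with the inverse relationship asked about above. Running the check separately on the stranded and poly-A samples avoids mixing the two library types.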

Second

Reduce the CNV list by focusing on CNVs which delete one or more of TP53's functional domains.

Something to keep in mind for samples that have low TP53 scores but carry alterations: TP63 and TP73 are homologues that can be functionally redundant and are rarely mutated, so if intact, they might compensate. In addition, for samples that have high TP53 scores but no alterations, we can check for amplification of MDM2, TP53's most potent negative regulator. For reference from the above paper:

The p63-AD1 (residues 1–59) and p73-AD1 (residues 1–54) are 22 and 29% identical to AD1 in p53, respectively

The p63-DBD (residues 142–321) and p73-DBD (residues 131–310) are 60 and 63% identical to the p53-DBD

Although the p63 and p73 DBD's do not appear to be mutated in human cancer, missense mutation of the p63-DBD is associated with several autosomal dominantly inherited syndromes.

For high-affinity DNA binding and transcriptional activation, p53 must be in the tetrameric form. Tetramerization of p53 is mediated through the TD (residues 326–356). The p63-TD (residues 360–390) and p73-TD (residues 353–383) are 39 and 42% identical to that in p53, respectively...Although the TD is not a mutational hotspot, mutation of this domain has been found to be causative of Li-Fraumeni syndrome in some families...Thus, it appears that tetramerization is essential for p53 to function as a tumor suppressor.

Third

We can start by annotating TP53 altered - loss, if the following conditions are met:

  1. A sample contains a TP53 hotspot SNV mutation. (Cancer hotspot database and downloadable file available). Please also crosscheck that all mutations from this table are included.
  2. A sample contains two TP53 alterations, suggesting (but not confirming) that both alleles are affected (SNV+SNV, CNV+SNV).
  3. A sample contains one alteration (SNV or CNV) + has cancer_predispositions == "Li-Fraumeni syndrome", suggesting there is a germline variant in addition to the somatic variant we observe.
  4. A sample does not have a TP53 alteration, but has cancer_predispositions == "Li-Fraumeni syndrome" and a TP53 classifier score for matched RNA-Seq > 0.5 (or a higher cutoff we decide upon later).
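
The four conditions above can be sketched as a single decision function. This is only an illustration under stated assumptions: the `Tp53Evidence` record and its field names are hypothetical, the per-sample counts are assumed to be computed upstream from the MAF and CNV files, and the 0.5 cutoff is the provisional value from condition 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tp53Evidence:
    """Per-sample evidence; all field names are hypothetical."""
    n_snv: int                # somatic TP53 SNVs kept after filtering
    n_cnv: int                # TP53 CNV losses kept after filtering
    has_hotspot_snv: bool     # any SNV present in the cancer hotspot list
    li_fraumeni: bool         # cancer_predispositions == "Li-Fraumeni syndrome"
    classifier_score: Optional[float] = None  # matched RNA-Seq score, if any


def tp53_loss_condition(ev, score_cutoff=0.5):
    """Return the number (1-4) of the first condition met, or None."""
    n_alt = ev.n_snv + ev.n_cnv
    if ev.has_hotspot_snv:                                    # condition 1
        return 1
    if ev.n_snv >= 2 or (ev.n_snv >= 1 and ev.n_cnv >= 1):    # condition 2
        return 2
    if ev.li_fraumeni and n_alt >= 1:                         # condition 3
        return 3
    if (ev.li_fraumeni and n_alt == 0                         # condition 4
            and ev.classifier_score is not None
            and ev.classifier_score > score_cutoff):
        return 4
    return None


# A Li-Fraumeni sample with no somatic alteration and a high score.
example = Tp53Evidence(n_snv=0, n_cnv=0, has_hotspot_snv=False,
                       li_fraumeni=True, classifier_score=0.82)
print(tp53_loss_condition(example))
```

Returning the condition number rather than a plain boolean keeps track of why a sample was annotated, which helps when the cutoff in condition 4 is revisited.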

Fourth

We can annotate TP53 altered - activated if a sample contains one of the two TP53 activating mutations R273C and R248W. Reference and reference.
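
A minimal sketch of this check, assuming each sample's TP53 protein changes are available in short HGVS form (as in the `HGVSp_Short` column of a MAF file, e.g. `p.R273C`). The function name and the example inputs are hypothetical.

```python
# The two mutations named above, written in short HGVS protein notation.
ACTIVATING_CHANGES = {"p.R273C", "p.R248W"}


def has_activating_mutation(protein_changes):
    """True if any of the sample's TP53 protein changes is in the list."""
    return any(change in ACTIVATING_CHANGES for change in protein_changes)


print(has_activating_mutation(["p.R175H", "p.R248W"]))
print(has_activating_mutation(["p.R175H"]))
```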

Fifth

Either assess and potentially annotate as TP53 altered or perform AUROC on above samples, then assess the below:

What input data should be used? Which data were used in the version being updated?

consensus_seg_annotated_cn_autosomes.tsv.gz
consensus_seg_annotated_cn_x_and_y.tsv.gz
pbta-gene-expression-rsem-tpm.polya.rds
pbta-gene-expression-rsem-tpm.stranded.rds
pbta-snv-consensus-mutation.maf.tsv.gz
pbta-histologies.tsv
TP53 classifier scores

When do you expect the revised analysis will be completed?

2-2.5 weeks?

Who will complete the updated analysis?

@kgaonkar6, @jharenza will review throughout

kgaonkar6 commented 3 years ago

I'm looking into the TP53 domains for step 2 using biomart + pfam, and these are the domains and locations I've found: P53_TAD (PF08563), P53 (PF00870), and P53_tetramer (PF07710). We don't have genomic location info for TAD2 (PF18521) from pfam.

bioMartDataPfam %>% dplyr::filter(hgnc_symbol=="TP53")
  hgnc_symbol pfam_id chromosome_name gene_start gene_end strand         NAME
1        TP53 PF08563              17    7661779  7687550     -1      P53_TAD
2        TP53 PF00870              17    7661779  7687550     -1          P53
3        TP53 PF07710              17    7661779  7687550     -1 P53_tetramer
4        TP53 PF18521              17    7661779  7687550     -1         <NA>
5        TP53                      17    7661779  7687550     -1         <NA>
                       DESC domain_chr domain_start domain_end
1 P53 transactivation motif         17      7676390    7676582
2    P53 DNA-binding domain         17      7673755    7676387
3 P53 tetramerisation motif         17      7670637    7673573
4                      <NA>       <NA>           NA         NA
5                      <NA>       <NA>           NA         NA

pfam location/name info was obtained from:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/pfamDesc.txt.gz
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ucscGenePfam.txt.gz

It also seems that TAD2 is not shown in the cBioPortal rendering either:

[Screenshot, 2020-11-11: cBioPortal rendering of TP53 domains]

Should we continue with only the 3 domains above, or include PF18521 and find the genomic location for that domain?

jharenza commented 3 years ago

@kgaonkar6 since this gene is on the reverse strand, the start and end locations are actually reversed. It also looks like the PFAM database is calling the TAD as one domain instead of two; the split into two domains may be a recent (last 5+ years) discovery. The TAD ends at 7676390 and the DBD starts 3 bp later at 7676387, so I think it is safe to use the TAD here as one domain.
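
With the three domains settled, the CNV filter from the second step can be sketched as an interval-overlap check. The coordinates below are the hg38 values from the table above; the function name, the closed-interval convention, and the example segments are assumptions. Overlap is used as a simple proxy for "deletes part of a domain", and the check is meant to be applied only to segments already called as copy-number losses.

```python
# hg38 coordinates on chromosome 17, copied from the pfam table above.
TP53_DOMAINS = {
    "P53_TAD": (7676390, 7676582),
    "P53": (7673755, 7676387),
    "P53_tetramer": (7670637, 7673573),
}


def domains_hit(chrom, seg_start, seg_end):
    """Return the TP53 domains that a segment overlaps.

    Intervals are treated as closed (both ends inclusive). Strand does not
    matter for overlap, so genomic start < end is used throughout.
    """
    if chrom not in ("17", "chr17"):
        return []
    return [
        name
        for name, (dom_start, dom_end) in TP53_DOMAINS.items()
        if seg_start <= dom_end and seg_end >= dom_start
    ]


# Hypothetical loss segments, for illustration only.
print(domains_hit("chr17", 7670000, 7674000))
print(domains_hit("chr17", 7600000, 7700000))
print(domains_hit("chr17", 7676388, 7676389))
```

A segment that returns an empty list would be dropped from the reduced CNV list; the last example falls in the small gap between the DBD and the TAD and so hits neither.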

kgaonkar6 commented 3 years ago

First QC: Perform a correlation between RNA-Seq expression values and TP53 classifier scores. Are these inversely correlated as we would expect?

Since this ticket is open, I'm adding a comment here about QCing stranded vs. poly-A samples as part of the step quoted above, according to the linked comment.

jharenza commented 3 years ago

Wanted to make a note of this observation (potentially for discussion in the paper).

Within the tp53_nf1_score module, we tested whether patients with Li-Fraumeni syndrome (LFS) have high TP53 classifier scores. With the exception of two patients, all had very high scores (>= 0.70). I wanted to investigate whether these two samples had a germline TP53 alteration and whether the patients truly have been diagnosed with LFS. For the former, @Yiran Guo found the variants below. For the latter, Jenn Mason and Shannon Robbins are going to attempt to track down that information from the sites.

[Screenshot, 2021-02-15]

sample_id: 7316-2310
  Kids_First_Participant_ID: PT_PFP1ZVHD
  Kids_First_Biospecimen_ID_Tumor_DNA: BS_Z9PKZ4RT
  Kids_First_Biospecimen_ID_RNA: BS_DEHJF4C7
  Kids_First_Biospecimen_ID_Normal_DNA: BS_5FP2H6VW
  cancer_predispositions: Li-Fraumeni syndrome
  path report: not mentioned
  Germline: Likely pathogenic NM_000546.6(TP53):c.541C>A (p.Arg181Ser)
  VAF: 0.5
  link: ClinVar https://www.ncbi.nlm.nih.gov/clinvar/variation/230764/

sample_id: 7316-445
  Kids_First_Participant_ID: PT_89XRZBSG
  Kids_First_Biospecimen_ID_Tumor_DNA: BS_G9MQM1KK
  Kids_First_Biospecimen_ID_RNA: BS_ZD5HN296
  Kids_First_Biospecimen_ID_Normal_DNA: BS_XHT3F34T
  cancer_predispositions: Li-Fraumeni syndrome
  path report: not mentioned
  Germline: Pathogenic NM_000546.5(TP53):c.454_466del (p.Pro152fs)
  VAF: 0.35
  link: ClinVar https://www.ncbi.nlm.nih.gov/clinvar/variation/231540/

Interestingly, both have deleterious germline variants and neither has somatic variants (that we have found), but their TP53 scores are very low, indicating functional/non-oncogenic TP53. BS_Z9PKZ4RT and BS_G9MQM1KK also have 2 copies of TP53, and both germline variants are heterozygous. Is the other copy still functional?

Germline.P_LP.PT_89XRZBSG.txt Germline.P_LP.PT_PFP1ZVHD.txt

Update: no germline or somatic SVs for these two tumors.

jharenza commented 3 years ago

An additional note on the LFS patients above. The tumor purity is very low in both of these samples, and this may be why we are both missing a second somatic hit in TP53 and seeing poor classification / low scores.

# A tibble: 2 x 2
  Kids_First_Biospecimen_ID tumor_fraction
  <chr>                              <dbl>
1 BS_G9MQM1KK                        0.165
2 BS_Z9PKZ4RT                        0.374
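
To illustrate why low purity could hide a second somatic hit, here is a rough sketch of the expected variant allele fraction (VAF) of a somatic variant at the tumor fractions above. It assumes the variant is clonal and heterozygous and that both tumor and normal cells carry 2 copies of TP53 (consistent with the copy number noted earlier); the function name and parameters are hypothetical.

```python
def expected_somatic_vaf(purity, tumor_cn=2, mutant_copies=1, normal_cn=2):
    """Expected VAF of a clonal somatic variant in a tumor/normal mixture.

    Mutant alleles come only from tumor cells; total alleles come from
    both tumor cells (purity) and normal cells (1 - purity).
    """
    mutant = purity * mutant_copies
    total = purity * tumor_cn + (1 - purity) * normal_cn
    return mutant / total


# Tumor fractions from the tibble above.
for sample, purity in [("BS_G9MQM1KK", 0.165), ("BS_Z9PKZ4RT", 0.374)]:
    print(sample, round(expected_somatic_vaf(purity), 3))
```

Under these assumptions the expected VAF is simply half the tumor fraction, so roughly 8% and 19% for the two samples. A subclonal variant would sit lower still, which may fall below what the consensus callers retain.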

kgaonkar6 commented 3 years ago

@jharenza we did discuss looking into SVs for samples with high TP53 classifier scores and no SNV/CNV. Should that be another ticket, or should we update this one?

jharenza commented 3 years ago

Let's make two new tickets to 1) annotate SVs as another alteration and 2) look at the samples with high scores that don't have TP53 alterations. For the latter, we'd look at genes upstream of TP53 and determine whether they have alterations, e.g., MDM2 amplification. I think we may hold off on 2) for the first submission, but I want to capture it in a ticket.

kgaonkar6 commented 3 years ago

I created a ticket #953 for point 1 in the above comment, please update if I have missed something. Thanks!

jharenza commented 3 years ago

This was completed with #841, #922, and #945.