AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

#959 Tp53 classifier rerun for subtyping #1073

Closed kgaonkar6 closed 3 years ago

kgaonkar6 commented 3 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

Update tp53-nf1-score to use the hotspots MAF plus the consensus MAF, and the latest consensus CNV file.

What was your approach?

What GitHub issue does your pull request address?

1072

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Should we update the TP53 CNV loss filtering process?

Previously, when cnvkit provided the default copy_number, we retained only TP53 CNV losses with 1 or 0 copies, because many calls with copy_number == 2 were labeled as losses (neutral calls were being assigned 2 copies).

Now that Freec provides the default copy_number, the copy_number for samples with ploidy >= 3 has mostly changed to 2. That is a loss relative to the ploidy, but our filter, which keeps only loss calls with 1 or 0 copies, drops these samples. Here's a snippet of the TP53 loss calls that are missed:

> overlaps[overlaps$Kids_First_Biospecimen_ID %in% c("BS_2J4FG4HV","BS_5JC116NM","BS_7M7JNG00","BS_823V5X6Z","BS_ZV21J6YW"),] 
DataFrame with 15 rows and 13 columns
                   cnv_gr Kids_First_Biospecimen_ID copy_number tumor_ploidy      status               domain_gr hgnc_symbol     pfam_id gene_start  gene_end         NAME
                <GRanges>               <character>   <numeric>    <numeric> <character>               <GRanges> <character> <character>  <integer> <integer>  <character>
1   chr17:114696-21459816               BS_2J4FG4HV           2            3        loss chr17:7670637-7673573:-        TP53     PF07710    7661779   7687550 P53_tetramer
2   chr17:114696-21459816               BS_2J4FG4HV           2            3        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
3   chr17:114696-21459816               BS_2J4FG4HV           2            3        loss chr17:7676390-7676582:-        TP53     PF08563    7661779   7687550      P53_TAD
4   chr17:113884-21633950               BS_5JC116NM           2            3        loss chr17:7670637-7673573:-        TP53     PF07710    7661779   7687550 P53_tetramer
5   chr17:113884-21633950               BS_5JC116NM           2            3        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
...                   ...                       ...         ...          ...         ...                     ...         ...         ...        ...       ...          ...
11  chr17:342848-19056601               BS_823V5X6Z           2            3        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
12  chr17:342848-19056601               BS_823V5X6Z           2            3        loss chr17:7676390-7676582:-        TP53     PF08563    7661779   7687550      P53_TAD
13  chr17:112392-15791752               BS_ZV21J6YW           1            2        loss chr17:7670637-7673573:-        TP53     PF07710    7661779   7687550 P53_tetramer
14  chr17:112392-15791752               BS_ZV21J6YW           1            2        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
15  chr17:112392-15791752               BS_ZV21J6YW           1            2        loss chr17:7676390-7676582:-        TP53     PF08563    7661779   7687550      P53_TAD
                         DESC  domain_chr
                  <character> <character>
1   P53 tetramerisation motif          17
2      P53 DNA-binding domain          17
3   P53 transactivation motif          17
4   P53 tetramerisation motif          17
5      P53 DNA-binding domain          17
...                       ...         ...
11     P53 DNA-binding domain          17
12  P53 transactivation motif          17
13  P53 tetramerisation motif          17
14     P53 DNA-binding domain          17
15  P53 transactivation motif          17

This is the distribution of the TP53 loss calls: [image]
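The difference between the two filters can be sketched on the rows visible in the snippet above. This is a minimal Python illustration, not the module's R code: the function names are hypothetical, and it assumes that "loss relative to ploidy" means `copy_number < tumor_ploidy`. Only the biospecimens whose rows are printed in the snippet are included.

```python
# Rows copied from the snippet: (biospecimen_id, copy_number, tumor_ploidy, status)
calls = [
    ("BS_2J4FG4HV", 2, 3, "loss"),
    ("BS_5JC116NM", 2, 3, "loss"),
    ("BS_823V5X6Z", 2, 3, "loss"),
    ("BS_ZV21J6YW", 1, 2, "loss"),
]


def old_filter(copy_number, tumor_ploidy, status):
    """Previous rule: keep only loss calls with 1 or 0 copies."""
    return status == "loss" and copy_number <= 1


def ploidy_aware_filter(copy_number, tumor_ploidy, status):
    """Assumed new rule: keep any loss call below the sample's ploidy."""
    return status == "loss" and copy_number < tumor_ploidy


def kept_ids(rows, keep):
    """Return biospecimen IDs whose call passes the given filter."""
    return [bs_id for bs_id, cn, ploidy, status in rows if keep(cn, ploidy, status)]


kept_old = kept_ids(calls, old_filter)
kept_new = kept_ids(calls, ploidy_aware_filter)
missed_by_old = [bs_id for bs_id in kept_new if bs_id not in kept_old]
print("missed by the old filter:", missed_by_old)
```

Under these assumptions, the three ploidy-3 samples with copy_number 2 are dropped by the old rule and retained by the ploidy-aware one.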

Is there anything that you want to discuss further?

Can we add consensus_seg_with_status.tsv as an output of the focal-cn-file-preparation module so that I don't have to run the script here?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

tables

What is your summary of the results?

tp53_alt_status_change.txt

| sample_id | Kids_First_Biospecimen_ID_DNA | Kids_First_Biospecimen_ID_RNA | cancer_predispositions | tp53_score | SNV_indel_counts_latest | CNV_loss_counts_latest | SV_counts_latest | HGVSp_Short_latest | CNV_loss_evidence_latest | SV_type_latest | hotspot_latest | activating_latest | tp53_altered_latest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7316-2189 | BS_02YBZSBY | BS_HJRTC9JQ | Other inherited conditions NOS | 0.66189561 | 2 | 0 | 0 | p.R306*, p.R273C | NA | NA | 1 | 1 | activated |
| 7316-2753 | BS_WDTT7PG2 | BS_YMAJC22S | None documented | 0.68040047 | 1 | 0 | 0 | p.Y236Hfs*8 | NA | NA | 0 | 0 | loss |
| 7316-1746 | BS_68TZMZH1 | BS_0RQ4P069 | None documented | 0.47555198 | 1 | 0 | 0 | p.Y163C | NA | NA | 1 | 0 | loss |
| 7316-3631 | BS_ST3Z2B9B | BS_NGHK9RZP | None documented | 0.16753643 | 2 | 0 | 0 | p.X261_splice, p.X307_splice | NA | NA | 0 | 0 | loss |
| 7316-3920 | BS_E0S2Y0TS | NA | None documented | NA | 2 | 0 | 0 | p.X261_splice, p.X307_splice | NA | NA | 0 | 0 | loss |
| 7316-901 | BS_1JGQPJH3 | BS_A3QZB9Y2 | None documented | 0.32396022 | 2 | 0 | 0 | p.X187_splice, p.X261_splice | NA | NA | 0 | 0 | loss |
| 7316-3221 | BS_FK3B5SDH | NA | NA | NA | 1 | 0 | 0 | p.L265P | NA | NA | 1 | 0 | loss |

| sample_id | Kids_First_Biospecimen_ID_DNA | Kids_First_Biospecimen_ID_RNA | cancer_predispositions | tp53_score | SNV_indel_counts_latest | CNV_loss_counts_latest | SV_counts_latest | HGVSp_Short_latest | CNV_loss_evidence_latest | SV_type_latest | hotspot_latest | activating_latest | tp53_altered_latest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7316-109 | BS_VTTTQYQA | BS_KYF0Q0E7 | None documented | 0.76206989 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | Other |
| 7316-2322 | BS_H1K33JVK | BS_D144EJRQ | None documented | 0.8977062 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | Other |
| 7316-2562 | BS_VKH9KYDB | BS_KN92S7YQ | None documented | 0.63766482 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | Other |
| 7316-3058 | BS_QWM9BPDY | BS_BWBDH9GM | Other inherited conditions NOS | 0.78640104 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | Other |
| 7316-313 | BS_A4KYP5H0 | BS_HHPA8NJ2 | None documented | 0.79474203 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | Other |
| 7316-937 | BS_QGX93WPF | BS_J8G4SH4Z | None documented | 0.79719568 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | Other |

Reproducibility Checklist

Documentation Checklist

kgaonkar6 commented 3 years ago

After updating the filter to use all copy number losses, including losses relative to ploidy in samples with ploidy > 2, via https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1073/commits/96a5d54e7ad701ab9077d701bcc0778d01029650

We now have only the 7 samples whose tp53_altered status changed as a result of using the hotspot MAF in addition to the consensus MAFs:

| | sample_id | Kids_First_Biospecimen_ID_DNA | Kids_First_Biospecimen_ID_RNA | cancer_predispositions_latest | tp53_score | SNV_indel_counts_latest | CNV_loss_counts_latest | SV_counts_latest | HGVSp_Short_latest | CNV_loss_evidence_latest | SV_type_latest | hotspot_latest | activating_latest | tp53_altered_latest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7316-1746 | BS_68TZMZH1 | BS_0RQ4P069 | None documented | 0.4755520 | 1 | 0 | 0 | p.Y163C | NA | NA | 1 | 0 | loss |
| 2 | 7316-2189 | BS_02YBZSBY | BS_HJRTC9JQ | Other inherited conditions NOS | 0.6618956 | 2 | 0 | 0 | p.R306*, p.R273C | NA | NA | 1 | 1 | activated |
| 3 | 7316-2753 | BS_WDTT7PG2 | BS_YMAJC22S | None documented | 0.6804005 | 1 | 0 | 0 | p.Y236Hfs*8 | NA | NA | 0 | 0 | loss |
| 4 | 7316-3221 | BS_FK3B5SDH | NA | NA | NA | 1 | 0 | 0 | p.L265P | NA | NA | 1 | 0 | loss |
| 5 | 7316-3631 | BS_ST3Z2B9B | BS_NGHK9RZP | None documented | 0.1675364 | 2 | 0 | 0 | p.X261_splice, p.X307_splice | NA | NA | 0 | 0 | loss |
| 6 | 7316-3920 | BS_E0S2Y0TS | NA | None documented | NA | 2 | 0 | 0 | p.X261_splice, p.X307_splice | NA | NA | 0 | 0 | loss |
| 7 | 7316-901 | BS_1JGQPJH3 | BS_A3QZB9Y2 | None documented | 0.3239602 | 2 | 0 | 0 | p.X187_splice, p.X261_splice | NA | NA | 0 | 0 | loss |

kgaonkar6 commented 3 years ago

This is not related to the hotspot_maf / consensus CNV updates, but while going through the results I also found the following case, where the SNV is "activating" but the sample also has a CNV loss. Currently, tp53_altered == "activated" is assigned to any sample_id that has an activating SNV at protein position c("273","248"), without considering whether a CNV loss exists. Does this sound ok?

| | sample_id | Kids_First_Biospecimen_ID_DNA | Kids_First_Biospecimen_ID_RNA | cancer_predispositions | tp53_score | SNV_indel_counts | CNV_loss_counts | SV_counts | HGVSp_Short | CNV_loss_evidence | SV_type | hotspot | activating | tp53_altered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7316-3058 | BS_P0QJ1QAH | BS_D29RPBSZ | Other inherited conditions NOS | 0.8367765 | 1 | 1 | 0 | p.R273H | 1 | NA | 1 | 1 | activated |
| 2 | 7316-388 | BS_823V5X6Z | BS_RX1YTZ7F | None documented | 0.5277632 | 1 | 1 | 0 | p.R248W | 2 | NA | 1 | 1 | activated |
| 3 | 7316-461 | BS_P4K6WK9Y | BS_TRKH2SPE | None documented | 0.9429763 | 1 | 1 | 0 | p.R273H | 1 | NA | 1 | 1 | activated |
| 4 | 7316-956 | BS_MWZCP1XW | BS_B9V8RGTA | Other inherited conditions NOS | 0.9611167 | 1 | 1 | 0 | p.R273C | 1 | NA | 1 | 1 | activated |

jharenza commented 3 years ago

> This is not related to the hotspot_maf/ consensus CNV updates...
>
> But going through the results, I also found the following condition where the SNV is "activating" but the sample also has a CNV loss. The current tp53_altered status == "activated" is given to any sample_id which has the activating SNV at c("273","248") protein position and does not consider if a CNV loss exists, does this sound ok?
>
> | | sample_id | Kids_First_Biospecimen_ID_DNA | Kids_First_Biospecimen_ID_RNA | cancer_predispositions | tp53_score | SNV_indel_counts | CNV_loss_counts | SV_counts | HGVSp_Short | CNV_loss_evidence | SV_type | hotspot | activating | tp53_altered |
> | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> | 1 | 7316-3058 | BS_P0QJ1QAH | BS_D29RPBSZ | Other inherited conditions NOS | 0.8367765 | 1 | 1 | 0 | p.R273H | 1 | NA | 1 | 1 | activated |
> | 2 | 7316-388 | BS_823V5X6Z | BS_RX1YTZ7F | None documented | 0.5277632 | 1 | 1 | 0 | p.R248W | 2 | NA | 1 | 1 | activated |
> | 3 | 7316-461 | BS_P4K6WK9Y | BS_TRKH2SPE | None documented | 0.9429763 | 1 | 1 | 0 | p.R273H | 1 | NA | 1 | 1 | activated |
> | 4 | 7316-956 | BS_MWZCP1XW | BS_B9V8RGTA | Other inherited conditions NOS | 0.9611167 | 1 | 1 | 0 | p.R273C | 1 | NA | 1 | 1 | activated |

I did notice that last night as well and I am ok with that logic.
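The precedence being agreed to here can be sketched as follows. This is a hedged Python illustration rather than the notebook's R code: the function names and the regular expression are hypothetical, and the "loss" and "Other" branches are an assumption that is merely consistent with the rows shown in this thread. Only the rule that an activating SNV at position 273 or 248 yields "activated" regardless of CNV loss comes from the discussion.

```python
import re

# Protein positions treated as activating, per the discussion: c("273", "248")
ACTIVATING_POSITIONS = {273, 248}


def protein_positions(hgvsp_short):
    """Extract protein positions from a comma-separated HGVSp_Short string."""
    if hgvsp_short is None or hgvsp_short == "NA":
        return set()
    positions = set()
    for change in hgvsp_short.split(","):
        # e.g. "p.R273C" -> 273, "p.X261_splice" -> 261, "p.Y236Hfs*8" -> 236
        match = re.match(r"p\.[A-Za-z*]+(\d+)", change.strip())
        if match:
            positions.add(int(match.group(1)))
    return positions


def tp53_altered(hgvsp_short, snv_indel_counts, cnv_loss_counts, sv_counts):
    """Assign a status; an activating SNV wins even when a CNV loss exists."""
    if protein_positions(hgvsp_short) & ACTIVATING_POSITIONS:
        return "activated"
    # Assumption: any remaining alteration evidence is called a loss.
    if snv_indel_counts + cnv_loss_counts + sv_counts > 0:
        return "loss"
    return "Other"


examples = {
    "7316-388": tp53_altered("p.R248W", 1, 1, 0),            # activating SNV plus CNV loss
    "7316-2189": tp53_altered("p.R306*, p.R273C", 2, 0, 0),  # one of two SNVs is activating
    "7316-2753": tp53_altered("p.Y236Hfs*8", 1, 0, 0),       # non-activating SNV
    "7316-109": tp53_altered("NA", 0, 0, 0),                 # no alteration evidence
}
print(examples)
```

Checking the activating positions first is what makes samples such as 7316-388 come out as "activated" even though they also carry a CNV loss.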

kgaonkar6 commented 3 years ago

> > After the update to use Freec as default for copy_number we see some changes where the copy_number has mostly changed to 2 for samples with >=3 ploidy which is a loss compared to the ploidy but we are missing them out because of our filter to use 1 or 0 copy loss calls only.
>
> Will you add this analysis and the plot to the 05-tp53-altered-annotation.Rmd and also update the notes at the top of the notebook to describe this?

Did you mean to update the code in 03-tp53-cnv-loss-domain.Rmd? That is the script that gathers the CNV losses; 05-tp53-altered-annotation.Rmd only aggregates all the alterations. I removed the previous documentation about using only calls with copy number <= 1 as CNV losses, since we now use all losses after reviewing that all copy number states show high inactivation: [image]. I can add specific documentation that this filter was updated because we now use controlfreec as the default instead of cnvkit.

> Also, I am not seeing those samples in the latest tp53_altered_status.tsv. I'm not seeing that code change for the updated filter, either. I do see the samples removed due to new CN consensus file in loss_overlap_domains_tp53.tsv.

Could you pull the latest changes in this PR? I do see the updated copy number and the samples back in tp53_altered_status.tsv and loss_overlap_domains_tp53.tsv. For example, BS_2J4FG4HV has copy number 2 and is gathered as a loss because its ploidy is 3. Did I miss something?

> Finally, we also forgot to add TP53 fusions here as additional evidence. There is only one sample with one: BS_NJ4WPQVK and it has a classifier score of 0.81, so we should capture this as a loss as well.

Sure, I can add the fusion update in a different PR.