kgaonkar6 commented 3 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

Update tp53-nf1-score with hotspots maf + consensus maf and latest consensus CNV file.

What was your approach?

Added a row bind of consensus_maf and hotspot_maf followed by %>% unique() to drop duplicate calls: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/73499652d84eeec13ec85840ba350c66f0c658df/analyses/tp53_nf1_score/05-tp53-altered-annotation.Rmd#L136-L178
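The combine-then-deduplicate step can be illustrated outside of R. The following is a minimal Python sketch, not the notebook's code: the rows, the sample IDs `SAMPLE_A` and `SAMPLE_B`, and the helper `combine_unique` are all hypothetical, and real MAF files carry many more columns.

```python
# Hypothetical MAF rows; sample IDs and values are made up for illustration.
consensus_maf = [
    {"Tumor_Sample_Barcode": "SAMPLE_A", "Hugo_Symbol": "TP53", "HGVSp_Short": "p.R273C"},
    {"Tumor_Sample_Barcode": "SAMPLE_B", "Hugo_Symbol": "TP53", "HGVSp_Short": "p.R248W"},
]
hotspot_maf = [
    # Same call as in the consensus MAF: should appear only once after combining.
    {"Tumor_Sample_Barcode": "SAMPLE_A", "Hugo_Symbol": "TP53", "HGVSp_Short": "p.R273C"},
    # Call present only in the hotspot MAF: should be added.
    {"Tumor_Sample_Barcode": "SAMPLE_A", "Hugo_Symbol": "TP53", "HGVSp_Short": "p.Y163C"},
]


def combine_unique(*tables):
    """Stack tables row-wise and keep the first copy of each identical row."""
    seen = set()
    combined = []
    for table in tables:
        for row in table:
            key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
            if key not in seen:
                seen.add(key)
                combined.append(row)
    return combined


combined_maf = combine_unique(consensus_maf, hotspot_maf)
for row in combined_maf:
    print(row["Tumor_Sample_Barcode"], row["HGVSp_Short"])
```

De-duplicating on the whole row means two calls only collapse when every column matches, so a call that differs in any annotation between the two files would be kept twice.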

Added code to the bash script to gather the most up-to-date consensus seg file with status:

Rscript -e "rmarkdown::render('../focal-cn-file-preparation/02-add-ploidy-consensus.Rmd', clean = TRUE)"
cp ${scratch_dir}/consensus_seg_with_status.tsv ${analysis_dir}/input/consensus_seg_with_status.tsv

What GitHub issue does your pull request address?

1072

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Should we update the TP53 CNV loss filtering process?

Previously, when CNVkit copy_number was the default, we only retained TP53 CNV losses with 1 or 0 copies, because many calls with copy_number == 2 were labeled as loss (neutral calls were being assigned 2 copies).

After the update to use Control-FREEC as the default for copy_number, the copy_number has mostly changed to 2 for samples with ploidy >= 3. That is a loss relative to the ploidy, but we miss these calls because our filter keeps only losses with 1 or 0 copies. Here's a snippet of the TP53 loss calls that are missed:

> overlaps[overlaps$Kids_First_Biospecimen_ID %in% c("BS_2J4FG4HV","BS_5JC116NM","BS_7M7JNG00","BS_823V5X6Z","BS_ZV21J6YW"),] 
DataFrame with 15 rows and 13 columns
                   cnv_gr Kids_First_Biospecimen_ID copy_number tumor_ploidy      status               domain_gr hgnc_symbol     pfam_id gene_start  gene_end         NAME
                <GRanges>               <character>   <numeric>    <numeric> <character>               <GRanges> <character> <character>  <integer> <integer>  <character>
1   chr17:114696-21459816               BS_2J4FG4HV           2            3        loss chr17:7670637-7673573:-        TP53     PF07710    7661779   7687550 P53_tetramer
2   chr17:114696-21459816               BS_2J4FG4HV           2            3        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
3   chr17:114696-21459816               BS_2J4FG4HV           2            3        loss chr17:7676390-7676582:-        TP53     PF08563    7661779   7687550      P53_TAD
4   chr17:113884-21633950               BS_5JC116NM           2            3        loss chr17:7670637-7673573:-        TP53     PF07710    7661779   7687550 P53_tetramer
5   chr17:113884-21633950               BS_5JC116NM           2            3        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
...                   ...                       ...         ...          ...         ...                     ...         ...         ...        ...       ...          ...
11  chr17:342848-19056601               BS_823V5X6Z           2            3        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
12  chr17:342848-19056601               BS_823V5X6Z           2            3        loss chr17:7676390-7676582:-        TP53     PF08563    7661779   7687550      P53_TAD
13  chr17:112392-15791752               BS_ZV21J6YW           1            2        loss chr17:7670637-7673573:-        TP53     PF07710    7661779   7687550 P53_tetramer
14  chr17:112392-15791752               BS_ZV21J6YW           1            2        loss chr17:7673755-7676387:-        TP53     PF00870    7661779   7687550          P53
15  chr17:112392-15791752               BS_ZV21J6YW           1            2        loss chr17:7676390-7676582:-        TP53     PF08563    7661779   7687550      P53_TAD
                         DESC  domain_chr
                  <character> <character>
1   P53 tetramerisation motif          17
2      P53 DNA-binding domain          17
3   P53 transactivation motif          17
4   P53 tetramerisation motif          17
5      P53 DNA-binding domain          17
...                       ...         ...
11     P53 DNA-binding domain          17
12  P53 transactivation motif          17
13  P53 tetramerisation motif          17
14     P53 DNA-binding domain          17
15  P53 transactivation motif          17
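To make the difference between the two filters concrete, here is a small Python sketch using the copy_number and tumor_ploidy values visible in the snippet above. It is not the module's R code: the function names are hypothetical, and treating "loss compared to the ploidy" as `copy_number < tumor_ploidy` is an assumption about how the comparison would be written.

```python
# One record per biospecimen, using values shown in the DataFrame snippet.
calls = [
    {"Kids_First_Biospecimen_ID": "BS_2J4FG4HV", "copy_number": 2, "tumor_ploidy": 3},
    {"Kids_First_Biospecimen_ID": "BS_5JC116NM", "copy_number": 2, "tumor_ploidy": 3},
    {"Kids_First_Biospecimen_ID": "BS_823V5X6Z", "copy_number": 2, "tumor_ploidy": 3},
    {"Kids_First_Biospecimen_ID": "BS_ZV21J6YW", "copy_number": 1, "tumor_ploidy": 2},
]


def loss_absolute(call, max_copies=1):
    """Previous filter: keep only calls with 1 or 0 copies."""
    return call["copy_number"] <= max_copies


def loss_relative_to_ploidy(call):
    """Assumed alternative: keep any call with fewer copies than the tumor ploidy."""
    return call["copy_number"] < call["tumor_ploidy"]


old_kept = [c["Kids_First_Biospecimen_ID"] for c in calls if loss_absolute(c)]
new_kept = [c["Kids_First_Biospecimen_ID"] for c in calls if loss_relative_to_ploidy(c)]
missed_by_old = [b for b in new_kept if b not in old_kept]

print("kept by absolute filter:", old_kept)
print("kept by ploidy-relative filter:", new_kept)
print("missed by absolute filter:", missed_by_old)
```

Under these assumptions the absolute filter drops every call with 2 copies in a ploidy-3 sample, while the ploidy-relative filter keeps them.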

This is the distribution of the TP53 loss calls

Is there anything that you want to discuss further?

Can we add consensus_seg_with_status.tsv as output in focal-cn-file-preparation module so I don't have to run the script here?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

tables

What is your summary of the results?

tp53_alt_status_change.txt

Updated because of hotspot mutations

sample_id	Kids_First_Biospecimen_ID_DNA	Kids_First_Biospecimen_ID_RNA	cancer_predispositions	tp53_score	SNV_indel_counts_latest	HGVSp_Short_latest	CNV_loss_evidence_latest	SV_type_latest	hotspot_latest	activating_latest	tp53_altered_latest
7316-2189	BS_02YBZSBY	BS_HJRTC9JQ	Other inherited conditions NOS	0.66189561	2	p.R306*, p.R273C	NA	NA	1	1	activated
7316-2753	BS_WDTT7PG2	BS_YMAJC22S	None documented	0.68040047	1	p.Y236Hfs*8	NA	NA	0	0	loss
7316-1746	BS_68TZMZH1	BS_0RQ4P069	None documented	0.47555198	1	p.Y163C	NA	NA	1	0	loss
7316-3631	BS_ST3Z2B9B	BS_NGHK9RZP	None documented	0.16753643	2	p.X261_splice, p.X307_splice	NA	NA	0	0	loss
7316-3920	BS_E0S2Y0TS	NA	None documented	NA	2	p.X261_splice, p.X307_splice	NA	NA	0	0	loss
7316-901	BS_1JGQPJH3	BS_A3QZB9Y2	None documented	0.32396022	2	p.X187_splice, p.X261_splice	NA	NA	0	0	loss
7316-3221	BS_FK3B5SDH	NA	NA	NA	1	p.L265P	NA	NA	1	0	loss

Updated because of updated CNV calls

sample_id	Kids_First_Biospecimen_ID_DNA	Kids_First_Biospecimen_ID_RNA	cancer_predispositions	tp53_score	HGVSp_Short_latest	CNV_loss_evidence_latest	SV_type_latest	tp53_altered_latest
7316-109	BS_VTTTQYQA	BS_KYF0Q0E7	None documented	0.76206989	NA	NA	NA	Other
7316-2322	BS_H1K33JVK	BS_D144EJRQ	None documented	0.8977062	NA	NA	NA	Other
7316-2562	BS_VKH9KYDB	BS_KN92S7YQ	None documented	0.63766482	NA	NA	NA	Other
7316-3058	BS_QWM9BPDY	BS_BWBDH9GM	Other inherited conditions NOS	0.78640104	NA	NA	NA	Other
7316-313	BS_A4KYP5H0	BS_HHPA8NJ2	None documented	0.79474203	NA	NA	NA	Other
7316-937	BS_QGX93WPF	BS_J8G4SH4Z	None documented	0.79719568	NA	NA	NA	Other

Reproducibility Checklist

[x] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[x] This analysis has been added to continuous integration.

Documentation Checklist

[x] This analysis module has a README and it is up to date.
[x] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[x] The analytical code is documented and contains comments.

kgaonkar6 commented 3 years ago

After updating to use all copy losses relative to ploidy in samples with >2 ploidy (via https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1073/commits/96a5d54e7ad701ab9077d701bcc0778d01029650), we now have only the 7 samples whose tp53_altered status changed because of using the hotspot maf in addition to the consensus mafs:

	sample_id	Kids_First_Biospecimen_ID_DNA	Kids_First_Biospecimen_ID_RNA	cancer_predispositions_latest	tp53_score	SNV_indel_counts_latest	HGVSp_Short_latest	CNV_loss_evidence_latest	SV_type_latest	hotspot_latest	activating_latest	tp53_altered_latest
1	7316-1746	BS_68TZMZH1	BS_0RQ4P069	None documented	0.4755520	1	p.Y163C	NA	NA	1	0	loss
2	7316-2189	BS_02YBZSBY	BS_HJRTC9JQ	Other inherited conditions NOS	0.6618956	2	p.R306*, p.R273C	NA	NA	1	1	activated
3	7316-2753	BS_WDTT7PG2	BS_YMAJC22S	None documented	0.6804005	1	p.Y236Hfs*8	NA	NA	0	0	loss
4	7316-3221	BS_FK3B5SDH	NA	NA	NA	1	p.L265P	NA	NA	1	0	loss
5	7316-3631	BS_ST3Z2B9B	BS_NGHK9RZP	None documented	0.1675364	2	p.X261_splice, p.X307_splice	NA	NA	0	0	loss
6	7316-3920	BS_E0S2Y0TS	NA	None documented	NA	2	p.X261_splice, p.X307_splice	NA	NA	0	0	loss
7	7316-901	BS_1JGQPJH3	BS_A3QZB9Y2	None documented	0.3239602	2	p.X187_splice, p.X261_splice	NA	NA	0	0	loss

kgaonkar6 commented 3 years ago

This is not related to the hotspot_maf / consensus CNV updates, but while going through the results I also found the following case, where the SNV is "activating" but the sample also has a CNV loss. Currently, tp53_altered == "activated" is assigned to any sample_id that has an activating SNV at protein position c("273","248"), without considering whether a CNV loss exists. Does this sound OK?

	sample_id	Kids_First_Biospecimen_ID_DNA	Kids_First_Biospecimen_ID_RNA	cancer_predispositions	tp53_score	SNV_indel_counts	CNV_loss_counts	HGVSp_Short	CNV_loss_evidence	SV_type	hotspot	activating	tp53_altered
1	7316-3058	BS_P0QJ1QAH	BS_D29RPBSZ	Other inherited conditions NOS	0.8367765	1	1	p.R273H	1	NA	1	1	activated
2	7316-388	BS_823V5X6Z	BS_RX1YTZ7F	None documented	0.5277632	1	1	p.R248W	2	NA	1	1	activated
3	7316-461	BS_P4K6WK9Y	BS_TRKH2SPE	None documented	0.9429763	1	1	p.R273H	1	NA	1	1	activated
4	7316-956	BS_MWZCP1XW	BS_B9V8RGTA	Other inherited conditions NOS	0.9611167	1	1	p.R273C	1	NA	1	1	activated
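The rule described above can be sketched in a few lines of Python. This is an illustration only, not the notebook's R implementation: the helper names and the regular expression used to pull the protein position out of HGVSp_Short are assumptions, and the function deliberately accepts but ignores the CNV loss evidence to mirror the behavior being asked about.

```python
import re

# Protein positions treated as activating, as stated in the comment above.
ACTIVATING_POSITIONS = {"273", "248"}


def protein_positions(hgvsp_short):
    """Extract protein positions from a comma-separated HGVSp_Short string."""
    if not hgvsp_short or hgvsp_short == "NA":
        return set()
    positions = set()
    for change in hgvsp_short.split(","):
        # e.g. "p.R273H" -> "273", "p.X261_splice" -> "261"
        match = re.match(r"p\.[A-Za-z*]+(\d+)", change.strip())
        if match:
            positions.add(match.group(1))
    return positions


def is_activated(hgvsp_short, cnv_loss_evidence=None):
    """True if any SNV falls on an activating position; CNV loss is not consulted."""
    return bool(protein_positions(hgvsp_short) & ACTIVATING_POSITIONS)


# (sample_id, HGVSp_Short, CNV_loss_evidence) taken from the table above.
rows = [
    ("7316-3058", "p.R273H", "1"),
    ("7316-388", "p.R248W", "2"),
    ("7316-461", "p.R273H", "1"),
    ("7316-956", "p.R273C", "1"),
]
for sample_id, hgvsp, cnv in rows:
    print(sample_id, hgvsp, "activated" if is_activated(hgvsp, cnv) else "not activated")
```

Because the position check comes first and never looks at the CNV column, all four samples are labeled activated even though each also carries CNV loss evidence.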

jharenza commented 3 years ago

> This is not related to the hotspot_maf / consensus CNV updates...
>
> But going through the results, I also found the following condition where the SNV is "activating" but the sample also has a CNV loss. The current tp53_altered status == "activated" is given to any sample_id which has the activating SNV at c("273","248") protein position and does not consider if a CNV loss exists, does this sound ok?
>
> | | sample_id | Kids_First_Biospecimen_ID_DNA | Kids_First_Biospecimen_ID_RNA | cancer_predispositions | tp53_score | SNV_indel_counts | CNV_loss_counts | SV_counts | HGVSp_Short | CNV_loss_evidence | SV_type | hotspot | activating | tp53_altered |
> | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
> | 1 | 7316-3058 | BS_P0QJ1QAH | BS_D29RPBSZ | Other inherited conditions NOS | 0.8367765 | 1 | 1 | 0 | p.R273H | 1 | NA | 1 | 1 | activated |
> | 2 | 7316-388 | BS_823V5X6Z | BS_RX1YTZ7F | None documented | 0.5277632 | 1 | 1 | 0 | p.R248W | 2 | NA | 1 | 1 | activated |
> | 3 | 7316-461 | BS_P4K6WK9Y | BS_TRKH2SPE | None documented | 0.9429763 | 1 | 1 | 0 | p.R273H | 1 | NA | 1 | 1 | activated |
> | 4 | 7316-956 | BS_MWZCP1XW | BS_B9V8RGTA | Other inherited conditions NOS | 0.9611167 | 1 | 1 | 0 | p.R273C | 1 | NA | 1 | 1 | activated |

I did notice that last night as well and I am ok with that logic.

kgaonkar6 commented 3 years ago

> After the update to use Freec as default for copy_number we see some changes where the copy_number has mostly changed to 2 for samples with >=3 ploidy which is a loss compared to the ploidy but we are missing them out because of our filter to use 1 or 0 copy loss calls only.
>
> Will you add this analysis and the plot to the 05-tp53-altered-annotation.Rmd and also update the notes at the top of the notebook to describe this?

Did you mean updating the code in 03-tp53-cnv-loss-domain.Rmd? That is the script that gathers the CNV losses; 05-tp53-altered-annotation.Rmd only aggregates all the alterations. I did remove the previous documentation about only using <=1 copy number calls as CNV losses, since we are now using all losses after reviewing that all copy number states show high inactivation [image]. I can add specific documentation that this filter was updated because we are now using controlfreec as default instead of cnvkit.

> Also, I am not seeing those samples in the latest tp53_altered_status.tsv. I'm not seeing that code change for the updated filter, either. I do see the samples removed due to new CN consensus file in loss_overlap_domains_tp53.tsv.

Could you pull the latest changes in this PR? I do see the updated copy number and the samples back in tp53_altered_status.tsv and loss_overlap_domains_tp53.tsv. For example, BS_2J4FG4HV has copy number 2 and is gathered as a loss because its ploidy is 3. Did I miss something?

> Finally, we also forgot to add TP53 fusions here as additional evidence. There is only one sample with one: BS_NJ4WPQVK and it has a classifier score of 0.81, so we should capture this as a loss as well.

Sure, I can add the fusion update in a different PR.

AlexsLemonade / OpenPBTA-analysis

#959 Tp53 classifier rerun for subtyping #1073