Update TP53 module to accommodate tumor purity filtering

sjspielman commented 1 year ago

Part of #1624

This PR begins the process of running TP53 with tumor purity filtered data. Since a lot of module changes had to be made to generate new results, this PR focuses only on that step. There will be a second PR that adds a final notebook to this module in order to compare these results to the original ones reported in the manuscript (those next steps are started in this branch: https://github.com/sjspielman/OpenPBTA-analysis/tree/tumor_purity-tp53-notebook).

I made a variety of changes here while trying to keep the existing code as-is as much as possible. I created a new script, run_classifier-tumor-purity-threshold.sh, which runs the relevant scripts in this module to re-generate results with the filtered data. For stranded data only, this script runs (or skips, where marked with ❌) the following:

01-apply-classifier.py
- This script now takes two additional flags: one to indicate whether tumor purity filtering is turned on, and another with the path to the TSV file of IDs to filter to. In addition, it ensures output files are identifiable as being tumor purity filtered.
❌ 02-qc-rna_expression_score.Rmd is not run as it does not produce any output that is consumed later.
03-tp53-cnv-loss-domain.Rmd and 04-tp53-sv-loss.Rmd
- These files produce inputs that are needed for 05-tp53-altered-annotation.Rmd.
- They now include an Rmd param and file name scheme to handle whether the data is tumor purity filtered.
05-tp53-altered-annotation.Rmd
- It also now includes an Rmd param and file name scheme for export.
06-evaluate-classifier.py
- Generates files that are used to plot ROC curves, consuming output from 05-tp53-altered-annotation.Rmd.
❌ 07-plot-roc.R is not run. ROC plots will be made separately in the forthcoming notebook.
❌ 08-compare-molecularsubtypes-tp53scores.R and 09-compare-histologies.R are not run since they are not really relevant here.
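
To make the 01-apply-classifier.py change above concrete, here is a minimal sketch of how two such flags and the output naming could fit together. This is not the code in this PR: the flag names (`--tumor-purity-filter`, `--tumor-purity-file`), the ID column name, the `_tumor-purity-threshold` file suffix, and the use of argparse are all assumptions made for illustration.

```python
import argparse
import csv
from pathlib import Path


def parse_args(argv=None):
    """Parse the two illustrative flags (names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Apply classifier (sketch)")
    parser.add_argument(
        "--tumor-purity-filter",
        action="store_true",
        help="Turn on tumor purity filtering",
    )
    parser.add_argument(
        "--tumor-purity-file",
        type=Path,
        default=None,
        help="TSV file with the IDs to filter to",
    )
    args = parser.parse_args(argv)
    # The ID file only makes sense when filtering is switched on.
    if args.tumor_purity_filter and args.tumor_purity_file is None:
        parser.error("--tumor-purity-file is required with --tumor-purity-filter")
    return args


def read_ids_to_keep(tsv_path, id_column="Kids_First_Biospecimen_ID"):
    """Return the set of IDs listed in the TSV (column name is an assumption)."""
    with open(tsv_path, newline="") as handle:
        return {row[id_column] for row in csv.DictReader(handle, delimiter="\t")}


def output_name(base_name, filtered):
    """Tag the output file name so filtered results are identifiable."""
    path = Path(base_name)
    suffix = "_tumor-purity-threshold" if filtered else ""
    return f"{path.stem}{suffix}{path.suffix}"
```

With filtering on, `output_name("scores.tsv", True)` would give `scores_tumor-purity-threshold.tsv`, while the default pipeline keeps `scores.tsv`, so the two sets of results cannot overwrite each other.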

This script is documented in the README and is also run in CI. After running it and generating result files that can be analyzed, I also ran the normal pipeline again to ensure notebooks are rendered from the full dataset.

I'll request review once checks pass!

sjspielman commented 1 year ago

Since I'll be out for a couple of days, I just wanted to note where this is heading next. This notebook is still draft-y and not part of this PR, but I'm sharing it in case anyone is curious. Link to the current notebook Rmd: https://github.com/sjspielman/OpenPBTA-analysis/blob/0a3202f71a73dad7dedec6a82612e0f74edb493d/analyses/tp53_nf1_score/10-tp53-tumor-purity-threshold.Rmd. HTML for download: 10-tp53-tumor-purity-threshold.nb.html.zip

Importantly, I uncovered a couple of areas where we reported outdated P-values, and 1-2 other small inconsistencies in the MS that we should keep an eye on.

sjspielman commented 1 year ago

This is now ready for another look!

The biggest change is that all associated results from this new pipeline will live in results/tumor-purity-threshold/ along with the notebooks. At first I did this just for notebooks as suggested, but then I decided a bit more organization would be nice since there are many result files. This involved code updates such as:
- As suggested, I now use the output_file argument in rmarkdown::render() to specify separate HTML outputs.
- Within the threshold logic in the associated scripts/Rmds, I made sure results_dir points to the right directory, while still making sure polyA results are always read in from the primary results/ directory (since polyA wasn’t invited to the tumor purity party).
- In the 06 Python script, I did have to add some optparse options and function arguments to handle this. I set their defaults to what the main pipeline uses, which was hardcoded in here before these changes.
The diffs in results/tp53_scores_vs_molecular_subtype_Ependymal_tumor.tsv (and its associated plot plots/tp53_scores_vs_molecular_subtype_Ependymal_tumor.png) are present because there is one fewer molecular subtype at v23 than when this module was last run, so these results have one fewer subtype.
Something I didn’t catch before: there’s a chunk in the 05 notebook that exports several plots of tp53 altered status by cancer predisposition. I “turned this chunk off” for the tumor purity threshold pipeline, as we do not need to create those plots.
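
The results_dir handling described above can be sketched roughly as follows. This is only an illustration of the idea, not the module's code: the function name `resolve_results_dirs` and its arguments are hypothetical, and the default (threshold off) mirrors the point that defaults should match what the main pipeline uses.

```python
from pathlib import Path


def resolve_results_dirs(module_dir, tumor_purity_threshold=False):
    """Pick where stranded and polyA results live (hypothetical helper).

    Stranded results move to results/tumor-purity-threshold/ when the
    threshold pipeline is running; polyA results always come from the
    primary results/ directory because polyA data is not filtered.
    """
    primary = Path(module_dir) / "results"
    if tumor_purity_threshold:
        stranded = primary / "tumor-purity-threshold"
    else:
        stranded = primary
    return {"stranded": stranded, "polya": primary}
```

Calling `resolve_results_dirs("analyses/tp53_nf1_score", tumor_purity_threshold=True)` would point stranded results at `results/tumor-purity-threshold/` inside the module while leaving the polyA directory unchanged.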

AlexsLemonade / OpenPBTA-analysis

Update TP53 module to accommodate tumor purity filtering #1664