Create bash script for generating analysis files, separate out subtyping steps

jaclyn-taroni commented 2 years ago

We're going to need to rerun subtyping ahead of release v22 (#1389). I'm proposing that we write a bash script that runs all the required molecular subtyping steps. It would probably live in the root of analyses and get documented in analyses/README.md.

I can start this this morning.

jaclyn-taroni commented 2 years ago

New plan! This exists: scripts/run-for-subtyping.sh

But the subtyping modules are inconsistent in whether they rely on data/ or analyses/ files, and that script currently does two things:

Generates analysis or derived files to be included in a release
Performs subtyping

I am going to split these two things up such that the procedure for generating a release is:

1. Generate analysis or derived files to be included in a release (new script!)
2. Add analysis or derived files to the release (i.e., data/)
3. Generate the subtypes. To accomplish that, I need to make the following updates:
   - Make sure all the subtyping modules consistently use data/
   - Make sure the run-for-subtyping.sh script runs steps with files that don't get included in the data download (e.g., TP53 scores)

jaclyn-taroni commented 2 years ago

Okay I have put something together on this branch https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/jaclyn-taroni/scripting-subtyping

Summary of changes

The subtyping modules now all use data/ files: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/cfaed472c4ff95f8571562343f9418469bfcd24a, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/efb9d6610cb00ff125fdef9249ba00baa0d21ced, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/1bdc82327a971d89cefb76ebe2a5d78205e512bc, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/273f2444127a5ba333e22bdb639954dd3fff2dbf
There are a number of analysis files that are included in data downloads now. Many of the modules that generate them were previously run in scripts/run-for-subtyping.sh. Because I've added scripts/generate-analysis-files-for-release.sh, I moved those modules out of scripts/run-for-subtyping.sh and updated them to use a new environment variable called RUN_FOR_RELEASE: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/95b5b8b34178739e6efcd3456a66eec19a4a03e8, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/d5fc917f24a299d92d104384305159e00d584587, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/4865438ff3a1191e350af58de35732216806be3d
As mentioned above, I've added scripts/generate-analysis-files-for-release.sh specifically for generating analysis or derived files to be included in a new release: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/5c295e990dc378590eb974bd35c2d16e3372bc73
There are 3 modules whose output is used in subtyping but is not included in the download. These modules do need to be run in scripts/run-for-subtyping.sh. In two cases, I've simplified or introduced logic for when those get run for subtyping: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/56fb898eed437217b0ecfbf5ad60620eed774f7a, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/8d44a6c227a402b504d8f763f8712e3227ebb7f4
I've taken out the analysis steps whose files are included in the data download and added chromosomal-instability to scripts/run-for-subtyping.sh: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/f95853e50c92fa850a156f00cd914808e88cf098
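
To illustrate the RUN_FOR_RELEASE idea, here is a minimal sketch of how a module's run script might branch on that variable. Only the variable name comes from the changes above; the `module_mode` function, the default of 0, and the two mode labels are hypothetical and may not match what the linked commits actually do.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: decide which part of a module to run.
# Argument: the value of RUN_FOR_RELEASE (assumed here to be 0 or 1).
module_mode() {
  local run_for_release=${1:-0}
  if [[ "$run_for_release" -gt 0 ]]; then
    # Only produce the files that go into the data release
    echo "release-files-only"
  else
    # Run every step of the module
    echo "full-module"
  fi
}

# Treat an unset variable as "not a release run" (an assumption of this sketch)
module_mode "${RUN_FOR_RELEASE:-0}"
```

Under these assumptions, something like `RUN_FOR_RELEASE=1 bash run-module.sh` (script name made up) would take the release-only branch, and a plain call would run the full module.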

Changes from #1400 are also included.

What this approach gets us

If, during revisions, workflow output files change, we have to run both scripts. If some of the underlying analysis modules change, we have everything that goes into the data download in one place. BUT if neither the workflow files nor the underlying analysis modules change, we only have to run the subtyping script when we change subtyping modules, which lets us skip some of the time- and compute-intensive steps.

Probably the "right" way to do this would be to use Snakemake but here we are.

Side note: I think this goes part of the way toward addressing @sjspielman's comment here https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1286#issuecomment-1121072097. We'd have to add a notion of an "other modules" script to fully address it though.

Next steps

Here's what I think should happen next (high-level, caveats below):

1. We start a v22 release that has all of the PBTA data files (i.e., upstream files) included
2. We run scripts/generate-analysis-files-for-release.sh, which should generate all the analysis files for a release (⚠️ I have only tested the local part of this script so far) and put them in scratch/analysis_files_for_release, and commit any changes to files that are included in the repository
3. We add all the analysis files to the release
4. We run scripts/run-for-subtyping.sh and commit any changes to files that are included in the repository
5. We add pbta-histologies.tsv to the release
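
Roughly, the two scripted steps above could be chained like this. This is only a sketch: the `DRY_RUN` switch and the `run_step` helper are hypothetical, invoking the scripts with `bash` from the repository root is an assumption, and the manual release and download steps in between appear only as comments.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical switch for this sketch: with the default of 1, commands are
# printed instead of executed, so the ordering can be checked cheaply.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run_step() {
  if [[ "$DRY_RUN" -eq 1 ]]; then
    echo "[dry run] $*"
  else
    "$@"
  fi
}

# (Manual) start the release with the upstream PBTA data files, then download.

# Generate the analysis files for the release
run_step bash scripts/generate-analysis-files-for-release.sh

# (Manual) add the analysis files to the release, then download again.

# Generate the subtypes
run_step bash scripts/run-for-subtyping.sh

# (Manual) add pbta-histologies.tsv to the release.
```

Setting `DRY_RUN=0` would actually run both scripts, which per the caveats below likely needs an instance with a lot of RAM.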

Caveats:

We'd have to use an instance with a lot of RAM to get this done (cc: @sjspielman what did you end up using most recently for figure generation?)
We'd need to figure out how to handle the md5 situation for the download. Specifically, I'd imagine we'd want to only include the PBTA data files and their checksums at step 1, run the data download, run step 2. Similarly, we'd add the analysis files and their checksums to the download at step 3, and run the data download before step 4.
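
For the checksum part, one possible sketch of regenerating and verifying checksums for whatever files are present at a given stage. The `md5sum.txt` file name and both helper functions are assumptions for illustration rather than the project's actual release tooling, and the commands assume GNU coreutils and findutils.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: write DIR/md5sum.txt covering every other regular
# file directly inside DIR.
write_checksums() {
  local dir=$1
  (
    cd "$dir"
    find . -maxdepth 1 -type f ! -name md5sum.txt -print0 \
      | sort -z \
      | xargs -0 -r md5sum > md5sum.txt
  )
}

# Hypothetical helper: succeed only if every file still matches md5sum.txt.
verify_checksums() {
  local dir=$1
  ( cd "$dir" && md5sum --check --quiet md5sum.txt )
}

# Small demonstration in a temporary directory with a made-up file
demo_dir=$(mktemp -d)
echo "example" > "${demo_dir}/example-analysis-file.tsv"
write_checksums "$demo_dir"
verify_checksums "$demo_dir" && echo "checksums OK"
rm -r "$demo_dir"
```

The idea would be to rerun something like `write_checksums` each time files are added to the release, so the checksum file only ever lists what is available for download at that stage.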

sjspielman commented 2 years ago

> We'd have to use an instance with a lot of RAM to get this done (cc: @sjspielman what did you end up using most recently for figure generation?)

So far I've only done local runs of generate-figures.sh, which doesn't need anything special. To create S2 figures, I had to rerun the snv-callers module. We spun up a 128 GB instance for this, but docker stats never showed RAM getting above maybe 30-40 GB.
