Closed by jaclyn-taroni 2 years ago
New plan! This exists: `scripts/run-for-subtyping.sh`
But we have subtyping modules that are inconsistent in their reliance on `data/` vs. `analyses/` files, and that script currently does two things:
I am going to split these two things up such that the procedure for generating a release is:
data/
)data/
`run-for-subtyping.sh` script runs steps with files that don't get included in the data download (e.g., TP53 scores)
Okay, I have put something together on this branch: https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/jaclyn-taroni/scripting-subtyping
`data/` files: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/cfaed472c4ff95f8571562343f9418469bfcd24a, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/efb9d6610cb00ff125fdef9249ba00baa0d21ced, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/1bdc82327a971d89cefb76ebe2a5d78205e512bc, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/273f2444127a5ba333e22bdb639954dd3fff2dbf
`scripts/run-for-subtyping.sh` script previously. Because I've added `scripts/generate-analysis-files-for-release.sh`, I updated those modules to use a new environment variable called `RUN_FOR_RELEASE` (https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/95b5b8b34178739e6efcd3456a66eec19a4a03e8, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/d5fc917f24a299d92d104384305159e00d584587, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/4865438ff3a1191e350af58de35732216806be3d) and moved them out of `scripts/run-for-subtyping.sh`
`scripts/generate-analysis-files-for-release.sh` specifically for generating analysis or derived files to be included in a new release: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/5c295e990dc378590eb974bd35c2d16e3372bc73
`scripts/run-for-subtyping.sh`. In two cases, I've simplified or introduced logic for when those get run for subtyping: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/56fb898eed437217b0ecfbf5ad60620eed774f7a, https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/8d44a6c227a402b504d8f763f8712e3227ebb7f4
`chromosomal-instability` to `scripts/run-for-subtyping.sh`: https://github.com/jaclyn-taroni/OpenPBTA-analysis/commit/f95853e50c92fa850a156f00cd914808e88cf098
Changes from #1400 are also included.
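For readers unfamiliar with the pattern, here is a minimal sketch of how a module's shell script might branch on `RUN_FOR_RELEASE`. Only the variable name comes from the list above. The default of `0`, the convention that `1` means a release run, and the step names echoed here are assumptions for illustration, not the modules' actual logic.

```shell
#!/usr/bin/env bash
set -eu

# Print the steps a module would run. RUN_FOR_RELEASE is the variable named
# in this thread; treating "unset" as 0 and "1" as a release run is an
# assumption made for this sketch.
module_steps() {
  local run_for_release="${RUN_FOR_RELEASE:-0}"

  # Hypothetical step: produces files that ship in the data download.
  echo "generate-release-files"

  if [ "$run_for_release" != "1" ]; then
    # Hypothetical step: downstream work that a release run would skip.
    echo "run-downstream-analysis"
  fi
}

module_steps
```

Under these assumptions, calling the module with `RUN_FOR_RELEASE=1` would stop after the release files are generated, while an ordinary run would continue to the downstream step.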
If, during revisions, workflow output files change, we have to run both scripts. If some of the underlying analysis modules change, we have everything that goes into the data download in one place. BUT if workflow files don't change, and neither do the underlying analysis modules, we only have to run the subtyping script if we change subtyping modules, which lets us skip some of the time- and compute-intensive steps.
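The rerun rules in the paragraph above can be sketched as a small helper. The two script paths are the ones named in this thread; the function, its yes/no arguments, and the choice to rerun subtyping whenever the release files are regenerated (on the assumption that subtyping consumes files from the data download) are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Print the scripts to rerun, in order, given what changed.
# Each argument is "yes" or "no". This helper is hypothetical.
scripts_to_rerun() {
  local workflow_files_changed="$1"
  local analysis_modules_changed="$2"
  local subtyping_modules_changed="$3"

  if [ "$workflow_files_changed" = "yes" ] || [ "$analysis_modules_changed" = "yes" ]; then
    # Regenerate the release files first. Assumption: subtyping is rerun
    # afterwards because its inputs may have changed.
    echo "scripts/generate-analysis-files-for-release.sh"
    echo "scripts/run-for-subtyping.sh"
  elif [ "$subtyping_modules_changed" = "yes" ]; then
    # Only subtyping modules changed: skip the expensive release steps.
    echo "scripts/run-for-subtyping.sh"
  fi
}

# Example: only subtyping modules changed.
scripts_to_rerun no no yes
```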
Probably the "right" way to do this would be to use Snakemake but here we are.
Side note: I think this gets us part of the way to addressing @sjspielman's comment here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1286#issuecomment-1121072097. We'd have to add a notion of an "other modules" script to fully address it though.
Here's what I think should happen next (high-level, caveats below):
`scripts/generate-analysis-files-for-release.sh`, which should generate all the analysis files for a release (⚠️ I have only tested the local part of this script so far) and put them in `scratch/analysis_files_for_release`, and commit any changes to files that are included in the repository
`scripts/run-for-subtyping.sh` and commit any changes to files that are included in the repository
`pbta-histologies.tsv` to the release
Caveats:
We'd have to use an instance with a lot of RAM to get this done (cc: @sjspielman what did you end up using most recently for figure generation?)
So far I've only done local runs of `generate-figures.sh`, which doesn't need anything special. To create S2 figures, I had to rerun the `snv-callers` module. We spun up a 128 GB instance for this, but `docker stats` never showed RAM getting above maybe 30-40 GB.
Related to a comment https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1286#issuecomment-1121072097
We're going to need to rerun subtyping ahead of release v22 (#1389). I'm proposing that we write a bash script that runs all the required molecular subtyping steps. It would probably live in the root of `analyses` and get documented in `analyses/README.md`.
I can start this this morning.
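As a rough illustration of the proposal, a driver of this kind might look like the sketch below. It is not the script that was eventually written. The function name, the per-module entry script `run-module.sh`, and the module names in the usage comment are hypothetical placeholders.

```shell
#!/usr/bin/env bash
set -eu

# Run each listed module's entry script in the given order, stopping at the
# first failure (set -e). The entry-script name run-module.sh is a
# hypothetical placeholder, not a convention confirmed by this thread.
run_subtyping_modules() {
  local analyses_dir="$1"
  shift
  local module
  for module in "$@"; do
    echo "Running ${module}"
    bash "${analyses_dir}/${module}/run-module.sh"
  done
}

# Example (hypothetical module names):
#   run_subtyping_modules analyses module-a module-b
```

Passing the module list explicitly keeps the run order visible in one place, which matters when one subtyping step depends on the output of another.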