AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Create bash script for generating analysis files, separate out subtyping steps #1399

Closed jaclyn-taroni closed 2 years ago

jaclyn-taroni commented 2 years ago

Related to a comment https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1286#issuecomment-1121072097

We're going to need to rerun subtyping ahead of release v22 (#1389). I'm proposing that we write a bash script that runs all the required molecular subtyping steps. It would probably live in the root of analyses and get documented in analyses/README.md.

I can start on this this morning.

jaclyn-taroni commented 2 years ago

New plan! This exists: scripts/run-for-subtyping.sh

But our subtyping modules are inconsistent in whether they rely on data/ or analyses/ files, and that script currently does two things:

  1. Generates analysis or derived files to be included in a release
  2. Performs subtyping

I am going to split these two things up such that the procedure for generating a release is:

  1. Generate analysis or derived files to be included in a release (new script!)
  2. Add analysis or derived files to the release (i.e., data/)
  3. Generate the subtypes, where to accomplish that I need to make the following updates:
    • Make sure all the subtyping modules consistently use data/
    • Make sure the run-for-subtyping.sh script runs steps with files that don't get included in the data download (e.g., TP53 scores)

jaclyn-taroni commented 2 years ago

Okay I have put something together on this branch https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/jaclyn-taroni/scripting-subtyping

Summary of changes

Changes from #1400 are also included.

What this approach gets us

If workflow output files change during revisions, we have to run both scripts. If some of the underlying analysis modules change, we have everything that goes into the data download in one place. BUT if neither the workflow files nor the underlying analysis modules change, we only have to run the subtyping script when we change subtyping modules, which lets us skip some of the time- and compute-intensive steps.
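The reasoning above amounts to a small decision table. Here is a minimal sketch of it, not code from the branch: the function name and its yes/no arguments are hypothetical, and only the two script paths come from this thread. It also assumes that a change to the underlying analysis modules means rerunning subtyping afterwards, which the comment implies but does not state outright.

```shell
#!/usr/bin/env bash
# Sketch only: scripts_to_run and its arguments are hypothetical names.
# Only the two script paths are taken from this thread.

# Usage: scripts_to_run <workflow_changed> <analysis_changed> <subtyping_changed>
# Each argument is "yes" or "no". Prints the scripts to run, in order.
scripts_to_run() {
  local workflow_changed="$1"
  local analysis_changed="$2"
  local subtyping_changed="$3"

  if [ "$workflow_changed" = "yes" ] || [ "$analysis_changed" = "yes" ]; then
    # Inputs to the data download changed: regenerate the analysis files,
    # then redo subtyping on top of them (assumption, see lead-in).
    echo "scripts/generate-analysis-files-for-release.sh"
    echo "scripts/run-for-subtyping.sh"
  elif [ "$subtyping_changed" = "yes" ]; then
    # Only subtyping modules changed: skip the expensive generation step.
    echo "scripts/run-for-subtyping.sh"
  fi
}
```

For example, `scripts_to_run no no yes` prints only the subtyping script, which is the case where the time- and compute-intensive steps get skipped.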

Probably the "right" way to do this would be to use Snakemake but here we are.

Side note: I think this gets us part of the way to addressing @sjspielman's comment here https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1286#issuecomment-1121072097. We'd have to add a notion of an "other modules" script to fully address it though.

Next steps

Here's what I think should happen next (high-level, caveats below):

  1. We start a v22 release that has all of the PBTA data files (i.e., upstream files) included
  2. We run scripts/generate-analysis-files-for-release.sh, which should generate all the analysis files for a release (⚠️ I have only tested the local part of this script so far) and put them in scratch/analysis_files_for_release, and commit any changes to files that are included in the repository
  3. We add all the analysis files to the release
  4. We run scripts/run-for-subtyping.sh and commit any changes to files that are included in the repository
  5. We add pbta-histologies.tsv to the release
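Steps 2 through 5 could be strung together roughly as follows. This is a sketch under stated assumptions rather than anything on the branch: `release_steps`, `run_step`, `DRY_RUN`, and both arguments (the release staging directory and the location of the finished pbta-histologies.tsv) are hypothetical. The two script paths and scratch/analysis_files_for_release come from the list above. Committing changed files is left as a manual step.

```shell
#!/usr/bin/env bash
# Sketch only: function names, DRY_RUN, and both arguments are hypothetical.

run_step() {
  # In dry-run mode (the default here), print the command instead of running it.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: release_steps <release_dir> <histologies_file>
release_steps() {
  local release_dir="$1"      # where the release is staged (assumed)
  local histologies_src="$2"  # where pbta-histologies.tsv ends up (assumed)

  # Step 2: generate analysis files into scratch/analysis_files_for_release
  run_step bash scripts/generate-analysis-files-for-release.sh || return 1
  # Step 3: add the analysis files to the release
  run_step cp -R scratch/analysis_files_for_release/. "$release_dir"/ || return 1
  # Step 4: run subtyping now that the release has the analysis files
  run_step bash scripts/run-for-subtyping.sh || return 1
  # Step 5: add the histologies file to the release
  run_step cp "$histologies_src" "$release_dir"/ || return 1
}
```

Calling `release_steps <release_dir> <histologies_file>` with `DRY_RUN` unset only prints the four commands in order; setting `DRY_RUN=0` would execute them and stop at the first failure.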

Caveats:

We'd have to use an instance with a lot of RAM to get this done (cc: @sjspielman what did you end up using most recently for figure generation?)

sjspielman commented 2 years ago

So far I've only done local runs of generate-figures.sh, which doesn't need anything special. To create S2 figures, I had to rerun the snv-callers module. We spun up a 128 GB instance for this, but docker stats never showed RAM getting above maybe 30-40 GB.
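To put a firmer number on peak RAM than eyeballing docker stats, something like the following could sample it during a run. This is a rough sketch, not what was actually used: the function names, the sample count, and the interval are hypothetical, and it assumes memory is reported in binary units such as MiB and GiB (for example "30.5GiB / 125.8GiB").

```shell
#!/usr/bin/env bash
# Sketch only: to_mib and watch_peak_mem are hypothetical helper names.

# Convert a docker-style memory value such as "30.5GiB" to whole MiB.
to_mib() {
  echo "$1" | awk '{
    value = $1 + 0                        # leading number, e.g. 30.5
    unit = $1; gsub(/[0-9.]/, "", unit)   # trailing unit, e.g. GiB
    if (unit == "GiB") value *= 1024
    else if (unit == "TiB") value *= 1048576
    else if (unit == "KiB") value /= 1024
    else if (unit == "B") value /= 1048576
    printf "%d\n", value                  # MiB passes through unchanged
  }'
}

# Usage: watch_peak_mem <container> [samples] [interval_seconds]
# Polls docker stats a fixed number of times and reports the largest reading.
watch_peak_mem() {
  local container="$1"
  local samples="${2:-60}"
  local interval="${3:-5}"
  local peak=0
  local i=0
  local usage current
  while [ "$i" -lt "$samples" ]; do
    usage="$(docker stats --no-stream --format '{{.MemUsage}}' "$container")" || break
    current="$(to_mib "${usage%% *}")"    # keep the part before " / limit"
    if [ "$current" -gt "$peak" ]; then peak="$current"; fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "peak: ${peak} MiB"
}
```

Sampling only catches what happens at the polling instants, so short spikes between samples can be missed; a shorter interval narrows that gap.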