Update logic in modules where analysis files included in the data releases are generated

jaclyn-taroni commented 2 years ago

Splitting up changes described in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1399#issuecomment-1124329628.

There are a number of analysis files that are included in data downloads now. If you look at the current version of scripts/run-for-subtyping.sh, you will see that some of these are currently run in that script. However, I will split up what is required for generating analysis files for release from what is required for subtyping in subsequent pull requests.

All of the changes included here pertain to modules that will be run for generating analysis files. (Subtyping modules, where possible, will use data/ [see: #1413, #1414, #1415, and #1418].) Because the analysis file generation will happen prior to subtyping, these steps still need to use the pbta-histologies-base.tsv file. The logic I am adding or modifying will allow us to do that in subsequent PR(s).
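To make that a bit more concrete, here is a minimal sketch of how a module's shell script could pick between the two histologies files. It assumes the `OPENPBTA_BASE_SUBTYPING` environment variable (the one set in scripts/run-for-subtyping.sh) acts as the switch and that both files live in `data/`. The `RUN_FOR_SUBTYPING` name and the `select_histologies_file` helper are hypothetical and only for illustration; the actual modules may handle this differently.

```shell
#!/bin/bash
set -eu

# Assumption: OPENPBTA_BASE_SUBTYPING=1 means "use the base histologies file".
# Default to 0 when the variable is not set in the environment.
RUN_FOR_SUBTYPING=${OPENPBTA_BASE_SUBTYPING:-0}

# Hypothetical helper: print the histologies file path for a given flag value.
# The second argument is the data directory (assumed to be data/ by default).
select_histologies_file() {
  local run_for_subtyping="$1"
  local data_dir="${2:-data}"
  if [ "$run_for_subtyping" = "1" ]; then
    echo "${data_dir}/pbta-histologies-base.tsv"
  else
    echo "${data_dir}/pbta-histologies.tsv"
  fi
}

HISTOLOGY_FILE=$(select_histologies_file "$RUN_FOR_SUBTYPING")
echo "Using histologies file: ${HISTOLOGY_FILE}"
```

With something like this in place, a module would be invoked with `OPENPBTA_BASE_SUBTYPING=1` in front of its run script when generating analysis files before subtyping, and with no variable set otherwise.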

This might not be super clear at this point, so I'll go ahead and outline how data releases will work going forward after everything goes through:

1. We start a release that has all of the PBTA data files (i.e., upstream files) included.
2. We run scripts/generate-analysis-files-for-release.sh, which should generate all the analysis files for a release and put them in scratch/analysis_files_for_release, and commit any changes to files that are included in the repository. (PR with this shell script coming very soon; #1412 is an example of the commit any changes part!)
3. We add all the analysis files to the release.
4. We run scripts/run-for-subtyping.sh and commit any changes to files that are included in the repository. (PR coming soon!)
5. We add pbta-histologies.tsv to the release.

sjspielman commented 2 years ago

Quick clarification here:

> We run scripts/generate-analysis-files-for-release.sh, which should generate all the analysis files for a release and put them in scratch/analysis_files_for_release, and commit any changes to files that are included in the repository. (PR with this shell script coming very soon; https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1412 is an example of the commit any changes part!)
>
> We add all the analysis files to the release.

Are the files that get added to the release the ones in the modules, or the ones in scratch/analysis_files_for_release? My reading is that they are the same files, and that they are also placed into scratch/ to support the data release. Am I reading this right?

jaclyn-taroni commented 2 years ago

They are in both places but compiled into scratch/analysis_files_for_release for convenience (to support data release, as you say).
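As a rough illustration of the "compiled into scratch/" part, the sketch below copies a list of files into a release directory while leaving the originals in place. The `compile_analysis_files` helper is hypothetical and is not taken from the repository; the real script may gather the files differently.

```shell
# Hypothetical helper: copy each listed analysis file into the release
# directory, leaving the original in its module so it exists in both places.
# Usage: compile_analysis_files <release_dir> <file> [<file> ...]
compile_analysis_files() {
  local release_dir="$1"
  shift
  # Create the release directory if it does not exist yet
  mkdir -p "$release_dir"
  local analysis_file
  for analysis_file in "$@"; do
    cp "$analysis_file" "$release_dir/"
  done
}
```

It would be called with scratch/analysis_files_for_release as the first argument, followed by the module output files to include.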

jaclyn-taroni commented 2 years ago

collapse-rnaseq

* Uses metadata. **Changes may need to be made.** ⚠️

* Actually something seems off here. In the current `scripts/run-for-subtyping.sh` , I see this is run as: `OPENPBTA_BASE_SUBTYPING=1 ../analyses/collapse-rnaseq/run-collapse-rnaseq.sh`, but I do not see a corresponding way to accept this arg in analysis script.

Yeah, the current scripts/run-for-subtyping.sh is wrong. The only place the metadata gets used in this module is in this file: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/e72c11dbebed2377072df43150fd15f2ffa0262a/analyses/collapse-rnaseq/00-create-rsem-files.R

That file is not run via the shell script (https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/e72c11dbebed2377072df43150fd15f2ffa0262a/analyses/collapse-rnaseq/run-collapse-rnaseq.sh), because we now distribute one RSEM file for each selection strategy. I think that Rscript may be used upstream, which is to say that I am afraid to touch it, really.

AlexsLemonade / OpenPBTA-analysis

Update logic in modules where analysis files included in the data releases are generated #1419