jaclyn-taroni commented 2 years ago

The purpose of this issue is to outline the order in which things need to be reviewed and run prior to the release of v22.

1. Reviews before release

Pull requests that are marked as blocking release and review before release should be reviewed & ideally merged in the following order. cc: @sjspielman

[x] Refactoring the Docker image build and fixing the push to Dockerhub in CI: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1425

Analysis files generation

[x] #1400
[x] #1419
[x] #1420 – depends on #1419
[x] #1421 ✅ Approved – depends on #1419 and #1420
[x] #1412 includes the changes that come from running scripts/generate-analysis-files-for-release.sh

Subtyping modules need to use `data/`

[x] #1413
[x] #1415
[x] #1418
[x] #1414

Subtype script & docs

[x] #1422 ✅ Approved – Running these changes is blocked by the above section, so let’s review and merge those first!
[ ] #1423 — this doesn’t strictly need to be reviewed prior to release because this is just docs, but it would make sense conceptually.

2. Running subtyping

Once #1422 is in, we can use the PBTA data and analysis files from v22 to run scripts/run-for-subtyping.sh. We’ll need to create a branch to commit back any changes to files in the repository (akin to what’s happening with #1412).

Then we’ll upload pbta-histologies.tsv to S3 (related: #1424) .

3. Creating CI files

Once all the v22 files are on S3, we’re ready to generate the CI files (#1369). Here’s what we need to do:

Create a branch to update the create-subset-files module as required to create the CI files. I’m going to call this v22-ci for illustrative purposes.

Specifically, we need to update this line to release-v22-20220505 reflect the new release: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/e72c11dbebed2377072df43150fd15f2ffa0262a/analyses/create-subset-files/create_subset_files.sh#L10

Then we should check to make sure it runs locally without issue with:
```
RUN_LOCAL=1 ./analyses/create-subset-files/create_subset_files.sh
```
Start an AWS instance with at least 128 GB RAM
Clone AlexsLemonade/OpenPBTA-analysis and checkout v22-release (https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1424).
Run bash download-data.sh to obtain the v22 data while on that branch.
Add whatever the relevant remote is (whichever fork created v22-ci) and checkout v22-ci which contains the required changes to generate the v22 files for CI.
Run ./analyses/create-subset-files/create_subset_files.sh.
Snag the files in data/testing/release-v22-20220505; they are what we need for CI.
Commit to v22-ci + push the changes to analyses/create-subset-files/biospecimen_ids_for_subset.RDS
Upload the CI files to https://s3.amazonaws.com/d3b-openaccess-us-east-1-prd-pbta/data/testing
File a PR with the changes in v22-ci

We’ll use this PR to make sure there are no issues with the CI files. If there are issues, we may need to create a series of branches to fix them. If there are no issues, we can merge this.

4. Cutting a release

1424 gets updated to include the required documentation & merged

jharenza commented 2 years ago

@zhangb1 will work on #3 above when it's time. Cc @yuankunzhu

jaclyn-taroni commented 2 years ago

Wanted to note that we had some issue with medulloPackage not being installed on some Docker containers if the layer was cached (example linked over on https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1425). When I went to try and look into that, it took almost 3 hours to build the image alone. That means that we would definitely have an issue with timeouts in CI and that could be a problem as we try to get a bunch of stuff in! So what I did in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1425/commits/43fa1602430cbb846cebb45370b72605bf4f4e3a was to consolidate all of the R package installs from Bioconductor and CRAN. We can see if that speeds building the image up. If it does, we might want that to be the first thing to go in. #1425 is a draft for now.

jaclyn-taroni commented 2 years ago

The changes related to the analysis files generation and subtyping can go in independently. Since some of the PRs ensuring that subtyping modules are already approved & ready to merge, I'm going to start merging them.

jaclyn-taroni commented 2 years ago

Updated the initial comment, but to summarize what comes next:

1425 should be reviewed and go in to fix the deploy step in CI
1420 needs to be re-reviewed, approved, and merged
1421 needs to be merged (currently approved & comments addressed!)
We should probably rerun things with the code changes (#1412) tbh
1414 needs to be re-reviewed, approved, and merged
1422 needs to be merged (currently approved and comments addressed)
We can start on everything else (e.g., running subtyping)

migbro commented 2 years ago

In progress - just a few notes as I work on this

It seems step one is really a summary, and not really a step. Doing that before the rest is probably not gonna pan out
With a fresh ec2 as a starting point, def gotta download first before running that test
This script create_subset_files.sh seems to look for a structure relative to itself, so gotta cd to it's location first before running

After making those adjustments, I got the error:

Reading in ../../data/release-v22-20220505/pbta-sv-manta.tsv.gz ...
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells    619873   33.2   11776407   629.0    9048557   483.3
Vcells 269327097 2054.9 2323008920 17723.2 2753752554 21009.5
Error in sample.int(length(x), size, replace, prob) :
  invalid first argument
Calls: sample -> sample.int
Execution halted

jharenza commented 2 years ago

Ok, this may be coming from not having consensus_seg_with_status.tsv in the subset, but odd that this is a manta error. Updating the code to capture the new file and testing locally now.

jharenza commented 2 years ago

After updating the code to add the additional CNV file, I ran using

RUN_LOCAL=1 ./analyses/create-subset-files/create_subset_files.sh

but gets killed:

create_subset_files.sh: line 51:  7389 Killed                  Rscript --vanilla 01-get_biospecimen_identifiers.R --data_directory $FULL_DIRECTORY --output_file $BIOSPECIMEN_FILE --num_matched $NUM_MATCHED --local $RUN_LOCAL

If I try to just copy and not subset, I also get an error about directory creation, which @migbro also mentioned he didn't have in his run.

root@334c2bfa1312:/home/rstudio/kitematic/analyses/create-subset-files# bash create_subset_files.sh 
cp: target '../../data/testing/release-v22-20220505' is not a directory

2:55 but also if i try to not to subset, just to copy, it tells me dir not found root@334c2bfa1312:/home/rstudio/kitematic/analyses/create-subset-files# bash create_subset_files.sh cp: target '../../data/testing/release-v22-20220505' is not a directory

migbro commented 2 years ago

I think I am ready to push, however, I don't have that kind of access to this project. Can I be granted such access please? Thanks!

AlexsLemonade / OpenPBTA-analysis

Plan: Road to v22 #1426

1. Reviews before release

Analysis files generation

Subtyping modules need to use `data/`

Subtype script & docs

2. Running subtyping

3. Creating CI files

4. Cutting a release

1424 gets updated to include the required documentation & merged

1425 should be reviewed and go in to fix the deploy step in CI

1420 needs to be re-reviewed, approved, and merged

1421 needs to be merged (currently approved & comments addressed!)

1414 needs to be re-reviewed, approved, and merged

1422 needs to be merged (currently approved and comments addressed)

AlexsLemonade / OpenPBTA-analysis

Plan: Road to v22 #1426

1. Reviews before release

Analysis files generation

Subtyping modules need to use data/

Subtype script & docs

2. Running subtyping

3. Creating CI files

4. Cutting a release

1424 gets updated to include the required documentation & merged

1425 should be reviewed and go in to fix the deploy step in CI

1420 needs to be re-reviewed, approved, and merged

1421 needs to be merged (currently approved & comments addressed!)

1414 needs to be re-reviewed, approved, and merged

1422 needs to be merged (currently approved and comments addressed)

Subtyping modules need to use `data/`