Closed jaclyn-taroni closed 2 years ago
@zhangb1 will work on #3 above when it's time. Cc @yuankunzhu
Wanted to note that we had some issue with medulloPackage
not being installed on some Docker containers if the layer was cached (example linked over on https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1425). When I went to try and look into that, it took almost 3 hours to build the image alone. That means that we would definitely have an issue with timeouts in CI and that could be a problem as we try to get a bunch of stuff in! So what I did in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1425/commits/43fa1602430cbb846cebb45370b72605bf4f4e3a was to consolidate all of the R package installs from Bioconductor and CRAN. We can see if that speeds building the image up. If it does, we might want that to be the first thing to go in. #1425 is a draft for now.
The changes related to the analysis files generation and subtyping can go in independently. Since some of the PRs ensuring that subtyping modules are already approved & ready to merge, I'm going to start merging them.
Updated the initial comment, but to summarize what comes next:
In progress - just a few notes as I work on this
create_subset_files.sh
seems to look for a structure relative to itself, so gotta cd
to it's location first before runningAfter making those adjustments, I got the error:
Reading in ../../data/release-v22-20220505/pbta-sv-manta.tsv.gz ...
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 619873 33.2 11776407 629.0 9048557 483.3
Vcells 269327097 2054.9 2323008920 17723.2 2753752554 21009.5
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Calls: sample -> sample.int
Execution halted
Ok, this may be coming from not having consensus_seg_with_status.tsv
in the subset, but odd that this is a manta error. Updating the code to capture the new file and testing locally now.
After updating the code to add the additional CNV file, I ran using
RUN_LOCAL=1 ./analyses/create-subset-files/create_subset_files.sh
but gets killed:
create_subset_files.sh: line 51: 7389 Killed Rscript --vanilla 01-get_biospecimen_identifiers.R --data_directory $FULL_DIRECTORY --output_file $BIOSPECIMEN_FILE --num_matched $NUM_MATCHED --local $RUN_LOCAL
If I try to just copy and not subset, I also get an error about directory creation, which @migbro also mentioned he didn't have in his run.
root@334c2bfa1312:/home/rstudio/kitematic/analyses/create-subset-files# bash create_subset_files.sh
cp: target '../../data/testing/release-v22-20220505' is not a directory
2:55 but also if i try to not to subset, just to copy, it tells me dir not found root@334c2bfa1312:/home/rstudio/kitematic/analyses/create-subset-files# bash create_subset_files.sh cp: target '../../data/testing/release-v22-20220505' is not a directory
I think I am ready to push, however, I don't have that kind of access to this project. Can I be granted such access please? Thanks!
The purpose of this issue is to outline the order in which things need to be reviewed and run prior to the release of v22.
1. Reviews before release
Pull requests that are marked as
blocking release
andreview before release
should be reviewed & ideally merged in the following order. cc: @sjspielmanAnalysis files generation
[x] #1400
[x] #1419
[x] #1420 – depends on #1419
[x] #1421 ✅ Approved – depends on #1419 and #1420
[x] #1412 includes the changes that come from running
scripts/generate-analysis-files-for-release.sh
Subtyping modules need to use
data/
[x] #1413
[x] #1415
[x] #1418
[x] #1414
Subtype script & docs
2. Running subtyping
Once #1422 is in, we can use the PBTA data and analysis files from v22 to run
scripts/run-for-subtyping.sh
. We’ll need to create a branch to commit back any changes to files in the repository (akin to what’s happening with #1412).Then we’ll upload
pbta-histologies.tsv
to S3 (related: #1424) .3. Creating CI files
Once all the v22 files are on S3, we’re ready to generate the CI files (#1369). Here’s what we need to do:
Create a branch to update the
create-subset-files
module as required to create the CI files. I’m going to call thisv22-ci
for illustrative purposes.Specifically, we need to update this line to
release-v22-20220505
reflect the new release: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/e72c11dbebed2377072df43150fd15f2ffa0262a/analyses/create-subset-files/create_subset_files.sh#L10Then we should check to make sure it runs locally without issue with:
Start an AWS instance with at least 128 GB RAM
Clone
AlexsLemonade/OpenPBTA-analysis
and checkoutv22-release
(https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1424).Run
bash download-data.sh
to obtain the v22 data while on that branch.Add whatever the relevant remote is (whichever fork created
v22-ci
) and checkoutv22-ci
which contains the required changes to generate the v22 files for CI.Run
./analyses/create-subset-files/create_subset_files.sh
.Snag the files in
data/testing/release-v22-20220505
; they are what we need for CI.Commit to
v22-ci
+ push the changes toanalyses/create-subset-files/biospecimen_ids_for_subset.RDS
Upload the CI files to https://s3.amazonaws.com/d3b-openaccess-us-east-1-prd-pbta/data/testing
File a PR with the changes in
v22-ci
We’ll use this PR to make sure there are no issues with the CI files. If there are issues, we may need to create a series of branches to fix them. If there are no issues, we can merge this.
4. Cutting a release
1424 gets updated to include the required documentation & merged