This implementation should be considered outdated, since it has been revised by ticket #205. For releasing the bundle download feature to a Bioloop instance, follow the release steps documented in #205 instead of the ones below.
This will require the following steps to be performed before/after release:
Add the following properties to `production.py`:
- `config['paths']['RAW_DATA']['bundle']`
- `config['paths']['DATA_PRODUCT']['bundle']`
The values for these properties will depend on the environment. They are evaluated as:

```
config['paths']['RAW_DATA']['bundle']     -> config['paths']['RAW_DATA']['stage'] / bundles
config['paths']['DATA_PRODUCT']['bundle'] -> config['paths']['DATA_PRODUCT']['stage'] / bundles
```

As examples, these values would resolve to the following in the Bioloop dev env:

```
config['paths']['RAW_DATA']['bundle']     -> /N/scratch/scadev/bioloop/dev/staged/raw_data/bundles
config['paths']['DATA_PRODUCT']['bundle'] -> /N/scratch/scadev/bioloop/dev/staged/data_product/bundles
```
Note: For Bioloop instances having multiple worker nodes (like CPA), these properties will need to be configured on the node that runs the `stage_dataset`, `validate_dataset`, and `setup_dataset_download` steps.
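For illustration, a minimal sketch of what these additions to `production.py` could look like, assuming `config` is a plain nested dict of string paths (as the resolved examples above suggest):

```python
from pathlib import Path

# Stand-in for the existing config structure (values are the dev-env examples above).
config = {
    'paths': {
        'RAW_DATA': {'stage': '/N/scratch/scadev/bioloop/dev/staged/raw_data'},
        'DATA_PRODUCT': {'stage': '/N/scratch/scadev/bioloop/dev/staged/data_product'},
    },
}

# The two new properties, derived from the corresponding 'stage' paths.
config['paths']['RAW_DATA']['bundle'] = str(
    Path(config['paths']['RAW_DATA']['stage']) / 'bundles')
config['paths']['DATA_PRODUCT']['bundle'] = str(
    Path(config['paths']['DATA_PRODUCT']['stage']) / 'bundles')
```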
Run a Prisma migration so that the new `bundle` table is created:

```bash
npx prisma migrate deploy
```
Deploy the application.

Set up the `populate_bundles` script. Ensure that `workers/ecosystem.config.js` has the config for the script:

```js
...,
{
  name: "populate_bundles",
  script: "python",
  args: "-u -m workers.scripts.populate_bundles",
  watch: false,
  interpreter: "",
  log_date_format: "YYYY-MM-DD HH:mm Z",
  error_file: "../logs/workers/populate_bundles.err",
  out_file: "../logs/workers/populate_bundles.log",
  autorestart: false,
}
...
```
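Note that `autorestart: false` is what makes this entry safe to leave in place temporarily: pm2 will not re-launch the script after it exits, so the population run only happens when the process is started or restarted explicitly.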
For Bioloop instances having multiple worker nodes (like CPA), this script will need to be run on the node that runs the `stage_dataset`, `validate_dataset`, and `setup_dataset_download` steps.
Restart pm2. After pm2 is restarted, the script `workers/workers/scripts/populate_bundles.py` will kick off automatically. This script will populate the `bundle` table for each dataset, and unstage all datasets. These datasets will have to go through the `stage_dataset` > `validate_dataset` > `setup_dataset_download` workflow for the bundles to become available for download. A rough sketch of what the script does is included after this paragraph.
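For orientation only, the outline below is inferred from the description above; every helper in it is a placeholder, not actual Bioloop code (the real script lives at `workers/workers/scripts/populate_bundles.py`):

```python
# Illustrative outline of populate_bundles.py -- the helper names here are
# placeholders, not Bioloop's real API.

def get_all_datasets() -> list[dict]:
    """Placeholder: fetch every dataset known to this Bioloop instance."""
    return []

def create_bundle_record(dataset: dict) -> None:
    """Placeholder: insert a row into the new `bundle` table for this dataset."""

def unstage_dataset(dataset: dict) -> None:
    """Placeholder: unstage the dataset, forcing it back through the
    stage_dataset > validate_dataset > setup_dataset_download workflow."""

def main() -> None:
    for dataset in get_all_datasets():
        create_bundle_record(dataset)
        unstage_dataset(dataset)

if __name__ == "__main__":
    main()
```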
Steps (after restarting pm2):

```bash
pm2 ls
# note down the id of the populate_bundles task
pm2 restart [id of populate_bundles task]
```
Remove the entry for `populate_bundles` from `workers/ecosystem.config.js`.
In a future release, remove the now-redundant `bundle_size` column from the `dataset` table, and remove usages of `bundle_size` from the project as well. At the time of writing, Bioloop has been updated to use the size from the `bundle` table instead of from the `dataset` table, so any usages of `bundle_size` simply need to be removed. The original PR issued for ticket #102 (#176) can help with determining the places in the code from which `bundle_size` should be removed.
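As a purely illustrative example of the kind of change involved (the attribute names here are assumptions, not Bioloop's actual models):

```python
# Illustrative only: `dataset` stands in for a row fetched from the database.
dataset = {"bundle_size": 1024, "bundle": {"size": 1024}}

# Before: bundle size read from the soon-to-be-removed dataset.bundle_size column
size_before = dataset["bundle_size"]

# After: bundle size read from the related row in the new `bundle` table
size_after = dataset["bundle"]["size"]
```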
In ticket #102 we added the functionality to persist bundle metadata in a separate table. In that ticket, we settled on an initial approach for populating bundle metadata, which downloads bundles from the SDA, computes their size, checksum, etc., and uses that info to persist the bundle metadata.
Since downloading all datasets from the SDA can take a long time, we should rewrite this script so that it can populate bundles without needing to download datasets.
`populate_bundles.py` was the original script introduced in ticket #102, which downloaded datasets from the SDA.
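As a rough sketch of the proposed rewrite: if the SDA (or Bioloop's own records) already tracks each archived bundle's size and checksum, the script could persist those values directly instead of downloading and hashing every bundle. All helpers and field names below are hypothetical placeholders; no such API is confirmed in this ticket:

```python
# Hypothetical no-download approach -- every function here is a placeholder.

def sda_get_size(sda_path: str) -> int:
    """Placeholder: look up an archived bundle's size without downloading it."""
    raise NotImplementedError

def sda_get_checksum(sda_path: str) -> str:
    """Placeholder: look up an archived bundle's checksum without downloading it."""
    raise NotImplementedError

def bundle_metadata(dataset: dict) -> dict:
    """Build the bundle row for a dataset from archive metadata alone."""
    sda_path = dataset["sda_path"]  # assumed field name
    return {
        "dataset_id": dataset["id"],             # assumed field name
        "size": sda_get_size(sda_path),          # no download required
        "checksum": sda_get_checksum(sda_path),  # no download required
    }
```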