## Add config properties, create corresponding directories
On the worker node, add the two config properties being introduced in this issue, under both `RAW_DATA` and `DATA_PRODUCT`, to `workers/workers/config/production.py`:

```python
config['paths']['RAW_DATA']['bundle']['generate']
config['paths']['RAW_DATA']['bundle']['stage']
config['paths']['DATA_PRODUCT']['bundle']['generate']
config['paths']['DATA_PRODUCT']['bundle']['stage']
```

In the config above, the `['bundle']['generate']` property determines where the bundle will be generated by the `archive` step. The `['bundle']['stage']` property determines where the bundle will be downloaded (from SDA) and staged by the `stage` step. All four of these directories should be distinct.
The exact location of these directories will depend on the Bioloop instance being released. As an example, the Bioloop dev environment uses the following directories:
```python
# Under config['paths']['RAW_DATA']:
'bundle': {
    'generate': '/N/scratch/scadev/bioloop/dev/bundles/raw_data',
    'stage': '/N/scratch/scadev/bioloop/dev/staged/raw_data/bundles',
},

# Under config['paths']['DATA_PRODUCT']:
'bundle': {
    'generate': '/N/scratch/scadev/bioloop/dev/bundles/data_products',
    'stage': '/N/scratch/scadev/bioloop/dev/staged/data_products/bundles',
},
```
Once these directories have been settled upon, add them to `production.py`.
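To create the corresponding directories, something along these lines can be run once on each worker node. This is a minimal sketch; it assumes the merged config is importable as `workers.config.config`, which may differ per instance:

```python
from pathlib import Path

from workers.config import config  # assumed import path; adjust per instance

# Create the generate/stage directories for both dataset types, if absent.
for dataset_type in ('RAW_DATA', 'DATA_PRODUCT'):
    bundle_paths = config['paths'][dataset_type]['bundle']
    for key in ('generate', 'stage'):
        Path(bundle_paths[key]).mkdir(parents=True, exist_ok=True)
```

A plain `mkdir -p` on each path works just as well; the point is that all four paths come straight from the new config properties.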
Note: Some Bioloop instances have multiple worker nodes. In such deployments, the `archive` and `stage` steps may be running on different machines (CPA is an example of this), which means the `['bundle']['stage']` and `['bundle']['generate']` properties will refer to paths on different machines.
## Perform Prisma migrations
On the node running the API container, run within the `api` container:

```bash
npx prisma migrate deploy
```
## Deploy

On the node running the API container:

```bash
cd [app directory]
bin/deploy.sh
```
## Bundle population
The `bundle` table will need to be populated for bundle downloads to work from the browser. For this, run the `populate_bundles` script. Ensure that `workers/ecosystem.config.js` has the config for the script:

```js
...,
{
  name: "populate_bundles",
  script: "python",
  args: "-u -m workers.scripts.populate_bundles",
  watch: false,
  interpreter: "",
  log_date_format: "YYYY-MM-DD HH:mm Z",
  error_file: "../logs/workers/populate_bundles.err",
  out_file: "../logs/workers/populate_bundles.log",
  autorestart: false,
},
...
```
Restart pm2, which will kick off the `populate_bundles` script. This script will populate the `bundle` table for each dataset, and unstage all datasets. After that, these datasets will have to go through the `stage_dataset` > `validate_dataset` > `setup_dataset_download` workflow (which can be launched by operators from the UI) for the bundles to become available for download.
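The actual script ships in the repo as `workers.scripts.populate_bundles`; purely as an illustration of the behavior just described, its essentials might look like the sketch below. Every helper name here (`get_all_datasets`, `create_bundle`, `unstage_dataset`) is a hypothetical stand-in for whatever client the instance's `workers` package actually provides:

```python
# Illustrative sketch only; helper names are hypothetical stand-ins.
from workers import api  # assumed API client module


def main() -> None:
    for dataset in api.get_all_datasets():          # hypothetical helper
        # Record the dataset's bundle metadata in the new bundle table.
        api.create_bundle(dataset_id=dataset['id'])  # hypothetical helper
        # Unstage, so the dataset must pass through the staging workflow again.
        api.unstage_dataset(dataset['id'])           # hypothetical helper


if __name__ == '__main__':
    main()
```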
Steps (after restarting poetry):

```bash
cd workers
pm2 restart ecosystem.config.js
```
Verify that bundles are populated in the Postgres `bundle` table:

```sql
select *
from dataset
left join bundle on dataset.id = bundle.dataset_id;
```
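If a scripted check is preferred, a sketch along the following lines flags datasets that still lack a bundle row. It assumes `psycopg2` is available and uses placeholder connection parameters; substitute the instance's real credentials:

```python
import psycopg2  # assumed to be available in the environment running the check

# Placeholder connection parameters; use the instance's real credentials.
conn = psycopg2.connect(dbname='bioloop', user='bioloop', host='localhost')
with conn, conn.cursor() as cur:
    cur.execute("""
        select dataset.id, dataset.name
        from dataset
        left join bundle on dataset.id = bundle.dataset_id
        where bundle.dataset_id is null
    """)
    missing = cur.fetchall()

print(f"{len(missing)} dataset(s) still missing bundles: {missing}")
```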
Verify bundle downloads from the browser, via the "Download Archive - Transfer of file will use ... of bandwidth" option.

## Cleanup
If bundle downloads are working, remove (or disable) the `populate_bundles` script, so it is not launched again accidentally. This can be done by removing its entry from `workers/ecosystem.config.js`, followed by another pm2 restart.
Remove the old directories introduced in ticket #102. At the time of writing, this should only be needed for CFNDAP, since CFNDAP is the only instance where ticket #102 has been released so far. When ticket #205 is released to CFNDAP in the future, those directories will be redundant and will need to be cleaned up. These directories are (on colo24):

```
/N/scratch/radyuser/cfndap/production/stage/source_data/bundles
/N/scratch/radyuser/cfndap/production/stage/raw_data/bundles
/N/scratch/radyuser/cfndap/production/stage/data_products/bundles
```
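A plain `rm -rf` on each path works; for completeness, the equivalent as a short script (paths copied from the list above):

```python
import shutil

OLD_DIRS = [
    '/N/scratch/radyuser/cfndap/production/stage/source_data/bundles',
    '/N/scratch/radyuser/cfndap/production/stage/raw_data/bundles',
    '/N/scratch/radyuser/cfndap/production/stage/data_products/bundles',
]

for path in OLD_DIRS:
    shutil.rmtree(path, ignore_errors=True)  # no-op if already removed
```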
Remove the `bundle_size` column from the `dataset` table, and replace usages of the `dataset` table's `bundle_size` column with the `bundle` table's `size` column. At the time of writing, Bioloop code has been updated to use the size from the `bundle` table's `size` column instead of from the `dataset` table's `bundle_size` column. However, other instances of Bioloop may have other usages of the `bundle_size` column that are not present in the Bioloop repo, and therefore may not have been accounted for when developing this ticket. These usages will need to be replaced with the `bundle` table's `size` column.
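As an illustration of the replacement, assuming the API returns the related `bundle` record nested on the dataset object (the exact shape may differ per instance):

```python
# Before: bundle size read off the dataset row itself.
bundle_size = dataset['bundle_size']

# After: bundle size read from the related bundle row.
bundle_size = dataset['bundle']['size']
```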
## Background

In tickets #102 (and follow-up ticket #205) we introduced the concept of bundles and bundle downloads, with the bundle's metadata being persisted in the newly added `bundle` table. The archive, stage and download steps were rewritten to facilitate this change. The rewrite introduced a dedicated path for storing bundles, which is where the archive step creates the bundle, and where the stage/download steps try to read the bundle from.

However, on Bioloop instances with multiple worker nodes, it's possible for the archive and stage steps to run on different machines, which would result in the archive step generating the bundle on a different machine than the one the stage step reads the bundle directory from.

The process should be rewritten so that the archive step doesn't persist a path in the `bundle` table. Instead, the stage/download steps should know the expected path where the bundle is to be found through environment-dependent config.
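Under that rewrite, a stage or download step would derive the bundle's expected location locally from config, rather than trusting a path persisted by a different machine. A minimal sketch, assuming the config keys introduced above and a bundle named after the dataset (the naming convention here is an assumption):

```python
from pathlib import Path

from workers.config import config  # assumed import path; adjust per instance


def staged_bundle_path(dataset: dict) -> Path:
    """Resolve where this machine expects the dataset's bundle to be staged."""
    # dataset['type'] is 'RAW_DATA' or 'DATA_PRODUCT', matching the config keys.
    stage_dir = config['paths'][dataset['type']]['bundle']['stage']
    return Path(stage_dir) / f"{dataset['name']}.tar"  # assumed naming scheme
```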