IUSCA / bioloop

Scientific data management portal and pipeline application template

Make workflow steps involving bundles agnostic of worker node #205

Closed ri-pandey closed 7 months ago

ri-pandey commented 7 months ago

In ticket #102 (and this follow-up ticket, #205) we introduced the concept of bundles and bundle downloads, with each bundle's metadata persisted in the newly added bundle table.

The archive, stage, and download steps were rewritten to facilitate this change. The rewrite introduced a dedicated path for storing bundles: the archive step creates the bundle there, and the stage/download steps read the bundle from there.

However, on Bioloop instances with multiple worker nodes, the archive and stage steps can run on different machines, in which case the archive step generates the bundle on a machine other than the one where the stage step looks for the bundle directory.

The process should be rewritten so that the archive step does not persist a path in the bundle table. Instead, the stage/download steps should determine the expected bundle location from environment-dependent config.
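As a rough illustration of the intended change, the stage step could resolve a bundle's location purely from config. This is a minimal sketch; the helper name, import path, and type hints are assumptions for illustration, not the actual worker code:

import os

from workers.config import config  # assumed import path for the worker config


def expected_bundle_path(dataset_type: str, bundle_name: str) -> str:
    """Resolve where a bundle should be found on this worker node.

    dataset_type is 'RAW_DATA' or 'DATA_PRODUCT'. The path comes from
    env-dependent config instead of a path persisted in the bundle table.
    """
    stage_dir = config['paths'][dataset_type]['bundle']['stage']
    return os.path.join(stage_dir, bundle_name)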

ri-pandey commented 7 months ago

Release Steps

Add config properties, create corresponding directories

On the worker node, add the following config properties (two per dataset type), introduced in this issue, to workers/workers/config/production.py:

config['paths']['RAW_DATA']['bundle']['generate']
config['paths']['RAW_DATA']['bundle']['stage']
config['paths']['DATA_PRODUCT']['bundle']['generate']
config['paths']['DATA_PRODUCT']['bundle']['stage']

In the config above, the ['bundle']['generate'] property determines where the archive step will generate the bundle. The ['bundle']['stage'] property determines where the stage step will download the bundle (from SDA) to and stage it to. All four of these directories should be distinct.

The exact location of these directories will depend on the Bioloop instance being released. As an example, the Bioloop dev environment uses the following directories:

'RAW_DATA': {
    # ...
    'bundle': {
        'generate': '/N/scratch/scadev/bioloop/dev/bundles/raw_data',
        'stage': '/N/scratch/scadev/bioloop/dev/staged/raw_data/bundles',
    },
},
'DATA_PRODUCT': {
    # ...
    'bundle': {
        'generate': '/N/scratch/scadev/bioloop/dev/bundles/data_products',
        'stage': '/N/scratch/scadev/bioloop/dev/staged/data_products/bundles',
    },
},

Once these directories have been settled upon, add them to production.py.
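For the dev example above, creating the directories might look like the following (run on the worker node; the paths will differ per instance):

mkdir -p /N/scratch/scadev/bioloop/dev/bundles/raw_data \
         /N/scratch/scadev/bioloop/dev/staged/raw_data/bundles \
         /N/scratch/scadev/bioloop/dev/bundles/data_products \
         /N/scratch/scadev/bioloop/dev/staged/data_products/bundles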

Note: Some Bioloop instances have multiple worker nodes. In such deployments, the archive and stage steps may run on different machines (CPA is an example of this), which means the ['bundle']['generate'] and ['bundle']['stage'] properties will refer to paths on different machines.

Perform Prisma migrations

On the node running the API container, run the following within the api container:

npx prisma migrate deploy
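To confirm the migration applied cleanly, Prisma's status command can be run from the same place:

npx prisma migrate status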

Deploy

On the node running the API container:

cd [app directory]
bin/deploy.sh

Bundle population

The bundle table will need to be populated for bundle downloads to work from the browser. For this, run the populate_bundles script. Ensure that workers/ecosystem.config.js has an entry for the script:

...,
    {
      name: "populate_bundles",
      script: "python",
      args: "-u -m workers.scripts.populate_bundles",
      watch: false,
      interpreter: "",
      log_date_format: "YYYY-MM-DD HH:mm Z",
      error_file: "../logs/workers/populate_bundles.err",
      out_file: "../logs/workers/populate_bundles.log",
      autorestart: false,
    }
...

Restart pm2, which will kick off the populate_bundles script.
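If restarting every pm2 app is undesirable, pm2 can also start just this one entry via its --only flag. This is offered as an alternative, not the documented procedure:

cd workers
pm2 start ecosystem.config.js --only populate_bundles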

This script will populate the bundle table for each dataset and unstage all datasets. After that, each dataset will have to go through the stage_dataset > validate_dataset > setup_dataset_download workflow (which operators can launch from the UI) for its bundle to become available for download.

Steps (run after restarting the poetry environment):

cd workers
pm2 restart ecosystem.config.js
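The script's progress can be followed through its pm2 logs (these are the log files configured in ecosystem.config.js above):

pm2 logs populate_bundles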

Verify that bundles have been populated in the Postgres bundle table:

select *
from 
    dataset left join bundle
    on
        dataset.id = bundle.dataset_id
;
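To list only the datasets that are still missing a bundle row (using the same join keys as above):

select dataset.id
from
    dataset left join bundle
    on
        dataset.id = bundle.dataset_id
where bundle.dataset_id is null
;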

Verify bundle downloads from the browser:

  1. Go to a dataset's page
  2. Stage a dataset
  3. Click the Download button
  4. From the download modal, select the "Download Archive - Transfer of file will use ... of bandwidth" option.

Cleanup

  1. If bundle downloads are working, remove (or disable) the populate_bundles script, so it is not launched again accidentally. This can be done by removing its entry from workers/ecosystem.config.js, followed by another pm2 restart.

  2. Remove the old directories introduced in ticket #102. At the time of writing, this should only be needed for CFNDAP, since CFNDAP is the only instance to which ticket #102 has been released so far. When ticket #205 is later released to CFNDAP, those directories will be redundant and will need to be cleaned up. These directories are:

(On colo24)

/N/scratch/radyuser/cfndap/production/stage/source_data/bundles
/N/scratch/radyuser/cfndap/production/stage/raw_data/bundles
/N/scratch/radyuser/cfndap/production/stage/data_products/bundles
  3. In a future release, remove the now-redundant bundle_size column from the dataset table, and replace usages of it with the bundle table's size column. At the time of writing, the Bioloop codebase has been updated to read the size from the bundle table's size column instead of the dataset table's bundle_size column. However, other Bioloop instances may have usages of bundle_size that are not present in the Bioloop repo, and therefore may not have been accounted for while developing this ticket. Those usages will need to be replaced with the bundle table's size column before the column is dropped (see the sketch below).
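Once every usage has been migrated, the eventual cleanup could reduce to a single column drop. In practice this would be expressed as a Prisma schema change plus a generated migration rather than hand-written SQL, but the migration would amount to something like:

ALTER TABLE dataset DROP COLUMN bundle_size;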