IUSCA / bioloop

Scientific data management portal and pipeline application template

Make workflow steps involving bundles agnostic of worker node #205

Closed ri-pandey closed 7 months ago

ri-pandey commented 7 months ago

In ticket #102 (and this follow-up ticket, #205) we introduced the concept of bundles and bundle downloads, with each bundle's metadata persisted in the newly added bundle table.

The archive, stage, and download steps were rewritten to facilitate this change. The rewrite introduced a dedicated path for storing bundles: the archive step creates the bundle there, and the stage/download steps read the bundle from there.

However, on Bioloop instances with multiple worker nodes, the archive and stage steps can run on different machines, in which case the archive step generates the bundle on a machine other than the one where the stage step looks for the bundle directory.

The process should be rewritten so that the archive step does not persist a path in the bundle table. Instead, the stage/download steps should determine the expected bundle location from environment-dependent config.
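As a rough illustration of the intended change, the stage step could resolve a bundle's location purely from config. This is a minimal sketch; the helper name, import path, and type hints are assumptions for illustration, not the actual worker code:

import os

from workers.config import config  # assumed import path for the worker config


def expected_bundle_path(dataset_type: str, bundle_name: str) -> str:
    """Resolve where a bundle should be found on this worker node.

    dataset_type is 'RAW_DATA' or 'DATA_PRODUCT'. The path comes from
    env-dependent config instead of a path persisted in the bundle table.
    """
    stage_dir = config['paths'][dataset_type]['bundle']['stage']
    return os.path.join(stage_dir, bundle_name)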

ri-pandey commented 7 months ago

Release Steps

Add config properties, create corresponding directories

On the worker node, add the following config properties (two per dataset type), introduced in this issue, to workers/workers/config/production.py:

config['paths']['RAW_DATA']['bundle']['generate']
config['paths']['RAW_DATA']['bundle']['stage']
config['paths']['DATA_PRODUCT']['bundle']['generate']
config['paths']['DATA_PRODUCT']['bundle']['stage']

In the config above, the ['bundle']['generate'] property determines where the archive step will generate the bundle. The ['bundle']['stage'] property determines where the stage step will download the bundle (from SDA) to and stage it to. All four of these directories should be distinct.

The exact location of these directories will depend on the Bioloop instance being released. As an example, the Bioloop dev environment uses the following directories:

'RAW_DATA': {
    # ...
    'bundle': {
        'generate': '/N/scratch/scadev/bioloop/dev/bundles/raw_data',
        'stage': '/N/scratch/scadev/bioloop/dev/staged/raw_data/bundles',
    },
},
'DATA_PRODUCT': {
    # ...
    'bundle': {
        'generate': '/N/scratch/scadev/bioloop/dev/bundles/data_products',
        'stage': '/N/scratch/scadev/bioloop/dev/staged/data_products/bundles',
    },
},

Once these directories have been settled upon, add them to production.py.
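For the dev example above, creating the directories might look like the following (run on the worker node; the paths will differ per instance):

mkdir -p /N/scratch/scadev/bioloop/dev/bundles/raw_data \
         /N/scratch/scadev/bioloop/dev/staged/raw_data/bundles \
         /N/scratch/scadev/bioloop/dev/bundles/data_products \
         /N/scratch/scadev/bioloop/dev/staged/data_products/bundles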

Note: Some Bioloop instances have multiple worker nodes. In such deployments, the archive and stage steps may run on different machines (CPA is an example of this), which means the ['bundle']['generate'] and ['bundle']['stage'] properties will refer to paths on different machines.

Perform Prisma migrations

On the node running the API container, run the following within the api container:

npx prisma migrate deploy
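To confirm the migration applied cleanly, Prisma's status command can be run from the same place:

npx prisma migrate status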

Deploy

On the node running the API container:

cd [app directory]
bin/deploy.sh

Bundle population

The bundle table will need to be populated for bundle downloads to work from the browser. For this, run the populate_bundles script. Ensure that workers/ecosystem.config.js has an entry for the script:

...,
    {
      name: "populate_bundles",
      script: "python",
      args: "-u -m workers.scripts.populate_bundles",
      watch: false,
      interpreter: "",
      log_date_format: "YYYY-MM-DD HH:mm Z",
      error_file: "../logs/workers/populate_bundles.err",
      out_file: "../logs/workers/populate_bundles.log",
      autorestart: false,
    }
...

Restart pm2, which will kick off the populate_bundles script.
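If restarting every pm2 app is undesirable, pm2 can also start just this one entry via its --only flag. This is offered as an alternative, not the documented procedure:

cd workers
pm2 start ecosystem.config.js --only populate_bundles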

This script will populate the bundle table for each dataset and unstage all datasets. After that, each dataset will have to go through the stage_dataset > validate_dataset > setup_dataset_download workflow (which operators can launch from the UI) for its bundle to become available for download.

Steps (run after restarting the poetry environment):

cd workers
pm2 restart ecosystem.config.js
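The script's progress can be followed through its pm2 logs (these are the log files configured in ecosystem.config.js above):

pm2 logs populate_bundles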

Verify that bundles have been populated in the Postgres bundle table:

select *
from 
    dataset left join bundle
    on
        dataset.id = bundle.dataset_id
;
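To list only the datasets that are still missing a bundle row (using the same join keys as above):

select dataset.id
from
    dataset left join bundle
    on
        dataset.id = bundle.dataset_id
where bundle.dataset_id is null
;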

Verify bundle downloads from the browser:

  1. Go to a dataset's page
  2. Stage a dataset
  3. Click the Download button
  4. From the download modal, select the "Download Archive - Transfer of file will use ... of bandwidth" option.

Cleanup

  1. If bundle downloads are working, remove (or disable) the populate_bundles script, so it is not launched again accidentally. This can be done by removing its entry from workers/ecosystem.config.js, followed by another pm2 restart.

  2. Remove the old directories introduced in ticket #102. At the time of writing, this should only be needed for CFNDAP, since CFNDAP is the only instance to which ticket #102 has been released so far. When ticket #205 is later released to CFNDAP, those directories will be redundant and will need to be cleaned up. These directories are:

(On colo24)

/N/scratch/radyuser/cfndap/production/stage/source_data/bundles
/N/scratch/radyuser/cfndap/production/stage/raw_data/bundles
/N/scratch/radyuser/cfndap/production/stage/data_products/bundles
  3. In a future release, remove the now-redundant bundle_size column from the dataset table, and replace usages of it with the bundle table's size column. At the time of writing, the Bioloop codebase has been updated to read the size from the bundle table's size column instead of the dataset table's bundle_size column. However, other Bioloop instances may have usages of bundle_size that are not present in the Bioloop repo, and therefore may not have been accounted for while developing this ticket. Those usages will need to be replaced with the bundle table's size column before the column is dropped (see the sketch below).
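Once every usage has been migrated, the eventual cleanup could reduce to a single column drop. In practice this would be expressed as a Prisma schema change plus a generated migration rather than hand-written SQL, but the migration would amount to something like:

ALTER TABLE dataset DROP COLUMN bundle_size;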