IUSCA / bioloop

Scientific data management portal and pipeline application template

Rewrite bundle population script to not download archives from SDA #201

Closed ri-pandey closed 8 months ago

ri-pandey commented 8 months ago

In ticket #102 we added the functionality to persist bundle metadata in a separate table. In that ticket, we settled on an initial approach for populating bundle metadata, which downloads bundles from the SDA, computes their size, checksum, etc., and uses that info to persist the bundle metadata.

Since downloading all datasets from the SDA can take a long time, we should rewrite this script so that it can populate bundles without needing to download datasets.

populate_bundles.py was the original script introduced in ticket #102; it downloaded datasets from the SDA.

ri-pandey commented 8 months ago

This implementation should be considered outdated, since it was revised in ticket #205. To release the bundle download feature to a bioloop instance, follow the release steps documented in #205 instead of the ones below.


This will require the following steps to be performed before/after release:

Before Release

  1. Add the following properties to production.py:
    config['paths']['RAW_DATA']['bundle']
    config['paths']['DATA_PRODUCT']['bundle']

    The values for these properties are environment-specific. They should be derived as follows:

    config['paths']['RAW_DATA']['bundle']        ->   config['paths']['RAW_DATA']['stage'] / bundles
    config['paths']['DATA_PRODUCT']['bundle'] ->   config['paths']['DATA_PRODUCT']['stage'] / bundles

    As examples, these values would resolve to the following in the Bioloop dev env:

    config['paths']['RAW_DATA']['bundle']        ->   /N/scratch/scadev/bioloop/dev/staged/raw_data/bundles
    config['paths']['DATA_PRODUCT']['bundle'] ->   /N/scratch/scadev/bioloop/dev/staged/data_product/bundles

Note: For Bioloop instances with multiple worker nodes (like CPA), these properties will need to be configured on the node that runs the stage_dataset, validate_dataset, and setup_dataset_download steps.
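As a sketch, the new entries in production.py might look like the following. The surrounding config structure and the exact paths are assumptions based on the dev-env examples above; adjust them per environment:

```python
from pathlib import Path

# Hypothetical sketch of the additions to production.py.
# The stage root shown here is the Bioloop dev-env example from above.
stage_root = Path('/N/scratch/scadev/bioloop/dev/staged')

config = {
    'paths': {
        'RAW_DATA': {
            'stage': str(stage_root / 'raw_data'),
            # new property: bundle dir under the stage dir
            'bundle': str(stage_root / 'raw_data' / 'bundles'),
        },
        'DATA_PRODUCT': {
            'stage': str(stage_root / 'data_product'),
            # new property: bundle dir under the stage dir
            'bundle': str(stage_root / 'data_product' / 'bundles'),
        },
    },
}
```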

  2. Run a Prisma migration so that the new bundle table is created:

    npx prisma migrate deploy
  3. Deploy the application

After Release

  1. Run the populate_bundles script. Ensure that workers/ecosystem.config.js contains the config for the script:
...,
    {
      name: "populate_bundles",
      script: "python",
      args: "-u -m workers.scripts.populate_bundles",
      watch: false,
      interpreter: "",
      log_date_format: "YYYY-MM-DD HH:mm Z",
      error_file: "../logs/workers/populate_bundles.err",
      out_file: "../logs/workers/populate_bundles.log",
      autorestart: false,
    }
...

For Bioloop instances with multiple worker nodes (like CPA), this script will need to be run on the node that runs the stage_dataset, validate_dataset, and setup_dataset_download steps.

Restart pm2.

After pm2 is restarted, the script workers/workers/scripts/populate_bundles.py will kick off automatically. This script will populate the bundle table for each dataset, and unstage all datasets. These datasets will have to go through the stage_dataset > validate_dataset > setup_dataset_download workflow for the bundles to become available for download.

Steps (to restart the task in pm2):

pm2 ls
# note down id of the populate_bundles task
pm2 restart [id of populate_bundles task]
  2. Remove the entry for populate_bundles from workers/ecosystem.config.js

  3. In a future release, remove the now-redundant bundle_size column from the dataset table, along with its usages across the project. At the time of writing, Bioloop has already been updated to use the size from the bundle table instead of the dataset table, so any remaining usages of bundle_size simply need to be removed. The original PR for ticket #102 (#176) can help identify the places in the code where bundle_size should be removed.