ACED-IDP / gen3-helm

Helm charts for Gen3 Deployments
Apache License 2.0

Automating Data Upload, Deletion (General Data Update Operations) #33

Closed: lbeckman314 closed this issue 6 months ago

lbeckman314 commented 11 months ago

Data Operations

Currently, once a project has been created and users have been given access to it, the process for uploading a dataset to the Gen3 platform involves the following steps:

End User Steps:

  1. Import the metadata

    gen3_util meta import dir <DATA DIRECTORY> --project_id aced-foo
  2. Upload the metadata to the S3 Bucket

    gen3_util files cp --duplicate_check --project_id aced-foo manifest/DocumentReference.ndjson bucket://aced-ohsu-production
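The two end-user steps above could be wrapped in a small helper. This is a hypothetical convenience script, not part of the repo; the `gen3_util` invocations are taken verbatim from the steps above, and the `DRY_RUN` variable is an assumption added so the commands can be previewed without running them.

```shell
# Hypothetical wrapper around the two end-user upload steps.
# Set DRY_RUN=echo to print the commands instead of executing them.

upload_dataset() {
  data_dir="$1"     # e.g. ./my-study
  project_id="$2"   # e.g. aced-foo
  bucket="$3"       # e.g. bucket://aced-ohsu-production

  # 1. Import the metadata from the data directory
  ${DRY_RUN:-} gen3_util meta import dir "$data_dir" --project_id "$project_id"

  # 2. Upload the generated manifest to the S3 bucket
  ${DRY_RUN:-} gen3_util files cp --duplicate_check --project_id "$project_id" \
      manifest/DocumentReference.ndjson "$bucket"
}
```

For example, `DRY_RUN=echo upload_dataset ./my-study aced-foo bucket://aced-ohsu-production` prints both commands without touching the bucket.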

Manual ETL Pod/Chart Steps:

  1. List the metadata available on the S3 Bucket

    gen3_util meta ls
  2. Download the desired metadata

    gen3 file download-single <DID>
    # Downloads _aced-foo_meta.zip
  3. Move the metadata to the $HOME/studies directory

    cp _aced-foo_meta.zip studies
    unzip studies/_aced-foo_meta.zip -d studies
  4. Upload the data and metadata to the Gen3 endpoint

    ./scripts/load_all foo
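The four manual ETL pod steps above can be sketched as a single function. This is a hedged sketch, not the pod's actual tooling: the `DRY_RUN` variable and the `-d studies` extraction target are assumptions, and the DID is expected to come from a prior `gen3_util meta ls`.

```shell
# Hypothetical sketch of the manual ETL pod steps.
# Set DRY_RUN=echo to print the commands instead of executing them.

etl_load() {
  project="$1"   # project short name, e.g. foo
  did="$2"       # DID of the metadata zip, found via `gen3_util meta ls`

  # 2. Download the desired metadata (produces _aced-${project}_meta.zip)
  ${DRY_RUN:-} gen3 file download-single "$did"

  # 3. Move the metadata into the studies directory and unpack it there
  ${DRY_RUN:-} cp "_aced-${project}_meta.zip" studies
  ${DRY_RUN:-} unzip "studies/_aced-${project}_meta.zip" -d studies

  # 4. Load the data and metadata into the Gen3 endpoint
  ${DRY_RUN:-} ./scripts/load_all "$project"
}
```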

Next Steps & Potential Improvements

Ideally, the ETL pod/chart steps would be automated in one of the following ways:

The first two options would be preferred, to minimize the time between when data is uploaded and when it becomes visible to end users on the Gen3 platform.
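As one illustrative shape such automation could take (an assumption, not taken from the issue): the ETL pod could periodically poll the bucket's metadata listing and trigger a load when it changes. The state file, diff logic, and polling approach below are all hypothetical.

```shell
# Illustrative sketch only: poll for new metadata and trigger a load.
# The state file location and plain-text diff are assumptions.

poll_and_load() {
  project="$1"
  state_file="/tmp/last_meta_listing"

  # Compare the current metadata listing against the last one we saw
  listing=$(gen3_util meta ls)
  if [ "$listing" != "$(cat "$state_file" 2>/dev/null)" ]; then
    ./scripts/load_all "$project"          # new metadata: run the load
    printf '%s' "$listing" > "$state_file" # remember this listing
  fi
}

# Could be driven by cron or a `while true; do ...; sleep 300; done` loop.
```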

Data Deletion

The same workflow should apply to data deletion (i.e., an end user should be able to run a CLI command to delete a file, and the ETL pod should automatically remove the corresponding data from Elasticsearch).
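On the ETL side, the Elasticsearch cleanup could use the standard `_delete_by_query` API. This is a hedged sketch: the endpoint URL, index name, and `object_id` field name are all assumptions about how the documents are keyed.

```shell
# Hedged sketch: remove a deleted file's document(s) from Elasticsearch
# via the _delete_by_query API. Endpoint, index, and field are assumptions.
# Set DRY_RUN=echo to print the command instead of executing it.

es_delete_file() {
  es_url="$1"   # e.g. http://elasticsearch:9200 (assumed)
  index="$2"    # e.g. an index for the project (assumed)
  did="$3"      # DID of the file being deleted

  ${DRY_RUN:-} curl -s -X POST "${es_url}/${index}/_delete_by_query" \
      -H 'Content-Type: application/json' \
      -d "{\"query\": {\"term\": {\"object_id\": \"${did}\"}}}"
}
```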


Resources

lbeckman314 commented 10 months ago

ETL diagram (drawio attachment)

lbeckman314 commented 10 months ago
ETL-Argo-Load diagram (source attachment)

lbeckman314 commented 6 months ago

Resolved and supported with gen3_util (see upload steps here)!