Data Operations
Currently, once a project has been created and users have been given access to that project, the process for uploading a dataset to the Gen3 platform involves the following steps:
End User Steps:
Import the metadata
gen3_util meta import dir <DATA DIRECTORY> --project_id aced-foo
Upload the metadata to the S3 Bucket
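The end-user upload step can be sketched in Python. This is a minimal illustration, assuming boto3 and a hypothetical bucket name and key layout; in practice the CLI handles authentication and bucket details.

```python
import os

def object_key(project_id: str, path: str) -> str:
    """Assumed key layout: <project_id>/<file name>. The real layout is an assumption."""
    return f"{project_id}/{os.path.basename(path)}"

def upload_metadata(path: str, project_id: str,
                    bucket: str = "example-upload-bucket") -> None:
    """Upload one metadata file to the (hypothetical) S3 upload bucket."""
    import boto3  # deferred so the key helper is usable without AWS dependencies
    boto3.client("s3").upload_file(path, bucket, object_key(project_id, path))
```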
Manual ETL Pod/Chart Steps:
List the metadata available on the S3 Bucket
Download the desired metadata
Move the metadata to the $HOME/studies directory
Upload the data and metadata to the Gen3 endpoint
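The manual ETL pod/chart steps above (list, download, move into $HOME/studies) could look roughly like the following sketch, assuming boto3 and a hypothetical bucket name; the final upload to the Gen3 endpoint is left as a comment since it is deployment-specific.

```python
import os

STUDIES_DIR = os.path.expanduser("~/studies")

def study_destination(key: str, studies_dir: str = STUDIES_DIR) -> str:
    """Map an S3 object key like 'aced-foo/meta.ndjson' into the studies directory."""
    return os.path.join(studies_dir, key)

def sync_metadata(bucket: str = "example-metadata-bucket", prefix: str = "") -> None:
    import boto3  # deferred so the path helper is usable without AWS dependencies
    s3 = boto3.client("s3")
    # 1. List the metadata available on the S3 bucket
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # 2./3. Download the desired metadata directly into $HOME/studies
            dest = study_destination(obj["Key"])
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(bucket, obj["Key"], dest)
    # 4. Upload the data and metadata to the Gen3 endpoint (deployment-specific)
```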
Next Steps & Potential Improvements
Ideally, the ETL pod/chart steps would be automated in one of the following ways:
a Sower job would be started and run with the required configuration for the data upload
a webhook or other signal would be sent to the ETL pod/chart with the required did to start the steps above
a listener in the ETL pod/chart would periodically scan for new projects or new data uploads
The first two options would be preferred, to minimize the time between when the data is uploaded and when it is made visible to end users on the Gen3 platform.
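The second option (a webhook carrying the did) could be sketched as a minimal HTTP receiver inside the ETL pod. The endpoint path and JSON payload shape here are assumptions for illustration, not part of Gen3.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_upload_event(body: bytes) -> str:
    """Extract the did from an assumed payload like {"did": "...", "project_id": "..."}."""
    return json.loads(body)["did"]

class UploadWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        did = parse_upload_event(body)
        # Here the pod would run the manual steps for this did:
        # download the metadata and load it into the Gen3 endpoint.
        self.send_response(202)
        self.end_headers()
        self.wfile.write(did.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), UploadWebhook).serve_forever()
```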
Data Deletion
The same steps for data upload should also apply for data deletion (i.e. an end user should be able to run a CLI command to delete a file and the ETL pod should automatically delete the data from ElasticSearch).
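The automated deletion side could be sketched with the Elasticsearch Python client's delete-by-query API. The index name, endpoint, and the `object_id` field name below are assumptions; the real values come from the ETL configuration.

```python
def es_delete_query(did: str) -> dict:
    """Query matching all indexed documents for a given file did (field name assumed)."""
    return {"query": {"term": {"object_id": did}}}

def delete_file_documents(did: str, index: str = "example-file-index") -> None:
    from elasticsearch import Elasticsearch  # deferred; requires the ES client package
    es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint
    es.delete_by_query(index=index, body=es_delete_query(did))
```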
Resources