Data Operations
Currently, once a project has been created and users have been given access to that project, the process for uploading a dataset to the Gen3 platform involves the following steps:
End User Steps:
Import the metadata
gen3_util meta import dir <DATA DIRECTORY> --project_id aced-foo
Upload the metadata to the S3 Bucket
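The end-user upload step can be sketched in Python. This is a minimal illustration, assuming boto3 and a hypothetical bucket name and key layout; in practice the CLI handles authentication and bucket details.

```python
import os

def object_key(project_id: str, path: str) -> str:
    """Assumed key layout: <project_id>/<file name>. The real layout is an assumption."""
    return f"{project_id}/{os.path.basename(path)}"

def upload_metadata(path: str, project_id: str,
                    bucket: str = "example-upload-bucket") -> None:
    """Upload one metadata file to the (hypothetical) S3 upload bucket."""
    import boto3  # deferred so the key helper is usable without AWS dependencies
    boto3.client("s3").upload_file(path, bucket, object_key(project_id, path))
```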
Manual ETL Pod/Chart Steps:
List the metadata available on the S3 Bucket
Download the desired metadata
Move the metadata to the $HOME/studies directory
Upload the data and metadata to the Gen3 endpoint
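The manual ETL pod/chart steps above (list, download, move into $HOME/studies) could look roughly like the following sketch, assuming boto3 and a hypothetical bucket name; the final upload to the Gen3 endpoint is left as a comment since it is deployment-specific.

```python
import os

STUDIES_DIR = os.path.expanduser("~/studies")

def study_destination(key: str, studies_dir: str = STUDIES_DIR) -> str:
    """Map an S3 object key like 'aced-foo/meta.ndjson' into the studies directory."""
    return os.path.join(studies_dir, key)

def sync_metadata(bucket: str = "example-metadata-bucket", prefix: str = "") -> None:
    import boto3  # deferred so the path helper is usable without AWS dependencies
    s3 = boto3.client("s3")
    # 1. List the metadata available on the S3 bucket
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # 2./3. Download the desired metadata directly into $HOME/studies
            dest = study_destination(obj["Key"])
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(bucket, obj["Key"], dest)
    # 4. Upload the data and metadata to the Gen3 endpoint (deployment-specific)
```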
Next Steps & Potential Improvements
Ideally, the ETL pod/chart steps would be automated in one of the following ways:
a Sower job would be started and run with the required configuration for the data upload
a webhook or other signal would be sent to the ETL pod/chart with the required did to start the steps above
a listener in the ETL pod/chart would periodically scan for new projects or new data uploads
The first two options would be preferred, to minimize the time between when the data is uploaded and when it is made visible to end users on the Gen3 platform.
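The second option (a webhook carrying the did) could be sketched as a minimal HTTP receiver inside the ETL pod. The endpoint path and JSON payload shape here are assumptions for illustration, not part of Gen3.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_upload_event(body: bytes) -> str:
    """Extract the did from an assumed payload like {"did": "...", "project_id": "..."}."""
    return json.loads(body)["did"]

class UploadWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        did = parse_upload_event(body)
        # Here the pod would run the manual steps for this did:
        # download the metadata and load it into the Gen3 endpoint.
        self.send_response(202)
        self.end_headers()
        self.wfile.write(did.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), UploadWebhook).serve_forever()
```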
Data Deletion
The same steps for data upload should also apply for data deletion (i.e. an end user should be able to run a CLI command to delete a file and the ETL pod should automatically delete the data from ElasticSearch).
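The automated deletion side could be sketched with the Elasticsearch Python client's delete-by-query API. The index name, endpoint, and the `object_id` field name below are assumptions; the real values come from the ETL configuration.

```python
def es_delete_query(did: str) -> dict:
    """Query matching all indexed documents for a given file did (field name assumed)."""
    return {"query": {"term": {"object_id": did}}}

def delete_file_documents(did: str, index: str = "example-file-index") -> None:
    from elasticsearch import Elasticsearch  # deferred; requires the ES client package
    es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint
    es.delete_by_query(index=index, body=es_delete_query(did))
```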
Resources