IUSCA / bioloop

Scientific data management portal and pipeline application template

Implement "Upload" feature #99

Open charlesbrandt opened 1 year ago

charlesbrandt commented 1 year ago

Occasionally, data products are generated outside a given Bioloop instance. In those situations, operators need a way to manually import or ingest the data product.

In CMG we call this process uploading; however, it is more similar to ingesting from a custom path:

[screenshot: CMG upload dialog]

If the operator derived the data product from a dataset (raw data source) that already exists in the data portal, there should be a way to associate the data product being uploaded with that source dataset.

@ryanlong89 has also implemented related functionality in the context of NGVB that successfully manages transferring large data from the operator's local machine to the server. @ryanlong89 , could you add a pointer to that functionality here?

@ri-pandey , once you've had a chance to review existing solutions, please document and describe the approach you plan to take to implement the feature here in this issue (before starting actual development).

FYI @deepakduggirala

ryanlong89 commented 1 year ago

Here is the NGVB UI component where the upload starts:

https://github.com/IUSCA/NGVB/blob/main/ui/src/components/forms/UploadStorageData.vue

That component handles chunking the file and sending each piece to the backend, which stores the chunks in a folder, here:

https://github.com/IUSCA/NGVB/blob/main/api/routes/data.js#L23
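
For illustration, here is a minimal sketch of that client-side chunking pattern, written in Python rather than the Vue component's JavaScript; the endpoint URL, chunk size, and form-field names are assumptions, not the actual NGVB API:

```python
import hashlib
from pathlib import Path

import requests

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MiB per chunk (illustrative)
UPLOAD_URL = "https://example.org/api/data/chunk"  # hypothetical endpoint

def upload_in_chunks(path: str, upload_id: str) -> int:
    """Split a file into fixed-size chunks and POST each one to the backend,
    which stores them in a per-upload folder until the merge step runs."""
    file_path = Path(path)
    index = 0
    with file_path.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            resp = requests.post(
                UPLOAD_URL,
                data={
                    "upload_id": upload_id,  # groups the chunks of one upload
                    "index": index,          # preserves ordering for the merge
                    "checksum": hashlib.md5(chunk).hexdigest(),  # integrity check
                    "filename": file_path.name,
                },
                files={"chunk": (str(index), chunk)},
            )
            resp.raise_for_status()
            index += 1
    return index  # number of chunks sent
```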

Finally, it calls the merge endpoint, which now hands off to a worker to finish the merge (something we needed for larger files):

https://github.com/IUSCA/NGVB/blob/main/api/routes/data.js#L43

The actual file merge is handled in a Python worker script, which you can find here:

https://github.com/IUSCA/NGVB/blob/main/worker/merge.py
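
At its core, that merge is an ordered, streaming concatenation of the chunk files. A minimal sketch, assuming chunks are saved under integer filenames in a per-upload directory (the actual layout in merge.py may differ):

```python
from pathlib import Path

def merge_chunks(chunk_dir: str, output_path: str,
                 buffer_size: int = 1024 * 1024) -> None:
    """Concatenate numbered chunk files into one output file.

    Streams each chunk through a fixed-size buffer so large files can be
    merged without loading everything into memory (the reason this work
    was moved out of the request handler and into a worker).
    """
    chunks = sorted(Path(chunk_dir).iterdir(), key=lambda p: int(p.name))
    with open(output_path, "wb") as out:
        for chunk in chunks:
            with chunk.open("rb") as f:
                while buf := f.read(buffer_size):
                    out.write(buf)
            chunk.unlink()  # drop the chunk once it has been appended
```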

ri-pandey commented 1 year ago

After talking to Ryan and Deepak, it seems reasonable to go with the overall strategy Ryan implemented in NGVB, with some tweaks.

We could create an endpoint on the Colo node to receive the file chunks; from there, a worker could merge the chunks into a file, create the data product, and run the subsequent workflow steps (sketched below).
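
Sketched end to end, the worker side of that plan might look like the following. Every name here is a hypothetical placeholder for Bioloop internals; only the ordering of steps (merge, register, run workflow) reflects the proposal:

```python
from pathlib import Path

def merge_chunks(chunk_dir: str, output_path: str) -> str:
    """Step 1: merge the received chunks into the final file
    (see the streaming merge sketch earlier in this thread)."""
    chunks = sorted(Path(chunk_dir).iterdir(), key=lambda p: int(p.name))
    with open(output_path, "wb") as out:
        for chunk in chunks:
            out.write(chunk.read_bytes())
    return output_path

def create_data_product(path: str, derived_from: str | None) -> dict:
    """Step 2 (placeholder): register the merged file as a data product,
    optionally associated with the raw data source it was derived from."""
    return {"path": path, "derived_from": derived_from}

def run_workflow_steps(data_product: dict) -> None:
    """Step 3 (placeholder): kick off the subsequent workflow steps."""
    print(f"starting post-upload workflow for {data_product['path']}")

def process_upload(chunk_dir: str, output_path: str,
                   source_dataset_id: str | None = None) -> None:
    """Hypothetical Colo-node worker entry point: merge, register, run workflow."""
    merged = merge_chunks(chunk_dir, output_path)
    product = create_data_product(merged, derived_from=source_dataset_id)
    run_workflow_steps(product)
```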

deepakduggirala commented 1 year ago

@ri-pandey You may also need to consider

deepakduggirala commented 1 year ago

For authentication, when making an upload request from the browser, send the JWT issued by the app API when the user logs in. The upload API needs to verify the signature using the app API's public key. For this to work, a copy of the public key has to be placed in the upload API's repo.
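
As a concrete illustration of this first approach, verifying a signature against a copied public key looks roughly like this with PyJWT; the real upload API would do the equivalent in its own stack, and the key path and RS256 algorithm are assumptions:

```python
import jwt  # PyJWT

def verify_upload_token(token: str,
                        public_key_path: str = "keys/app_api.pub") -> dict:
    """Verify a JWT issued by the app API against a local copy of its
    public key; returns the decoded claims on success.

    Raises jwt.InvalidTokenError subclasses (ExpiredSignatureError,
    InvalidSignatureError, ...) when the token cannot be trusted.
    """
    with open(public_key_path) as f:
        public_key = f.read()
    # Pinning the expected algorithm prevents algorithm-confusion attacks.
    return jwt.decode(token, public_key, algorithms=["RS256"])
```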

Another approach is to generate a new token using the Signet OAuth service, similar to how secure download works; the code for validating such a token is in the secure download API. The user's role needs to be included in this token for authorization.
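
Under either approach, the authorization step would then inspect the role carried in the verified token. A minimal sketch, assuming a `role` claim (the actual claim layout depends on the app API / Signet token):

```python
def authorize_upload(claims: dict,
                     allowed_roles: tuple = ("operator", "admin")) -> None:
    """Reject uploads whose token does not carry an allowed role.

    `claims` is the payload returned by signature verification; the
    claim name and role values here are assumptions for illustration.
    """
    role = claims.get("role")
    if role not in allowed_roles:
        raise PermissionError(f"role {role!r} is not permitted to upload")
```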

ri-pandey commented 9 months ago

Needs testing in a remote test environment.