IUSCA / bioloop

Scientific data management portal and pipeline application template

Provide data delivery mechanism via Globus #106

Open charlesbrandt opened 1 year ago

charlesbrandt commented 1 year ago

Globus allows data transfer between local storage targets and targets outside of the university. Instead of downloading data directly to a desktop client via a browser, or providing a path on local storage where the data has already been staged, this feature would allow data transfer using Globus. Some questions to explore first:

https://www.globus.org/
Research data management simplified. | globus

https://www.globus.org/platform/services/flows
Globus Flows | globus

https://kb.iu.edu/d/bdqp
Use the IU Globus Web App to transfer data between your accounts on IU's research computing and storage systems

ri-pandey commented 11 months ago

IU's Globus instance

  1. Machines under IU's research computing infrastructure can be accessed within Globus as 'Collections'.
  2. Data transfer from one Collection to another within IU's Globus instance is possible (example - Slate-Scratch to SDA).
  3. Data transfer from a Collection in IU's Globus instance to a different Globus instance is also possible (example - Slate-Scratch to a Collection in Harvard's Globus instance).
  4. A user can access data spread across multiple Globus instances from within a single Globus instance (this is how data is transferred from one Globus instance to another). This is made possible by linking multiple identities to one Globus account. For example, a user with accounts in both IU's and Harvard's Globus instances could log in to IU's instance, link their Harvard identity to it, and then access data in both instances while logged in only to IU's.
  5. Institutional logins are available.
  6. Data can be shared with researchers even if they do not have an institutional login, although this might be a paid feature.
  7. Within Globus, users can be organized into groups. Groups can have their own roles and policies for restricting access.
  8. 'High Assurance Collections' offer some additional protection measures for working with sensitive data, like forced encryption of data during transit.
  9. A Collection can also be mapped to an external storage system, like AWS, Google Storage, etc.
  10. Workflows can be created within Globus (using JSON) to automate complex tasks (for example, a task that copies a file to an intermediate location, before moving it to the destination).
  11. Globus offers APIs, accessed via OAuth flows, that we could leverage if we really want to build Globus features into the Bioloop UI.
  12. Python and JS SDKs are available for using Globus features programmatically.
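
To make item 11 a bit more concrete, here is a minimal sketch of the first and last legs of an OAuth 2.0 authorization code flow against Globus Auth, using only the Python standard library. The client ID and redirect URI are placeholders (a real app would first be registered with Globus), and the helper names `build_authorize_url` and `parse_callback` are made up for illustration. The Globus SDKs wrap these steps, so this is only meant to show what happens underneath, not how Bioloop would necessarily do it.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Globus Auth authorization endpoint and the Transfer API scope.
GLOBUS_AUTHORIZE_URL = "https://auth.globus.org/v2/oauth2/authorize"
TRANSFER_SCOPE = "urn:globus:auth:scope:transfer.api.globus.org:all"


def build_authorize_url(client_id, redirect_uri, state=None):
    """Return (url, state) for redirecting the user to Globus Auth.

    client_id and redirect_uri are placeholders supplied by the caller.
    """
    # The state value is echoed back by Globus Auth and guards against CSRF.
    state = state or secrets.token_urlsafe(16)
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": TRANSFER_SCOPE,
        "response_type": "code",
        "state": state,
    }
    return f"{GLOBUS_AUTHORIZE_URL}?{urlencode(params)}", state


def parse_callback(callback_url, expected_state):
    """Extract the authorization code from the redirect back to the app."""
    query = parse_qs(urlparse(callback_url).query)
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch; refusing to continue")
    return query["code"][0]
```

The returned code would then be exchanged for tokens at the Globus Auth token endpoint, which is the part an SDK is most useful for.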

Finally, Globus offers a lot of flexibility in transfers, like checking completion percentage, status, overriding checksum verification, resuming failed transfers, etc. These features are currently available in the Globus UI, so it doesn't make sense to replicate these features in Bioloop.

charlesbrandt commented 10 months ago

Upon further discussion and review, we would like to minimize duplicated effort when enabling data delivery from Bioloop through the Globus network.

Bioloop operators should have a way to specify which Globus users are allowed to read data from a specific Bioloop project. We expect this will be a one-way operation. Data can be read from Bioloop and delivered via Globus. We do not expect to receive data via Globus and write/ingest to Bioloop.
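
As a rough sketch of what "operators specify which Globus users may read a project" could translate to, the snippet below turns a Bioloop-side record into read-only access rule documents in the shape the Globus Transfer API uses for collection ACLs. The `ProjectShare` class, its fields, and the example path are hypothetical; only the rule document keys (`DATA_TYPE`, `principal_type`, `principal`, `path`, `permissions`) come from the Transfer API. Permissions are hard-coded to `"r"` to reflect the one-way expectation above.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectShare:
    """Hypothetical Bioloop record of who may read a project via Globus."""
    project_path: str  # directory of the project inside the collection
    reader_identity_ids: list = field(default_factory=list)  # Globus identity UUIDs


def normalize_dir_path(path):
    """Access rule paths are directories: they start and end with '/'."""
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path += "/"
    return path


def read_only_rules(share):
    """Build one read-only access rule document per allowed reader."""
    path = normalize_dir_path(share.project_path)
    return [
        {
            "DATA_TYPE": "access",
            "principal_type": "identity",
            "principal": identity_id,
            "path": path,
            "permissions": "r",  # never "rw": delivery is one-way
        }
        for identity_id in share.reader_identity_ids
    ]
```

With the Python SDK, each document could presumably be submitted with `TransferClient.add_endpoint_acl_rule(collection_id, rule)`, assuming the collection allows sharing.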

In Bioloop, once a project has been configured for sharing via Globus, all other operations should be handled by Globus. To that end, we anticipate needing to configure a Globus Connect Server to handle data delivery to other Globus endpoints:

https://www.globus.org/globus-connect-server

I believe this is what other services like SDA or Slate are running to facilitate working with those resources through Globus. Running a Globus Connect Server will likely require a subscription:

https://www.globus.org/subscriptions

We would like to confirm that this is a viable path for delivering data managed in an instance of Bioloop via the Globus network.

We would also like to understand what needs to happen for the Globus Connect Server to know which Bioloop projects are available for sharing, and which users should be granted read access.
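
One possible answer, sketched under assumptions: Bioloop keeps the list of allowed readers per project and periodically reconciles it against the access rules that already exist on the collection. The function below only computes the difference. `plan_acl_changes` is a made-up name, and the rule dictionaries are assumed to carry the `id`, `principal`, `principal_type`, and `path` keys that Transfer API access rule documents have.

```python
def plan_acl_changes(desired_readers, existing_rules, path):
    """Compare Bioloop's desired readers with the rules already in Globus.

    desired_readers: iterable of Globus identity IDs that should have access.
    existing_rules:  list of access rule dicts fetched from the collection.
    path:            the project directory the rules apply to.

    Returns (identities_to_add, rule_ids_to_remove), both sorted.
    """
    desired = set(desired_readers)
    # Map principal -> rule id for identity rules on this exact path only.
    current = {
        rule["principal"]: rule["id"]
        for rule in existing_rules
        if rule.get("principal_type") == "identity" and rule.get("path") == path
    }
    to_add = sorted(desired - set(current))
    to_remove = sorted(current[p] for p in set(current) - desired)
    return to_add, to_remove


# Example: "alice" loses access, "carol" gains it, "bob" is unchanged.
existing = [
    {"id": "rule-1", "principal": "alice", "principal_type": "identity", "path": "/projects/demo/"},
    {"id": "rule-2", "principal": "bob", "principal_type": "identity", "path": "/projects/demo/"},
]
to_add, to_remove = plan_acl_changes({"bob", "carol"}, existing, "/projects/demo/")
```

Rules on other paths or for group principals are left untouched, so the reconciliation cannot remove access it did not create for that project.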

We will need to learn which subscription level is appropriate for this use case.

ri-pandey commented 10 months ago

@charlesbrandt There are APIs which should make the above possible.

'Sharing' in Globus (i.e. sharing with a user as opposed to transferring data to an endpoint) takes place via Guest Collections. Guest Collections can be created by the user who wants to share their data. An existing Guest Collection may also be used. Once a user or group of users has been granted read access to the Guest Collection, they should be able to access the data in their Globus web instance.

Here's the flow I am envisioning within Bioloop:

  1. User visits a project or dataset that they want to share
  2. Upon choosing the 'Share via Globus' button, they walk through a stepper that has them authenticate into Globus using IU Login (from within Bioloop) and create a Guest Collection, or select an existing one.
  3. Next, they share the dataset/project in question with the Guest Collection.
  4. Finally, they select a Globus user or group of users who will be granted read privileges on the Guest Collection. In our case, this would be external collaborators.

The steps I listed above are achievable through the Globus API.
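
The four steps above could be tracked on the Bioloop side with a small state object like the one below. This is only a sketch of the stepper's ordering logic: `ShareStep`, `ShareStepper`, and the context keys are hypothetical names, and the actual Globus API calls for each step are deliberately left out.

```python
from enum import Enum


class ShareStep(Enum):
    AUTHENTICATE = 1             # log in to Globus via IU Login
    CHOOSE_GUEST_COLLECTION = 2  # create a Guest Collection or pick one
    SHARE_DATASET = 3            # share the dataset/project with it
    GRANT_READ_ACCESS = 4        # pick users/groups that may read
    DONE = 5


class ShareStepper:
    """Enforces that the 'Share via Globus' steps happen in order."""

    def __init__(self):
        self.step = ShareStep.AUTHENTICATE
        self.context = {}  # data collected along the way

    def complete(self, step, **data):
        """Mark `step` as finished and move to the next one."""
        if step is not self.step:
            raise ValueError(f"expected {self.step.name}, got {step.name}")
        self.context.update(data)
        self.step = ShareStep(step.value + 1)
        return self.step


stepper = ShareStepper()
stepper.complete(ShareStep.AUTHENTICATE, user="placeholder-user")
stepper.complete(ShareStep.CHOOSE_GUEST_COLLECTION, collection_id="placeholder-collection-id")
```

Keeping the ordering in one place means the UI can resume a half-finished share without skipping the access-grant step.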

This seems to be a viable path.

Note - Currently, the option to create a Guest Collection is disabled for IU's Globus instance. This may be an admin-level setting that IU's Globus admins would need to enable. I also believe we may already be on the subscription needed for the Share feature, but the option to create Guest Collections would need to be enabled before we can use it.