ga4gh / cloud-interop-testing

Interoperable execution of workflows using GA4GH APIs
Apache License 2.0

Write up background on cart service based on what was found by talking to Elixir, Terra, U. Chicago, etc. #100

Open briandoconnor opened 4 years ago

briandoconnor commented 4 years ago

First steps:

cdvoisin commented 4 years ago

Kurt Rodarmer of NCBI has stepped forward to own this issue. Thank you Kurt!

rishidev commented 4 years ago

@mbarkley to send web sequence diagrams to Kurt

mbarkley commented 4 years ago

@kwrodarmer apologies for taking so long to follow up on this, but here is a link to an image of the web sequence diagram of the SRA token process (or at least my understanding of it after our discussions): https://drive.google.com/file/d/1NMwxCKir9jca4hdcw-ku1Uy15E2uWHAs/view?usp=sharing

That image was generated at sequencediagram.org using this text:

title SRA (rough sketch)

actor Researcher
participant Passport Endpoint
participant Run Selector
participant Cart Service
#note over Cart: has permissions
participant SDL
participant Compute Env
participant Signed URL Redirector
note over Compute Env: WES, VM, etc

Researcher->Passport Endpoint: Do authentication right away to get user passport
note over Passport Endpoint: This part is GA4GH AAI/Passport flow
Researcher->Run Selector: User does faceted selection
note over Researcher: Token management is happening in web browser
Researcher->Cart Service: Send selection and passport to mint cart token
note over Cart Service: Produces newly signed/minted token with copy of SRA permissions
note over Cart Service: It can down-scope to particular visas required for the given run-selection
Cart Service->Researcher: Receive down-scoped token (or DBGAP passport that is not downscoped)

note over Researcher: There is a "simple" exit where the down-scoped token\n is downloaded to the user machine,\nused as a bearer token to download data

note over Researcher: Start simple case
Researcher->SDL: Resolve access on cloud (w/ cart token)
note over Researcher: End simple case

note over Researcher: Start "managed compute" case
Researcher->Compute Env: Start compute environment
Researcher->Cart Service: Rebind user cart token to "bound cart token"
Compute Env->SDL: Request to SDL with bound cart token
SDL->Compute Env: Respond with URL
note over SDL: Can return "naked URLs" (direct to object servers such as NCBI)\nwith no other auth tokens, or the object is in cloud. Cloud objects\n have three cases: open access, user-pays, and controlled access.\nFor the first two cases, urls to the cloud objects are given.\nFor the controlled access case, a signed URL to "Signed URL Redirector"\nservice is given. That service will redirect valid inbound HTTP requests\nto the actual resource (via a cloud-provider signed URL).

kwrodarmer commented 4 years ago

I'd like to start with a conceptual overview of how NIH sees the cart concept.

First, a cart is a type of dataset. The objective is to have a container object whose contents can be created and managed, and which can be used wherever a dataset would be used within a workflow.

Second, there is a notion that the dataset might be bound with authorizations such that the cart object itself becomes standalone. This is the type of cart implemented by the SRA, and it has pluses and minuses. I mention it now because the ability to carry authorization has an effect on the representation of a cart.

The simplest idea of a cart as a selection object is to hold a set of object descriptors. The GA4GH notion of an object descriptor is a DRS id. Since a DRS id can refer to either a single object or a bundle (which may itself contain objects and other bundles), a cart containing DRS ids can be as explicit or expanded as appropriate.

A cart object in theory has no upper bound on the number of items it may contain. That said, engineering practice requires us to impose some limits within a standard. Some of those limits may be visible to the end user and others not, but size is a definite engineering concern. A cart object, whatever its form, should be assumed to make use of POST methods during transport. Additionally, we should consider that a cart can be paginated, which implies that the pages of one cart would need a common id for joins and some spec for indicating the subset each page represents.
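To make the pagination idea concrete, here is a rough sketch, not a proposed spec. The field names (`cart_id`, `page`, `pages`, `items`) and the `example.org` DRS ids are assumptions made up for illustration. Each page repeats the common cart id and says which subset it represents, so a receiver can join the pages back together.

```python
from typing import Dict, List


def paginate_cart(cart_id: str, ids: List[str], page_size: int) -> List[Dict]:
    """Split a list of ids into pages that all carry the same cart id."""
    if page_size < 1:
        raise ValueError("page_size must be positive")
    # Ceiling division; an empty cart is still represented by one empty page.
    total = max(1, -(-len(ids) // page_size))
    return [
        {
            "cart_id": cart_id,   # common id used for joins (hypothetical name)
            "page": i,            # zero-based index of this page (hypothetical)
            "pages": total,       # total number of pages (hypothetical)
            "items": ids[i * page_size:(i + 1) * page_size],
        }
        for i in range(total)
    ]


def join_pages(pages: List[Dict]) -> List[str]:
    """Reassemble the full item list from pages received in any order."""
    if len({p["cart_id"] for p in pages}) != 1:
        raise ValueError("pages must all belong to exactly one cart")
    ordered = sorted(pages, key=lambda p: p["page"])
    if [p["page"] for p in ordered] != list(range(ordered[0]["pages"])):
        raise ValueError("missing or duplicate pages")
    return [item for p in ordered for item in p["items"]]


ids = [f"drs://example.org/obj-{n}" for n in range(5)]
pages = paginate_cart("cart-0001", ids, page_size=2)
```

Each page is small enough to send in its own POST, and `join_pages` refuses to combine pages from different carts or an incomplete set.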

In concept, a cart could then be as simple as a JSON object with an object id, an optional pagination spec, and a list of DRS ids. If the latter are not universal enough to represent all contents, then other URI schemes can be incorporated. This is particularly the case for a representation of a non-deterministic query which at present is not representable under DRS.
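As a minimal sketch of that shape, assuming hypothetical field names (`cart_id`, `page`, `items`) and made-up `example.org` DRS ids, a cart could serialize like this:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PageSpec:
    index: int  # zero-based page number (hypothetical field)
    total: int  # total number of pages (hypothetical field)


@dataclass
class Cart:
    cart_id: str                                     # the object id
    items: List[str] = field(default_factory=list)   # DRS ids, or other URIs
    page: Optional[PageSpec] = None                  # optional pagination spec

    def to_json(self) -> str:
        doc = asdict(self)
        if doc["page"] is None:
            del doc["page"]  # omit the pagination spec when it is not used
        return json.dumps(doc, sort_keys=True)


cart = Cart(
    cart_id="cart-0001",
    items=["drs://example.org/obj-1", "drs://example.org/bundle-7"],
)
print(cart.to_json())
```

Because `items` is just a list of URI strings, entries using another scheme could sit next to the DRS ids without changing the structure.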

We will want to add a grouping facility for factorization because size really is a concern here, and if it is possible to factor out common substrings, we will probably be happier than not. Compression is another possibility. Both JSON and gzip suffer from resistance to streaming, and both benefit from the ability to paginate.
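One possible way to factor out common substrings, shown only as a sketch: group ids by everything up to the last `/` and store the shared prefix once per group. The grouping rule, the function names, and the ids are assumptions for illustration; a real spec might factor differently.

```python
from collections import defaultdict
from typing import Dict, List


def factor_by_prefix(ids: List[str]) -> Dict[str, List[str]]:
    """Group ids by the part up to and including the last '/'."""
    groups = defaultdict(list)
    for drs_id in ids:
        prefix, sep, suffix = drs_id.rpartition("/")
        groups[prefix + sep].append(suffix)  # the prefix is stored once per group
    return dict(groups)


def expand(groups: Dict[str, List[str]]) -> List[str]:
    """Rebuild the full ids from the factored form."""
    return [prefix + suffix for prefix, suffixes in groups.items() for suffix in suffixes]


ids = [
    "drs://example.org/SRR000001",
    "drs://example.org/SRR000002",
    "drs://other.example/abc",
]
groups = factor_by_prefix(ids)
```

Note that `expand` returns ids group by group, so the original ordering across groups is not preserved. If order matters, the representation would need to carry it explicitly.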

I suggest that the basic concept of a cart be kept separate from passports and visas. When a cart is used, it will be accompanied by a passport with visas that carry the authorization needed to access the blobs the cart identifies. That said, the cart concept can represent a very concise and proper definition of a researcher's needs and intentions, can be easy and intuitive to build, and allows automation to map from the dataset back to the visas needed for access. The passport and visa token generation system can be augmented to take a cart as a downscoping indicator so that the resulting auth tokens are minimized.
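A sketch of using a cart as a downscoping indicator might look like the following. The lookup table `dataset_of`, the simplified visa dicts with a single `dataset` key, and the accession-style values are all assumptions for illustration. Real GA4GH visas carry more structure than this.

```python
from typing import Dict, List, Optional


def downscope_visas(
    cart_items: List[str],
    dataset_of: Dict[str, Optional[str]],
    visas: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Keep only the visas needed to access the items in the cart."""
    # Items mapped to None (or not mapped at all) are treated as needing no visa.
    needed = {
        dataset_of[item]
        for item in cart_items
        if dataset_of.get(item) is not None
    }
    return [visa for visa in visas if visa["dataset"] in needed]


# Hypothetical mapping from DRS id to the controlled-access dataset it belongs to.
dataset_of = {
    "drs://example.org/obj-1": "phs000001",
    "drs://example.org/obj-2": "phs000002",
    "drs://example.org/obj-3": None,  # open access in this sketch
}
visas = [{"dataset": "phs000001"}, {"dataset": "phs000002"}, {"dataset": "phs000003"}]
cart_items = ["drs://example.org/obj-1", "drs://example.org/obj-3"]
minimal = downscope_visas(cart_items, dataset_of, visas)
```

The cart itself stays free of authorization data here; it only tells the token generator which of the user's visas are relevant.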

Finally, it is possible and in some cases advantageous to bind cart contents into an auth token. The SRA does exactly this already by downscoping the visas to the minimum set, and then recording the exact ids of accessible objects, all in a single token. This token carries both dbGaP authorizations and an explicit set of object designations that come from the intersection of the selection in the cart with the authorized datasets in the visas. The result is a precisely scoped auth token.
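The intersection described above can be sketched as follows. This is not the SRA implementation: the payload layout, the `dataset_of` lookup, and the simplified visa dicts are assumptions, and signing of the resulting token is left out entirely.

```python
from typing import Dict, List


def bind_cart(
    cart_items: List[str],
    dataset_of: Dict[str, str],
    visas: List[Dict[str, str]],
) -> Dict[str, List[str]]:
    """Build an unsigned token payload from a cart and the user's visas."""
    authorized = {visa["dataset"] for visa in visas}
    # Intersection: cart items whose dataset is covered by some visa.
    objects = [item for item in cart_items if dataset_of.get(item) in authorized]
    # Minimum set of authorizations actually used by those objects.
    used = sorted({dataset_of[item] for item in objects})
    return {"visas": used, "objects": objects}


# Hypothetical inputs for illustration only.
dataset_of = {
    "drs://example.org/obj-1": "phs000001",
    "drs://example.org/obj-2": "phs000002",
    "drs://example.org/obj-4": "phs000004",
}
visas = [{"dataset": "phs000001"}, {"dataset": "phs000003"}]
cart_items = list(dataset_of)
payload = bind_cart(cart_items, dataset_of, visas)
```

Items the user selected but is not authorized for simply drop out of `objects`, and visas that cover nothing in the cart drop out of `visas`.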