ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.
Apache License 2.0

Pass TES endpoints to WES request #170

Open uniqueg opened 3 years ago

uniqueg commented 3 years ago

For WES implementations that support TES (e.g., cwl-WES), there is currently no mechanism to provide a list of acceptable TES endpoints. As a result, TES endpoints themselves, or a mechanism to obtain them (e.g., from a GA4GH Service Registry), have to be pre-configured in a given WES instance.

To enable effective task-level compute federation, it would be beneficial to add a way to pass either:

- a list of TES endpoints, or
- a list of GA4GH Service Registry endpoints from which available TES endpoints can be obtained.

The mechanism could then be used to (1) allow a WES client to designate TES backends that the WES can use and (2) provide inputs to a task distribution middleware (e.g., TEStribute) integrated into a WES instance to "smartly" determine the best TES for a given task, e.g., the TES instance that is closest to the bulk of the data or has the lowest load.

The property should be optional in general, but any given WES implementation may choose to require it. Here's a quick-and-dirty draft for a corresponding schema:

TesBackends:
  type: object
  description: TES backends to be used to execute the workflow run.
  properties:
    tes_urls:
      type: array
      items:
        type: string
    service_registry_urls:
      type: array
      items:
        type: string

Something like the following could then be added to the RunRequest schema:

tes_backends:
  $ref: '#/definitions/TesBackends'
  description: TES backends to be used to execute the workflow run.
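To make the intended usage concrete, here is a sketch of what a client-side run request could look like under the proposed schema. This is a minimal sketch only; the workflow fields and all URLs are hypothetical placeholders, not real deployments.

```python
# Sketch of a client-side RunRequest payload under the proposed schema.
# All URLs are hypothetical placeholders.
run_request = {
    "workflow_url": "https://example.org/workflows/hello.cwl",
    "workflow_type": "CWL",
    "workflow_type_version": "v1.2",
    "workflow_params": "{}",
    "tes_backends": {
        "tes_urls": [
            "https://tes.example-a.org/ga4gh/tes/v1",
            "https://tes.example-b.org/ga4gh/tes/v1",
        ],
        "service_registry_urls": [
            "https://registry.example.org/ga4gh/registry/v1",
        ],
    },
}


def has_explicit_backends(req: dict) -> bool:
    """Check whether the client designated any TES backends or registries.

    Since `tes_backends` would be optional, a WES could fall back to its
    pre-configured backends when this returns False.
    """
    backends = req.get("tes_backends") or {}
    return bool(backends.get("tes_urls") or backends.get("service_registry_urls"))
```

Keeping both lists optional preserves backward compatibility: a request without `tes_backends` behaves exactly like today.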

Note that one could of course skip the addition of a way to pass Service Registry URLs, as the client application could make the call to get TES URLs itself before calling WES. However, I have added it as it will allow a WES implementation to dynamically obtain available TES instances during execution. This might be beneficial in some cases, especially for long-running workflows. However, it is not really essential and makes the added schema more complicated. A simpler alternative would be to just add the following property to the RunRequest schema:

tes_backends:
  type: array
  description: TES backends to be used to execute the workflow run.
  items:
    type: string
    description: Root URL of a TES API.

patmagee commented 3 years ago

@uniqueg This is an interesting idea that would definitely provide greater synergy between WES and TES (something I think the spec greatly needs). It may also be a great path forward for defining a language-agnostic workflow engine. I do have a few questions (which you may or may not have thought of already) that may help flesh out this idea some more.

  1. Is there a reference implementation that allows dynamic configuration of TES engines?
  2. How does this fit into the current landscape of WES implementations? Would we expect existing engines to be able to be dynamically configured, or is this more of an "opt-in" feature?
  3. Have you put any thought of how authentication would work? The WES engine would need appropriate permissions to submit and poll the TES backend, at least until the task is done, if not longer (depending on how the engine works).
    • Would it be expected that a WES engine can communicate with ANY TES API without a pre-existing relationship with it (i.e., plumb down the user's credentials)?
    • Would a WES be able to define a list of pre-existing TES instances that the user can submit jobs to? This would avoid a lot of auth issues and potentially even data access issues.
  4. How would data access work?
    • How does the WES engine access the runtime logs? Would the WES need to move the logs or stream them to a separate accessible location?
    • How would data access for the end user and between steps work? Would it be expected that every TES API work with DRS to be able to get access to the data? If so, what credentials would WES use?
      1. How would we be able to avoid data egress between steps? Theoretically, allowing the user to define any TES would allow them to run tasks in different environments or clouds, which could incur huge costs if not managed properly.
uniqueg commented 2 years ago

Hi @patmagee, unfortunately I'm not on top of my GitHub notifications at all and didn't see your answer. They are all excellent points! I think we have thought about all of them, but I am not sure that we came to great/convincing solutions for some of them. Anyway, let me give it a try.

> Is there a reference implementation that allows dynamic configuration of TES engines?

Not yet, but we have most of the pieces and just need to finalize some things and then tie them together. Specifically, we are working on an ELIXIR Cloud & AAI service registry (an implementation of the GA4GH Service Registry API) that would then hold all our Cloud API deployments, including TES backends. As I mentioned in the original post, we have this very naive task distribution logic package/service, TEStribute, and we also have a sort of gateway TES service (proTES; 75% done) into which we could plug that as middleware and then implement a client that fetches available TES instances from the service registry dynamically (currently, TES instances are hard-coded). It will probably work to some extent, but (a) TEStribute makes some assumptions about DRS and TES that are beyond the current specs, (b) it would help if the GA4GH Service Registry supported fetching only services that are live and healthy and that a user is actually allowed to use, and (c) the whole AAI flow is still not fully designed and won't be interoperable, so that requires discussions here, with TES, with Cloud WS in general, with FASP and with Passports... Also, as I mentioned, TEStribute is really quite naive at the moment, and it doesn't consider restrictions on where data can move (and if at all), etc. Still, it should be good enough to be able to prototype this at some point, possibly at the next Plenary in fall.
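The dynamic-fetching client mentioned above could be sketched roughly as follows, assuming a standard GA4GH Service Registry `GET /services` listing and filtering on the `type.artifact` field of each service record. Filtering for liveness, health and user permissions (point (b)) is deliberately left out, as that is exactly what the current API does not support.

```python
import json
from urllib.request import urlopen


def filter_tes_services(services: list) -> list:
    """Keep only the root URLs of services registered as TES.

    Service Registry records carry a `type` object with `group`,
    `artifact` and `version`; TES deployments use artifact "tes".
    """
    return [
        svc["url"]
        for svc in services
        if svc.get("type", {}).get("artifact") == "tes" and "url" in svc
    ]


def tes_urls_from_registry(registry_url: str) -> list:
    """Fetch the full service listing and narrow it down to TES endpoints."""
    with urlopen(f"{registry_url.rstrip('/')}/services") as resp:
        return filter_tes_services(json.load(resp))
```

A WES (or a middleware like TEStribute) could call this at run submission time, or periodically for long-running workflows, instead of relying on a hard-coded list.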

> How does this fit into the current landscape of WES implementations? Would we expect existing engines to be able to be dynamically configured, or is this more of an "opt-in" feature?

I would say it should certainly be an opt-in feature. WES on its own, even locally deployed and even in a single-tenant environment, has good use cases, and we don't expect that to change. Federating compute at the level of workflows, tasks and even individual computations (a whole other topic!) is important, because it serves use cases that otherwise couldn't, or couldn't easily, be addressed. However, I don't think it is likely to dominate how computation is done in the life sciences anytime soon (if ever).

> Have you put any thought into how authentication would work? The WES engine would need appropriate permissions to submit and poll the TES backend, at least until the task is done, if not longer (depending on how the engine works). Would it be expected that a WES engine can communicate with ANY TES API without a pre-existing relationship with it (i.e., plumb down the user's credentials)? Would a WES be able to define a list of pre-existing TES instances that the user can submit jobs to? This would avoid a lot of auth issues and potentially even data access issues.

As I mentioned, the AAI flow is still not really worked out, to a large extent because we felt that Passports and how they are to be consumed in WES and DRS context is still very dynamic. Basically, we are waiting for this to settle down a bit. However, we have thought about it, of course, and to answer your more specific questions:

> How would data access work? How does the WES engine access the runtime logs? Would the WES need to move the logs or stream them to a separate accessible location? How would data access for the end user and between steps work? Would it be expected that every TES API works with DRS to be able to get access to the data? If so, what credentials would WES use?

In principle, data access would work just like in WES.

> How would we be able to avoid data egress between steps? Theoretically, allowing the user to define any TES would allow them to run tasks in different environments or clouds, which could incur huge costs if not managed properly.

Hmm, I'd venture that running tasks across different backends in this setup is a feature rather than a problem (think of load balancing, or of workflows whose different steps need to access data at different locations that cannot move, etc.). So how can we ensure that costs are minimized? In TEStribute, we basically allow users to decide whether to prioritize costs or runtime for a given task. TES then has a sidecar service (or possibly a future TES endpoint) that users can query with their task resource requirements and DRS URIs and which returns the estimated costs (sum of compute, data transfer and storage costs) and runtime (sum of waiting time, expected runtime and data transfer time) for each combination of input locations and TES instance (more details in this slide deck). If staying on a given TES for different tasks is MUCH cheaper than using another TES, compute will happen at that same TES (unless it's going to be much slower and the user specified that they value a lower runtime over higher costs). That cost-vs-runtime parameter in our implementation is a float between 0 (cost-optimized only) and 1 (runtime-optimized only). Putting a reasonable value there (not 1!) should ensure that costs won't skyrocket.

One other thing that leaves us a bit clueless is how to pass that extra info (like the cost/runtime parameter and whether they want to use a TES network at all) from the user to WES to TES, and how to implement it such that it won't become overly complicated for the user. But after all, this is something to figure out (at least partially) even in a (federated) WES-only world. Using TES would just add another layer that may help curb costs, balance loads and execute workflows in cases where data that cannot move resides at different places.
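One stopgap, purely as an assumption rather than anything agreed upon, would be to tunnel such preferences through the free-form `tags` map that the WES RunRequest already provides; the key names below are invented for illustration only.

```python
# Hypothetical: smuggle task-distribution preferences through the
# existing free-form `tags` map of a WES RunRequest. Nothing in the
# WES spec defines these keys; they are placeholders for illustration.
run_request_tags = {
    "testribute.enabled": "true",
    # 0.0 = cost-optimized only, 1.0 = runtime-optimized only
    "testribute.runtime_vs_cost": "0.3",
}

# A distribution middleware would parse the weight back out, falling
# back to a default when the tag is absent or malformed.
def distribution_weight(tags: dict, default: float = 0.5) -> float:
    try:
        value = float(tags.get("testribute.runtime_vs_cost", default))
    except (TypeError, ValueError):
        return default
    return min(max(value, 0.0), 1.0)
```

The obvious downside is that stringly-typed tags are invisible to schema validation, which is part of why a first-class `tes_backends` property seems preferable.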

Anyway, a lot to be discussed and done here, still, but good to have a start. And we might be able to have a prototype at some point next year, if things go well.

I am also attaching this image to make the general design we envision a bit clearer:

[Image: 2021_12_09-elixir_cloud_schema]