ga4gh / task-execution-schemas

How do we define the technical capabilities of a given TES API #188

Open patmagee opened 1 year ago

patmagee commented 1 year ago

TES provides an abstraction layer on top of the smallest unit of work in the workflow execution stack and interfaces directly with some sort of compute infrastructure. At the moment, users of the TES API need to know and understand the technical constraints of the underlying compute environment in order to know what capabilities a given TES API has. This is a problem insofar as it leaks implementation details and requires the user to have that knowledge before using the API.

If we push the boundaries of WES and TES, the natural conclusion is federating work submitted to WES across different TES backends that fit the mould of the requested task resources/capabilities. This implies a lot of machine-to-machine interaction, where a WES API determines where to send work. To accomplish this, we really need a way to describe the complete list of technical capabilities of a TES backend. #186 is a good example of a specific technical capability that would need to be described in some way. Other examples would be GPU support (and which GPU lines), the range of available CPUs, the supported data types, and so on.
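To make this concrete, a capabilities document could look something like the sketch below. All of the field names are hypothetical; nothing like this exists in the current spec, and it only illustrates the kind of information a scheduler would need:

```python
# Hypothetical capabilities document a TES backend could broadcast,
# e.g. via service-info. Every field name here is an assumption made
# for illustration, not part of the TES spec.
tes_capabilities = {
    "cpu_cores": {"min": 1, "max": 96},
    "ram_gb": {"min": 0.5, "max": 768},
    "disk_gb": {"max": 16000},
    "gpus": [{"vendor": "nvidia", "model": "A100", "max_count": 8}],
    "supported_storage_protocols": ["s3", "gs", "file"],
    "preemptible": True,
}
```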

You could imagine that a workflow run through WES could take each individual task (in CWL/WDL parlance) and use heuristics to map it onto a particular TES backend. I know that some work has been done by @uniqueg and his team on building a gateway TES that plays a similar role, but I wonder what would be required to make any TES API able to participate in this gateway approach.
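As a rough illustration of that heuristic mapping, here is a minimal sketch in Python. It assumes the hypothetical capabilities structure above and a flat dict of task resource requests; none of this is defined by the spec:

```python
def matches(task_resources: dict, capabilities: dict) -> bool:
    """Naively check that a task's resource requests fit a backend's
    advertised capabilities (hypothetical structures from above)."""
    if task_resources.get("cpu_cores", 1) > capabilities["cpu_cores"]["max"]:
        return False
    if task_resources.get("ram_gb", 1) > capabilities["ram_gb"]["max"]:
        return False
    # Every storage URL the task touches must use a supported protocol.
    supported = set(capabilities["supported_storage_protocols"])
    for url in task_resources.get("storage_urls", []):
        if url.split("://", 1)[0] not in supported:
            return False
    return True


def pick_backend(task_resources: dict, backends: dict) -> str | None:
    """Return the name of the first backend whose capabilities fit."""
    for name, capabilities in backends.items():
        if matches(task_resources, capabilities):
            return name
    return None
```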

uniqueg commented 1 year ago

Thanks @patmagee! I/we agree that this is a very important discussion to be had. The ability to provide arbitrary backend parameters in v1.1, coupled with the ability to specify some resource requirements and the broadcasting of some capabilities (supported storage protocols) via the service-info endpoint, might serve as a blueprint/starting point for that.
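For reference, that v1.1 starting point looks roughly like the sketch below. The base URL is made up, and the exact shape of the response will vary by implementation:

```python
import requests

# Hypothetical TES deployment; the URL is made up for illustration.
resp = requests.get("https://tes.example.org/ga4gh/tes/v1/service-info")
info = resp.json()

# TES v1.1 service-info can advertise supported storage locations and
# the names of accepted backend parameters, e.g.:
#   info["storage"] -> ["s3://my-bucket", "file:///scratch"]
#   info["tesResources_backend_parameters"] -> ["VmSize", "spot"]
print(info.get("storage"))
print(info.get("tesResources_backend_parameters"))
```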

To address the constraints you mentioned, in a next step we could try to agree on a controlled vocabulary for capabilities and resource requirements, ideally starting with those that map well across a wide range of backends, and then extend the service-info response accordingly to broadcast these capabilities. A nice side effect is that one could then find appropriate TES instances dynamically via the Service Registry API.
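A sketch of that dynamic discovery, assuming a Service Registry at a made-up URL and filtering only on the storage protocols already broadcast in v1.1 (richer capability fields would be the proposed extension):

```python
import requests

REGISTRY = "https://registry.example.org/ga4gh/services"  # made-up URL


def find_tes_with_protocol(protocol: str) -> list[str]:
    """List registered TES instances that advertise a given storage
    protocol in their service-info document."""
    hits = []
    for svc in requests.get(REGISTRY).json():
        # Service Registry entries carry a GA4GH service type; keep TES only.
        if svc.get("type", {}).get("artifact") != "tes":
            continue
        if any(u.startswith(protocol) for u in svc.get("storage", [])):
            hits.append(svc["url"])
    return hits


print(find_tes_with_protocol("s3"))
```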

Apart from the gateway TES, we have previously also worked on a task distribution logic that takes into account the location of the data, so as to send the compute to the data. It also considers costs and expected completion time. To make that work, we deployed a GET /tasks/info endpoint as a sidecar next to our TES instances: sending the resource requirements to that endpoint tells the client the expected costs and queueing time. It's a very naive model, but it was a first attempt at addressing the use cases of minimizing data transfer and balancing load across multiple TES services.
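In client terms, that looks roughly like the sketch below. The payload and response shapes are assumptions based on the description above, not a published schema:

```python
import requests


def rank_backends(tes_urls: list[str], resources: dict) -> list:
    """Ask each TES's sidecar GET /tasks/info endpoint for expected cost
    and queueing time, then rank the backends. Endpoint behaviour and
    field names are assumed for illustration."""
    estimates = []
    for url in tes_urls:
        resp = requests.get(f"{url}/tasks/info", params=resources)
        est = resp.json()  # e.g. {"estimated_cost": 0.42, "queue_time_s": 120}
        estimates.append((url, est))
    # Naive scoring: one unit of cost weighed the same as one hour of queue.
    return sorted(
        estimates,
        key=lambda e: e[1]["estimated_cost"] + e[1]["queue_time_s"] / 3600,
    )
```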

For more details, have a look here: https://github.com/elixir-cloud-aai/TEStribute