Supporting persistent data volumes accessible by all executors

ga4gh / task-execution-schemas

Apache License 2.0


Supporting persistent data volumes accessible by all executors #186

Open uniqueg opened 2 years ago

uniqueg commented 2 years ago

Having a TES implementation with access to a persistent data volume is something that the Greek ELIXIR node requested (see here for more details). A potential use case is a TES implementation deployed in an environment where it repeatedly runs specific sets of tasks that use the same reference data over and over again.

The current specification of tesTask.volumes does not meet this requirement, as it states that volumes "are initialized as empty directories".
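
To make the limitation concrete, here is a minimal sketch of a task as a client would have to write it today. The field names (`executors`, `inputs`, `volumes`) are from the TES schema; the task name, image name, URL, and paths are hypothetical placeholders. Because the volume starts out empty, the reference genome has to be listed as an input and staged again for every task.

```python
import json

# Sketch of a TES task body. The image, URL, and paths are made-up examples.
task = {
    "name": "align-sample",
    "inputs": [
        {
            # The reference data must be staged for every single task,
            # because nothing survives in a volume between tasks.
            "url": "s3://example-bucket/ref/genome.fa",
            "path": "/data/ref/genome.fa",
        }
    ],
    "executors": [
        {
            "image": "example/aligner:1.0",
            "command": ["align", "--ref", "/data/ref/genome.fa"],
        }
    ],
    # Shared between the executors of this task, but initialized as an
    # empty directory when the task starts.
    "volumes": ["/scratch"],
}

print(json.dumps(task, indent=2))
```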

A similar request was/is also discussed in Cromwell: https://github.com/broadinstitute/cromwell/issues/2190

I don't really have in mind what this could look like, but I thought I would open this issue so that we could discuss.

Thanks to @zagganas and @hex43ver

uniqueg commented 2 years ago

Some random thoughts for the discussion:

Should there be a way for a client to ask a TES deployment to make a particular object persist? If so, where exactly, for how long, and how should TES communicate that it did so?
Should there be a mechanism to populate a persistent volume in bulk or should that be outside of the specs?
How would a client know what persistent data a TES deployment has? Could we do this via DRS?
Is this whole feature perhaps outside the scope of TES, and should we just find a TES-compliant workaround that can be realized in a given TES implementation?

noooonee commented 2 years ago

Thank you, @uniqueg. This issue describes our request precisely. We have human genome files (~20 GB) and some static internal binary data that need to be pre-populated once before data processing, and we want to minimize file copying.

This doesn't necessarily require a change to the TES API if implementations can provide such a capability. But if the TES API defined a standard representation, it would help other implementations a lot.

As for the syntax, my personal thought is that the Docker volume expression may be good enough:

    name-of-a-custom-volume:/path-inside-container
    path-from-runtime-node-host:/path-inside-container
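
A rough sketch of how an implementation might interpret these two forms follows. The `VolumeMount` type and `parse_volume` function are hypothetical and not part of TES or Docker. It assumes a source starting with `/` is a host path and anything else is a named volume, and it deliberately ignores Docker's optional third field (such as `:ro`) and Windows-style paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeMount:
    source: str          # named volume, or absolute path on the host
    container_path: str  # absolute path inside the container
    is_host_path: bool   # True when source is a host path


def parse_volume(expr: str) -> VolumeMount:
    """Parse 'source:/path-inside-container' (two fields only)."""
    parts = expr.split(":")
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected 'source:/container/path', got {expr!r}")
    source, container_path = parts
    if not container_path.startswith("/"):
        raise ValueError(f"container path must be absolute: {container_path!r}")
    # Assumption: a leading slash marks a host path; otherwise a named volume.
    return VolumeMount(source, container_path, source.startswith("/"))


print(parse_volume("refdata:/data/ref"))
print(parse_volume("/mnt/genomes:/data/genomes"))
```

A client or server could use a check like this to reject host-path mounts outright while still allowing named volumes, which is one way to keep the scope small.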

Many more ideas might come up, such as Docker's bind propagation concepts, but I understand that TES must keep its scope at a maintainable level.