ga4gh / task-execution-schemas

Apache License 2.0

Request for clarification: Volumes field #64

Closed: mbookman closed 6 years ago

mbookman commented 7 years ago

The task_execution.proto indicates that tasks can contain Volumes, but it is not clear how this field should be used. Can this be expanded:

  // Declared volumes.
  // Volumes are shared between executors. Volumes for inputs and outputs are 
  // inferred and should not be delcared here.
  repeated string volumes = 10;

(Note there is a typo: delcared.)

What kind of string should one put here? Is it environment-specific? For Google Cloud, for example, can I specify that a disk needs to be created of a certain size? Is this a mechanism for mounting an existing disk read-only? Is this intended for a Grid Engine environment where an NFS volume needs to be mounted?

Thanks!

buchanae commented 7 years ago

#47 and #30 are related.

As I understand, a volume string is a path inside the container, and the path on the host system is implementation-specific. For example, volumes = ["/path/to/volume"] would translate to docker run -v /path/to/volume.
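To make that mapping concrete, here is a minimal sketch of how an implementation might turn the volumes list into docker run arguments. The helper name docker_run_args and the "ubuntu" image are hypothetical, chosen only for illustration; nothing here comes from the spec or from any particular TES implementation. It assumes each volume string is an absolute path inside the container.

```python
import shlex


def docker_run_args(image, command, volumes):
    """Build a `docker run` argument list for one executor (hypothetical helper).

    Each entry in `volumes` is treated as a path inside the container.
    Passing `-v <path>` with no host part lets Docker create an anonymous
    volume, so the host-side location stays implementation-specific.
    """
    args = ["docker", "run"]
    for path in volumes:
        # Assumption: volume strings are absolute container paths.
        if not path.startswith("/"):
            raise ValueError(f"volume must be an absolute container path: {path!r}")
        args += ["-v", path]
    args.append(image)
    args += list(command)
    return args


if __name__ == "__main__":
    cmd = docker_run_args("ubuntu", ["ls", "/path/to/volume"], ["/path/to/volume"])
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(cmd))
```

Note that an anonymous volume created this way is not automatically visible to other containers, so an implementation that wants to share data between executors would have to back the path with something it controls, such as a named volume or a host directory. That choice is exactly the implementation-specific part.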

Do these volumes exist primarily to share data between executors?

A while back, Volume was a message and included sizeGb and source, which I think mapped more directly to cloud concepts.

Currently, a single disk size requirement lives in the Resources message. I think the idea was that this is simpler for non-cloud (e.g. Grid Engine + NFS) environments, but maybe we strayed too far here and should rethink this?

buchanae commented 6 years ago

@mbookman I think we've cleared up some of the confusion around volumes by simplifying the spec and adding better documentation.

If there's nothing left to do here, can you close this issue? Or if there's more left to do, can you clarify please? Thanks.

mbookman commented 6 years ago

Sure. I think all of my issues with naming still apply, but I can't point to a clearly better replacement.

As a note, in Pipelines API v2alpha1, the analog is the "path" field inside the "mount" element. When porting dsub to use v2alpha1, I similarly just needed to read the documentation to clarify.

I think the confusion comes naturally from the fact that a "volume" or "mount" without clarification can apply to a physical machine, a virtual machine, or a docker container. You're probably just going to need documentation to clarify, unless you name it executor_volume. Even then, you'll still need some amount of documentation about the lifespan of the volume.