ga4gh / task-execution-schemas


Content Output #102


kellrott commented 6 years ago

One of the more complex issues with getting workflow engines deployed is that many of them have to read the contents of at least one file before they can proceed. For CWL, a tool can produce a cwl.output.json; for Galaxy, a galaxy.json file can carry embedded metadata; WDL has a number of functions, like read_string and read_json, that are available in the workflow description. Each of these use cases requires additional code on the workflow engine side to download the file from the object store before the next step of the workflow can proceed.

We've already added a content field to the Input message, which allows config files to be injected without going through the object store. I propose we add a corresponding content field to the Output message. We would also need an additional FileType (something like FILECONTENTS) or some other flag to indicate that the executor should read the contents of a text file into the message and make it available via the TES API, rather than move the file to the object store. We can include a minimum supported length (like 64 KB) past which the full contents are not expected to be cached. I think adding this capability would save a lot of time and code on the client/execution side.
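To make the proposal concrete, here is a minimal Python sketch of how a workflow engine might use it. This is not the current TES spec: the content field on outputs and the FILECONTENTS type are the fields proposed above, the server address is assumed, and the exact place the contents would appear in the task message is still open.

```python
# Hypothetical client-side sketch of the proposed "content output" feature.
# Assumptions: a TES server at TES_URL, a proposed FILECONTENTS output type,
# and a proposed "content" field echoed back on the output entry.
import requests

TES_URL = "http://localhost:8000"  # assumed server address

task = {
    "name": "cwl-step-with-content-output",
    "executors": [{
        "image": "alpine",
        "command": ["sh", "-c", "echo '{\"out\": 42}' > /outputs/cwl.output.json"],
    }],
    "outputs": [{
        # Proposed: for small files (e.g. under ~64 KB), the executor reads the
        # file and returns its contents inline instead of uploading it to
        # object storage.
        "path": "/outputs/cwl.output.json",
        "type": "FILECONTENTS",  # proposed enum value, not in the spec today
    }],
}

# Create the task and remember its id.
task_id = requests.post(f"{TES_URL}/v1/tasks", json=task).json()["id"]

# After the task completes, the engine would read the file body directly from
# the returned task message rather than fetching it from the object store.
done = requests.get(f"{TES_URL}/v1/tasks/{task_id}", params={"view": "FULL"}).json()
for out in done.get("outputs", []):
    if "content" in out:
        print(out["path"], "->", out["content"])
```

With something like this, the engine never touches object storage for cwl.output.json, galaxy.json, or the small files behind WDL's read_string/read_json, which is the time and code saving described above.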

buchanae commented 6 years ago

There are many storage-related requirements for building a workflow engine.

TES can't take on the responsibility of being a workflow engine API, and it shouldn't provide partial solutions to these problems, because they come at a high cost to TES implementors and to spec complexity.

psafont commented 6 years ago

> TES can't take on the responsibility of being a workflow engine API

I fail to see how storage-related information is workflow-exclusive. In a situation where the task executor is in a remote location, the information needed to access remote or secured storage has to be relayed to it in some way.

Take an example about auth: a workflow service may be able to launch tasks on different, remote compute/data hubs, each holding disjoint sensitive data (think human data). In this case the task runners must be able to access the sensitive data because the user has access to it, while the workflow service is simply not able to access the data because it is not available from outside those hubs.

Globbing poses less of a problem than auth, but supporting it would still be beneficial, as it can help reduce traffic when data needs to be saved into remote storage between data hubs.