ga4gh / cloud-interop-testing

Interoperable execution of workflows using GA4GH APIs
Apache License 2.0

Getting results back from DNAStack WES #111

Open ianfore opened 4 years ago

ianfore commented 4 years ago

How do you get back the results of a workflow passed to the DNAStack WES server? Can this be done through the WES API? There are a couple of clues. The run log gives a working directory, and the command line lists an output file name. So we might guess, for the MD5 example, that the output goes to gs://workflow-bucket/md5Sum/run_id/output.md5

Another clue is in the log from the completed run

        "md5Sum.md5": ""
    }

But the value there is unspecified.

The only thing I found in the WES spec on outputs is in the documentation of workflow_params, which are: "The workflow run parameterizations (JSON encoded), including input and output file locations". However, there are no details on how to provide the location, or on how to give the workflow authorization to write to that location. Maybe that last complication is unavoidable. From the above it seems that the output goes to workflow_root, in which case the authorization problem is mine. Can I access workflow-bucket?
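As a concrete illustration of the gap: the spec says workflow_params "includes input and output file locations", but defines no key for outputs. A minimal sketch of what a parameterization might look like, where the `md5Sum.outputDir` key and both bucket names are invented for illustration only:

```python
import json

# Hypothetical workflow_params for the MD5 example. The WES spec does not
# define an output-location key, so "md5Sum.outputDir" below is an
# assumption, not part of any standard.
workflow_params = {
    "md5Sum.inputFile": "gs://example-input-bucket/data.txt",
    # Where the engine *might* be told to write results; no standard key exists.
    "md5Sum.outputDir": "gs://example-output-bucket/md5Sum/",
}

# WES expects the parameterization JSON-encoded in the POST /runs request.
encoded = json.dumps(workflow_params)
print(encoded)
```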

stdout also goes to workflow_root: gs://workflow-bucket/md5Sum/run_id/call-calculateMd5Sum/stdout

The WES documentation states that the stdout value is: "A URL to retrieve standard output logs of the workflow run or task. ... Should be available using the same credentials used to access the WES endpoint." Can the access_token used for the DNAStack WES be used to access the stdout file via the Google Storage API? I didn't expect so, but I tried the following URL, passing the access token used for the workflow as a Bearer token: http://storage.googleapis.com/workflow-bucket/md5Sum/run_id/call-calculateMd5Sum/stdout It returned

<Error>
    <Code>AuthenticationRequired</Code>
    <Message>Authentication required.</Message>
</Error>
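The attempt above can be sketched as follows; the bucket path is the placeholder from this thread and the token is a stand-in for the actual WES access token (the request is built but not sent here):

```python
import urllib.request

# Sketch of the attempt described above: pass the WES access token as a
# Bearer token when fetching the stdout object directly from Google Storage.
# Both the bucket path and the token value are placeholders.
stdout_url = ("http://storage.googleapis.com/workflow-bucket/md5Sum/"
              "run_id/call-calculateMd5Sum/stdout")
access_token = "<wes-access-token>"  # the token used to call the WES endpoint

req = urllib.request.Request(
    stdout_url,
    headers={"Authorization": f"Bearer {access_token}"},
)
# urllib.request.urlopen(req) returns the AuthenticationRequired error shown
# above, because the WES token is not a valid Google Cloud credential.
print(req.get_header("Authorization"))
```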

Note also that the value for stdout (and stderr) returned by the implementation is a URI, not a URL as the specification requires.

patmagee commented 4 years ago

@ianfore the WES server simply mirrors the workflow execution engine's output. The MD5 sum being empty is likely an issue with the signed URLs. The task is completing successfully, but my guess is that the file is never actually being localized to the VM (similar to the GWAS workflow), so the stdout from the task is actually empty.

With WDL, returns are typed (according to the WDL typing specification) and can be any valid JSON value (string, int, boolean, float, etc.). Files are represented by a string; however, no information is communicated that they are actually files, other than the fact that the WDL itself defines them as files. WES/WDL is not actually concerned with how a user identifies or even accesses these files. For the DNAstack WES server you need access to the bucket in GCP in order to access the file-based outputs.

The lack of information on how outputs (and inputs, for that matter) are structured is one of the things I take issue with in the WES specification. For proper interoperability between workflow platforms, this needs to be defined in the WES specification and not left up to the individual language. At the moment, there's no consistent way to represent this information.

A URL to retrieve standard output logs of the workflow run or task. ... Should be available using the same credentials used to access the WES endpoint.

Unfortunately this is a bit more challenging than it sounds. We did not previously do this, but it looks like we could probably figure it out.

ianfore commented 4 years ago

The WDL approach to returning values seems reasonable behavior for WES too. The following is from a working MD5 example on your WES server and seems fit for purpose. No need to store files at the WES server or retrieve them.

    "outputs": { "md5Sum.md5": "AjiSbx/31i1wUqQylo3TeQ==" }

I agree it would likely help if the WES spec provided the capability for a workflow definition to include the model for how the outputs are structured.
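Picking the outputs out of a completed run is then just a matter of reading the run log returned by WES `GET /runs/{run_id}`. A minimal sketch, with a stubbed response body reusing the MD5 result above in place of a live server:

```python
import json

# Given the JSON body returned by WES GET /runs/{run_id}, pull out the
# outputs mapping. Field names follow the WES RunLog schema; the sample
# body is a stub rather than a live response.
def get_outputs(run_log: dict) -> dict:
    return run_log.get("outputs", {})

sample_run_log = json.loads("""
{
  "run_id": "run_id",
  "state": "COMPLETE",
  "outputs": { "md5Sum.md5": "AjiSbx/31i1wUqQylo3TeQ==" }
}
""")

print(get_outputs(sample_run_log))
```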

patmagee commented 4 years ago

It's a difficult issue, because WES is yet one more abstraction on top of other very different abstractions (CWL, WDL, Nextflow, etc.), each of which is itself an abstraction on top of the underlying execution engine. There is definitely a line between being too heavy-handed about specifying how inputs/outputs are defined and not being assertive enough.

In the WDL community, this was a hotly debated topic for the longest time. Originally we did not mandate a minimal implementation requirement for inputs or outputs, but left it up to the specific execution engine. Our thinking was that being too prescriptive would actually restrict adoption of the specification by new engine implementations. Additionally, in our minds it seemed to step outside the bounds of the WDL specification, since WDL is not concerned with HOW it is run, only what specifically is run.

Over time, though, it became clear that this stance introduced a lot of problems.

  1. As users published WDLs on Dockstore or other places, they also published attached inputs. Without the spec defining how these inputs were structured, it essentially locked the example inputs into being run on only a single execution engine. By mandating a specific structure for inputs (and outputs), users of WDL can now share or publish workflows with their inputs and have the guarantee that different execution engines will at least be able to read the inputs file (access is a whole different story).
  2. Engines were converging on a common pattern (Cromwell-style) for supplying inputs and getting outputs, but subtle differences made interop impossible, so any WDL launcher would need to know which execution engine it was talking to.
  3. Creating an automated testing and compliance framework was impossible because of these differences. There are currently efforts underway in the community to build a compliance suite for engines, to make sure they cover all language features. In order for this to work, a common inputs/outputs format MUST be defined.

We recently voted to adopt a specific "Cromwell-style" inputs/outputs format (https://github.com/openwdl/wdl/pull/357) as the required format that all engines must minimally support. We do not restrict additional input/output formats.
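For readers unfamiliar with the convention: Cromwell-style inputs and outputs are flat JSON objects keyed by fully-qualified dotted names rooted at the workflow, with plain JSON values. A sketch using the MD5 example from this thread (the input file path is illustrative):

```python
import json

# Cromwell-style convention: every key is "<workflow>.<binding>" so an
# engine can map it back to the WDL declaration without engine-specific
# rules. Values are ordinary JSON values; files appear as strings.
inputs = {
    "md5Sum.inputFile": "gs://example-bucket/data.txt",  # illustrative path
}
outputs = {
    "md5Sum.md5": "AjiSbx/31i1wUqQylo3TeQ==",
}
# All keys share the workflow name as their root.
assert all(k.startswith("md5Sum.") for k in {**inputs, **outputs})
print(json.dumps({"inputs": inputs, "outputs": outputs}, indent=2))
```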

ianfore commented 4 years ago

Picking up on how files from a WES run might be accessed.

For the DNAstack WES server you need access to the bucket in GCP in order to access the file based outputs.

For the most part direct bucket access isn't being considered, and I can see why not.

You had asked elsewhere if DRS would be an option. Certainly worth looking at. In fact we can test out aspects of this now. Seven Bridges DRS makes any file in a workspace available through DRS.

One of my FASPScripts uses the SB API directly to submit a task rather than WES. I did this as a placeholder until an SB WES server is available. The resulting file is stored in my SB project. I added a script to check for task completion, get the file (DRS) ids, and use DRS to download the resulting files. Again, it has to use the SB API to query the task and get the ids, but it wouldn't need any WES changes to do this. It would work just as you outlined for WDL and return the DRS id as a string.
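The DRS retrieval step described above can be sketched as follows. The endpoint paths follow the DRS v1 API; `fetch_json` is injected so the flow can be shown against a stub rather than a live Seven Bridges server, and all URLs and ids are placeholders:

```python
# Resolve a DRS id to a fetchable URL. A DRS object either carries an
# access_url directly, or an access_id that must be exchanged for a URL
# with a second call.
def resolve_drs(base_url: str, drs_id: str, fetch_json) -> str:
    obj = fetch_json(f"{base_url}/ga4gh/drs/v1/objects/{drs_id}")
    method = obj["access_methods"][0]
    if "access_url" in method:
        return method["access_url"]["url"]
    access = fetch_json(
        f"{base_url}/ga4gh/drs/v1/objects/{drs_id}/access/{method['access_id']}"
    )
    return access["url"]

# Stub standing in for a DRS server response.
def fake_fetch(url):
    return {"access_methods": [
        {"type": "https",
         "access_url": {"url": "https://example.org/result.md5"}}]}

print(resolve_drs("https://drs.example.org", "abc123", fake_fetch))
```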

Files are represented by a string, however there is no information communicated that they are actually files, other than the fact that the WDL itself defines that they are files.

The case for stricter typing you've made above could well apply here, though. The WES server and the organization within which it exists would need to provide the DRS ids and a server to get them. However, the SB setup suggests that's not a huge lift beyond what they do anyway.

This does lead to revisiting some assumptions about DRS, but I'll hold that for elsewhere.

ruchim commented 3 years ago

To me it sounds like we're talking about a GET /outputs for a given workflow ID, where the response is a key/value mapping of output name to output value (be it a file reference, string, int, etc.)?
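One possible shape for such a response, sketched purely as a strawman (the endpoint and field names are invented here; the thread only proposes the idea):

```python
import json

# Hypothetical response body for a proposed GET /runs/{run_id}/outputs:
# a flat name -> value mapping where a value may be a file reference,
# string, int, etc. Nothing here is part of the current WES spec.
proposed_response = {
    "run_id": "run_id",
    "outputs": {
        "md5Sum.md5": "AjiSbx/31i1wUqQylo3TeQ==",        # a string value
        "md5Sum.report": "drs://drs.example.org/abc123",  # a file reference
    },
}
print(json.dumps(proposed_response, indent=2))
```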

ianfore commented 3 years ago

Some progress has been made on this in this notebook, albeit on a different WES implementation. A difference from how it was discussed above is that you don't explicitly say "put the result in this DRS location"; it just happens. You can then pick up the results from the DRS service run by the same provider. You have to do a separate authentication for the DRS service, though, the irony being that it's the same set of credentials as for the WES. The other part is that the script did have to do the download rather than having it pushed. Other than that it seems very convenient to use.

This raises the question of whether a WES server would also have a companion DRS for results. I suspect that, for the most part, if you have compute privileges on a system you will have storage privileges too. This might be a common pattern and therefore worth supporting, e.g. having the WES authentication serve for the DRS retrieval too.
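The pattern suggested here is simply one credential serving both services. A minimal sketch, with placeholder hostnames and token (today the notebook must authenticate twice even though the credentials are the same):

```python
# One shared credential serving both the WES submission and the companion
# DRS retrieval; only the Authorization header is built here.
def auth_headers(access_token: str) -> dict:
    return {"Authorization": f"Bearer {access_token}"}

token = "<shared-access-token>"  # placeholder
wes_runs_url = "https://wes.example.org/ga4gh/wes/v1/runs"
drs_objects_url = "https://drs.example.org/ga4gh/drs/v1/objects"

# If WES auth served DRS too, both calls could reuse this same header.
print(auth_headers(token)["Authorization"])
```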