ga4gh / tool-registry-service-schemas

APIs for discovering genomics tools, their metadata and their containers
Apache License 2.0
30 stars 18 forks source link

Workflow engines fail to run workflows referenced by certain TRS URLs #247

Open svonworl opened 7 months ago

svonworl commented 7 months ago

As detailed in https://github.com/dockstore/dockstore/issues/5594, some workflow engines, including miniwdl and Cromwell, fail to run a valid Dockstore workflow when:

For example, the following invocation fails:

miniwdl run 'https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor'
(https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor Ln 31 Col 1) Failed to import ../../../../../../tasks/broad/UnmappedBamToAlignedBam.wdl
HTTP Error 404: Not Found

Turns out the problem is bigger - the workflow engines also fail to run workflows that import files with relative paths.

The root cause is that, when the engines calculate the URL of an import, they interpret the specified TRS URL as a file path. However, a TRS URL doesn't represent a file path, so the engines miscalculate the import URLs and fail when they attempt to load them.

For example, given the TRS URL

https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor

and an import referenced by a relative path

../../../../../../tasks/broad/UnmappedBamToAlignedBam.wdl

The engines calculate the import URL, by applying typical file resolution semantics, as:

https://dockstore.org/api/ga4gh/tasks/broad/UnmappedBamToAlignedBam.wdl

The above URL is a corrupt TRS URL, because parts of the original TRS URL have been deleted. During the import URL calculation, the engine drops the trailing descriptor portion of the TRS URL because it looks like a filename, and when the engine normalizes the URL prior to the request, it collapses the parent directory references and more of the original TRS URL is deleted.

Per the TRS spec, a relative path can be appended to the TRS primary descriptor URL, and it will resolve the file relative to the primary descriptor and return its contents. So, the correct URL is:

https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fbroadinstitute%2Fwarp%2FWholeGenomeGermlineSingleSample/versions/tw_GL-2036_create_rtools_docker/PLAIN_WDL/descriptor/../../../../../../tasks/broad/UnmappedBamToAlignedBam.wdl

Note that when miniwdl is run with a URL that references the raw github files, it works as expected:

miniwdl run 'https://raw.githubusercontent.com/broadinstitute/warp/tw_GL-2036_create_rtools_docker/pipelines/broad/dna_seq/germline/single_sample/wgs/WholeGenomeGermlineSingleSample.wdl'

Why does it work? The github URL ends with the absolute path of the workflow file, allowing the import urls to be resolved using typical file resolution semantics.

This issue should probably be addressed in the next major TRS revision.

In lieu of that, here are some possible solutions that would help the engines to correctly run a workflow referenced by a "bare" TRS primary descriptor URL:

┆Issue is synchronized with this Jira Story ┆Project Name: Zzz-ARCHIVE GA4GH tool-registry-service ┆Issue Number: TRS-70

uniqueg commented 6 months ago

I agree that the way the TRS spec references the relative location of accessory workflow files with respect to the primary descriptor is anywhere from confusing to limiting (see below). However, the issue you describe is probably mostly a server implementation issue (at Dockstore) and should probably be raised there (see TRS URLs section below).

On another note, I would suggest that clients (Cromwell, miniwdl etc.) support TRS URIs for user convenience (see TRS URIs section below).

TRS URLs

I agree with you that servers should properly resolve relative paths in TRS URLs of the form /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}, as per the specs.

Limitation

However, note that parent directories relative to the primary descriptor are not supported. From the relevant part of the specification:

A relative path to the additional file (same directory or subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from the same directory as the main descriptor

So technically that means that only workflows are supported that follow a directory structure where the primary descriptor (main workflow file) is in the top level directory relative to all secondary descriptors (imported subworkflows/modules).

Either way, I guess Dockstore should probably respond with a 400 error when faced with illegal paths, not just do potentially unsafe path operations that don't even have a chance to ever result in anything useful.

@denis-yuen

Summary

Not being able to reference files in parent directories relative to the primary descriptor is indeed a limitation that I feel needs to be addressed in the TRS specs and could/should be further discussed here.

I don't know or remember what speaks against supporting parent directories (perhaps a security "feature"?), but it seems to me that given the current specs and a fully compliant TRS implementation, you are out of luck when there are secondary descriptors in a parent or sibling directory directory relative to the primary descriptor (like in your use case, @svonworl).

TRS URIs

Usage

Current limitations

Summary

TRS URIs make it easier for users to share TRS resources and start tool/workflow runs in supporting clients.

denis-yuen commented 6 months ago

However, note that parent directories relative to the primary descriptor are not supported. From the relevant part of the specification:

A relative path to the additional file (same directory or subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from the same directory as the main descriptor

So technically that means that only workflows are supported that follow a directory structure where the primary descriptor (main workflow file) is in the top level directory relative to all secondary descriptors (imported subworkflows/modules).

Either way, I guess Dockstore should probably respond with a 400 error when faced with illegal paths, not just do potentially unsafe path operations that don't even have a chance to ever result in anything useful.

Ah interesting. We actually 400 or 404, so there's no security issue. Good catch though, I totally forgot about that limitation in there!

Not being able to reference files in parent directories relative to the primary descriptor is indeed a limitation that I feel needs to be addressed in the TRS specs and could/should be further discussed here.

So far, we've been downloading the workflows in zip format from the files endpoint.

TRS URIs make it easier for users to share TRS resources and start tool/workflow runs in supporting clients.

:+1:

uniqueg commented 6 months ago

So far, we've been downloading the workflows in zip format from the files endpoint.

Yes, you can do that. However, information about each file, including its path, should still be available via GET .../files when you request an application/json response. And according to the description of this operation

description: Get a list of objects that contain the relative path and file type.
  The descriptors are intended for use with the
  /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}
  endpoint. Returns a zip file of all files when format=zip is specified.

this information is

intended for use with the /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}

which, as described above, prohibits references to files that are not located in the primary descriptor's directory or a child directory thereof.

For reference, the GET .../files operation with application/json is supposed to return an instance of ToolFile:

ToolFile:
  type: object
  properties:
    path:
      type: string
      description: Relative path of the file.  A descriptor's path can be used with
        the GA4GH .../{type}/descriptor/{relative_path} endpoint.
    file_type:
      type: string
      enum:
        - TEST_FILE
        - PRIMARY_DESCRIPTOR
        - SECONDARY_DESCRIPTOR
        - CONTAINERFILE
        - OTHER
    checksum:
      $ref: "#/components/schemas/Checksum"

So while nothing is stopping an implementation from returning a ZIP archive containing whatever directory structure a workflow requires when clients call GET .../files with application/zip, I don't see how you can, in a spec-compliant way, allow clients to self-assemble the work directory themselves via ToolFile.path (from GET .../files with application/json) and FileWrapper (from GET .../descriptor/{relative_path}) if the workflow resource includes files that are not in the top-level directory relative to the location of the primary descriptor.

To maintain consistency with GET .../files with application/zip, implementers would have to ignore the limitation and put non-compliant values for ToolFile.path in such cases; and then support them via GET .../descriptor/{relative_path}, which is tricky, because clients often/typically resolve paths before requests are even sent out, meaning that clients would need to be instructed to URL-encode paths containing references to parent directories (i.e., those containing ..).

To address this issue without having to ever deal with parent directory referenes, perhaps we could require resources to use an implicit workflow top-level directory (e.g., a Git repository root directory) as an anchor when constructing relative paths for workflow files to be made available via ToolFile.path. Clients can then infer the required top-level directory for the file objects they are interested in and construct an appropriate directory structure for their use case.

This approach would have the added benefit of us only needing to change descriptions, and so perhaps we could get away with just sneaking in such a change in for a future minor version bump of the TRS specs.

adamnovak commented 1 month ago

I'm running into this problem trying to add support for TRS lookups to the Toil workflow runner.

Since Toil already knows how to read and run a CWL or WDL workflow from a plain HTTP-hosted directory tree, I want to query TRS and somehow get from it a URL to the main workflow descriptor file in its appropriate directory context.

Dockstore kind of supports this already; the TRS IDs on its workflow pages link to URLs like https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fvgteam%2Fvg_wdl%2FGiraffe/versions/giraffedv/PLAIN_WDL/descriptor//workflows/giraffe.wdl. Note the extra leading / in the TRS "relative" path. When the workflow references ../tasks/sometask.wdl, normal hypertext URL resolution rules can be applied, and the machinery to download the workflow source doesn't actually need to speak TRS specifically at all.

But this seems to be a Dockstore extension, and TRS doesn't itself provide a way to get that /workflows/giraffe.wdl absolute path from the common parent directory to the primary descriptor, since it doesn't anticipate you ever needing it.

If the API is up for redesign here, I would favor designs that look as much like plain HTTP as possible, over designs that officially support fetching files in higher-level directories but require a sort of TRS-specific virtual filesystem implementation in the client.