Open svonworl opened 7 months ago
I agree that the way the TRS spec references the relative location of accessory workflow files with respect to the primary descriptor is anywhere from confusing to limiting (see below). However, the issue you describe is probably mostly a server implementation issue (at Dockstore) and should probably be raised there (see TRS URLs section below).
On another note, I would suggest that clients (Cromwell, miniwdl
etc.) support TRS URIs for user convenience (see TRS URIs section below).
I agree with you that servers should properly resolve relative paths in TRS URLs of the form /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}
, as per the specs.
However, note that parent directories relative to the primary descriptor are not supported. From the relevant part of the specification:
A relative path to the additional file (same directory or subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from the same directory as the main descriptor
So technically that means that only workflows are supported that follow a directory structure where the primary descriptor (main workflow file) is in the top level directory relative to all secondary descriptors (imported subworkflows/modules).
Either way, I guess Dockstore should probably respond with a 400
error when faced with illegal paths, not just do potentially unsafe path operations that don't even have a chance to ever result in anything useful.
@denis-yuen
Not being able to reference files in parent directories relative to the primary descriptor is indeed a limitation that I feel needs to be addressed in the TRS specs and could/should be further discussed here.
I don't know or remember what speaks against supporting parent directories (perhaps a security "feature"?), but it seems to me that given the current specs and a fully compliant TRS implementation, you are out of luck when there are secondary descriptors in a parent or sibling directory directory relative to the primary descriptor (like in your use case, @svonworl).
TRS URIs make it easier for users to share TRS resources and start tool/workflow runs in supporting clients.
However, note that parent directories relative to the primary descriptor are not supported. From the relevant part of the specification:
A relative path to the additional file (same directory or subdirectories), for example 'foo.cwl' would return a 'foo.cwl' from the same directory as the main descriptor
So technically that means that only workflows are supported that follow a directory structure where the primary descriptor (main workflow file) is in the top level directory relative to all secondary descriptors (imported subworkflows/modules).
Either way, I guess Dockstore should probably respond with a
400
error when faced with illegal paths, not just do potentially unsafe path operations that don't even have a chance to ever result in anything useful.
Ah interesting. We actually 400 or 404, so there's no security issue. Good catch though, I totally forgot about that limitation in there!
Not being able to reference files in parent directories relative to the primary descriptor is indeed a limitation that I feel needs to be addressed in the TRS specs and could/should be further discussed here.
So far, we've been downloading the workflows in zip format from the files endpoint.
TRS URIs make it easier for users to share TRS resources and start tool/workflow runs in supporting clients.
:+1:
So far, we've been downloading the workflows in zip format from the files endpoint.
Yes, you can do that. However, information about each file, including its path
, should still be available via GET .../files
when you request an application/json
response. And according to the description of this operation
description: Get a list of objects that contain the relative path and file type.
The descriptors are intended for use with the
/tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}
endpoint. Returns a zip file of all files when format=zip is specified.
this information is
intended for use with the /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}
which, as described above, prohibits references to files that are not located in the primary descriptor's directory or a child directory thereof.
For reference, the GET .../files
operation with application/json
is supposed to return an instance of ToolFile
:
ToolFile:
type: object
properties:
path:
type: string
description: Relative path of the file. A descriptor's path can be used with
the GA4GH .../{type}/descriptor/{relative_path} endpoint.
file_type:
type: string
enum:
- TEST_FILE
- PRIMARY_DESCRIPTOR
- SECONDARY_DESCRIPTOR
- CONTAINERFILE
- OTHER
checksum:
$ref: "#/components/schemas/Checksum"
So while nothing is stopping an implementation from returning a ZIP archive containing whatever directory structure a workflow requires when clients call GET .../files
with application/zip
, I don't see how you can, in a spec-compliant way, allow clients to self-assemble the work directory themselves via ToolFile.path
(from GET .../files
with application/json
) and FileWrapper
(from GET .../descriptor/{relative_path}
) if the workflow resource includes files that are not in the top-level directory relative to the location of the primary descriptor.
To maintain consistency with GET .../files
with application/zip
, implementers would have to ignore the limitation and put non-compliant values for ToolFile.path
in such cases; and then support them via GET .../descriptor/{relative_path}
, which is tricky, because clients often/typically resolve paths before requests are even sent out, meaning that clients would need to be instructed to URL-encode paths containing references to parent directories (i.e., those containing ..
).
To address this issue without having to ever deal with parent directory referenes, perhaps we could require resources to use an implicit workflow top-level directory (e.g., a Git repository root directory) as an anchor when constructing relative paths for workflow files to be made available via ToolFile.path
. Clients can then infer the required top-level directory for the file objects they are interested in and construct an appropriate directory structure for their use case.
This approach would have the added benefit of us only needing to change descriptions, and so perhaps we could get away with just sneaking in such a change in for a future minor version bump of the TRS specs.
I'm running into this problem trying to add support for TRS lookups to the Toil workflow runner.
Since Toil already knows how to read and run a CWL or WDL workflow from a plain HTTP-hosted directory tree, I want to query TRS and somehow get from it a URL to the main workflow descriptor file in its appropriate directory context.
Dockstore kind of supports this already; the TRS IDs on its workflow pages link to URLs like https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fvgteam%2Fvg_wdl%2FGiraffe/versions/giraffedv/PLAIN_WDL/descriptor//workflows/giraffe.wdl
. Note the extra leading /
in the TRS "relative" path. When the workflow references ../tasks/sometask.wdl
, normal hypertext URL resolution rules can be applied, and the machinery to download the workflow source doesn't actually need to speak TRS specifically at all.
But this seems to be a Dockstore extension, and TRS doesn't itself provide a way to get that /workflows/giraffe.wdl
absolute path from the common parent directory to the primary descriptor, since it doesn't anticipate you ever needing it.
If the API is up for redesign here, I would favor designs that look as much like plain HTTP as possible, over designs that officially support fetching files in higher-level directories but require a sort of TRS-specific virtual filesystem implementation in the client.
As detailed in https://github.com/dockstore/dockstore/issues/5594, some workflow engines, including miniwdl and Cromwell, fail to run a valid Dockstore workflow when:
..
).For example, the following invocation fails:
Turns out the problem is bigger - the workflow engines also fail to run workflows that import files with relative paths.
The root cause is that, when the engines calculate the URL of an import, they interpret the specified TRS URL as a file path. However, a TRS URL doesn't represent a file path, so the engines miscalculate the import URLs and fail when they attempt to load them.
For example, given the TRS URL
and an import referenced by a relative path
The engines calculate the import URL, by applying typical file resolution semantics, as:
The above URL is a corrupt TRS URL, because parts of the original TRS URL have been deleted. During the import URL calculation, the engine drops the trailing
descriptor
portion of the TRS URL because it looks like a filename, and when the engine normalizes the URL prior to the request, it collapses the parent directory references and more of the original TRS URL is deleted.Per the TRS spec, a relative path can be appended to the TRS primary descriptor URL, and it will resolve the file relative to the primary descriptor and return its contents. So, the correct URL is:
Note that when miniwdl is run with a URL that references the raw github files, it works as expected:
Why does it work? The github URL ends with the absolute path of the workflow file, allowing the import urls to be resolved using typical file resolution semantics.
This issue should probably be addressed in the next major TRS revision.
In lieu of that, here are some possible solutions that would help the engines to correctly run a workflow referenced by a "bare" TRS primary descriptor URL:
┆Issue is synchronized with this Jira Story ┆Project Name: Zzz-ARCHIVE GA4GH tool-registry-service ┆Issue Number: TRS-70