ga4gh / tool-registry-service-schemas

APIs for discovering genomics tools, their metadata and their containers
Apache License 2.0
30 stars 18 forks source link

Issue with imported files "below" or "cousins" of the primary descriptor #243

Open denis-yuen opened 1 year ago

denis-yuen commented 1 year ago

See discussion at https://github.com/dockstore/dockstore/issues/5594 We believe this is a general issue for TRS.

i.e. workflows with a structure like

workflows
  |- wgs
    | - main.wdl
tasks
  |- do-stuff.wdl
structs
  | - human
    | - sample.wdl

Affects "/tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}":

┆Issue is synchronized with this Jira Story ┆Project Name: Zzz-ARCHIVE GA4GH tool-registry-service ┆Issue Number: TRS-67

uniqueg commented 1 year ago

Yes, nasty problem. Apart from the fact that it's impossible to recreate the original directory tree of a workflow with multiple descriptors (or, more generally, files, as non-descriptor files are affected, too), allowing ".." may have security implications as well.

Might be hard to come up with a non-breaking fix here, though given that the relative path behavior seems to be only explained in the descriptions, we might be able to just get away with rephrasing these such that the anchoring for relative paths should be the nearest parent directory for all files (e.g., the repository root). It'd still break existing clients though...

But if we're discussing breaking changes, we could also re-design the whole file handling to address the following issues:

One possibility could be to do away with the /descriptor, /tests and /containerfile endpoints in favor of a redesigned /files endpoint that makes use of HATEOAS (see #160) and uses file IDs instead of relative paths to fetch info about (ToolFile) and/or contents of (FileWrapper) individual files. It might also make the API less complex.

For example, I could hit /files to get an object of file ID keys and ToolFile values, as well as a _links section that points me to /files/{file_id} for each file to access the corresponding FileWrapper (or plain text). The bahvior for zip=True on /files could remain the same (i.e., fetch an archive of all files). For convenience, we could consider adding filters for file types (PRIMARY_DESCRIPTOR, TEST_FILE, CONTAINERFILE etc.).

denis-yuen commented 1 year ago

One possibility could be to do away with the /descriptor, /tests and /containerfile endpoints in favor of a redesigned /files endpoint that makes use of HATEOAS (see #160) and uses file IDs instead of relative paths to fetch info about (ToolFile) and/or contents of (FileWrapper) individual files. It might also make the API less complex.

for the non-breaking change discussion

Still thinking it through, but one option that we have is to alter the behaviour of the zipped endpoint https://github.com/ga4gh/tool-registry-service-schemas/blob/v2.0.1/openapi/openapi.yaml#L322 so that as you say, the root of the zip is the "nearest parent directory for all files" rather than the the directory for main.wdl The reason why this is less(?) of a breaking change is because our current implementation is simply broken for this case (i.e. one could argue that we're not making the situation worse at least) The client would need to do some comparison between the zip format and the json format of the same endpoint to locate main.wdl

for the breaking change discussion

A few new endpoints does seem less disruptive to backwards compatibility, thoughts @patmagee ?

uniqueg commented 1 year ago

I also think that the minor/non-breaking changes are acceptable, given that there probably aren't that many TRS services and clients around.

As for a re-design: Having a new /files endpoint (possibly without /{type} or similar endpoint with the described functionalities while possibly deprecating the ones that such an endpoint would make redundant would indeed be sensible, especially for such a drastic change.