ga4gh / tool-registry-service-schemas

APIs for discovering genomics tools, their metadata and their containers
Apache License 2.0
30 stars 18 forks source link

Add versioned TRS URI to documentation #164

Open uniqueg opened 4 years ago

uniqueg commented 4 years ago

Description

While we appreciate and strongly support the definition of TRS URIs, we think they will be much more useful if they'd contain the version identifier as well.

Some use cases:

  1. I have used a tool registered on TRS for my study and would like to reference it in my publication. Referencing a versioned (but not an unversioned!) TRS URI will allow others to locate and use the exact same version of the tool I have used in my analysis. Copying the versioned TRS URI into a suitable web app will ideally allow them to reproduce my analysis or reuse it on their own data with just a few clicks.
  2. I am developing a workflow composed of several steps of defined and reproducible behavior, and I have decided to containerize my workflow for increased portability and reproducibility (with the known limitations). To further increase my potential user base, I would like the workflow to run on as many compute infrastructures as possible. Therefore, I would like to remain agnostic of a particular container technology (e.g., Docker, Singularity), provided that containers created from different image types behave equivalently. Having verified that the different images registered for a given tool version in a TRS implementation indeed behave equivalently for my test cases, specifying versioned TRS URIs instead of container technology-specific image names (e.g., for Docker: fedora/httpd:version1.0 or myregistryhost:5000/fedora/httpd:version1.0) for each of my workflow steps will allow compute backends (e.g., Workflow Execution Service (WES) and/or Task Execution Service (TES) implementations) with TRS clients to choose their supported/favorite container technology from the available options.

Proof of Concept

Within the ELIXIR Cloud & AAI ecosystem, we are already making experimental use of versioned TRS URIs of the form trs://<server>/<id>/versions/<version-id> to leverage the use cases described above:

Only tried for CWL workflows, but should work readily for any descriptor type.

Only tried for CWL workflows with the cwl-tes workflow engine; may or may not work for other workflow languages/engines. Ideally, support for versioned TRS URIs would be explicitly specified/implemented in languages/engines in order to guarantee reliable and robust performance.

Proposed specification

While we have used TRS URIs of the form trs://<server>/<id>/versions/<version-id> for our PoC implementations, we would propose a more concise notation of the form:

trs://<server>/<id>/<version-id>

To ensure that versioned TRS URIs can be unambiguously parsed by clients, any / characters potentially occurring in tool and tool version identifiers would need to be percent-/URL-encoded before sharing a versioned TRS URI.

Alternatively, but requiring a change that would potentially break existing TRS implementations, forward slashes could be explicitly forbidden in tool and version identifiers.

┆Issue is synchronized with this Jira Story ┆containerName: GA4GH tool-registry-service ┆Issue Number: TRS-48

denis-yuen commented 4 years ago

We would probably prefer requiring percent encoding since we do use/permit forward slashes in the Dockstore TRS implementation. But otherwise, makes a lot of sense to me.

uniqueg commented 4 years ago

Thanks @denis-yuen, so would we.

No changes in the specs required, at least not urgently. In a next minor version bump one could think of adding a trs_uri property to the tool and tool version schemas (unversioned and versioned, respectively) for easy reference (sorta like the self anchor URLs) and add a description that mentions that tool and version IDs should be percent-encoded (or at least the forward slashes).

As a first step though, I would propose to just change the TRS URI description in the Data Model docs. If others are okay with the proposal, I'd be happy to file a PR for that. That is, if someone could tell me where these docs are located. Didn't find them in the TRS spec repo...

denis-yuen commented 3 years ago

The docs are in a different branch https://github.com/ga4gh/tool-registry-service-schemas/blob/gh-pages/2.0.0-Data-Model.md of the repo, based on gh-pages

uniqueg commented 3 years ago

Was thinking a little bit and figured that if we don't want any characters in tool/version identifiers to be restricted, as discussed (for backward compatibility with, e.g., Dockstore), we would need to prescribe that TRS URIs MUST be URL-encoded if they contain a certain set of special characters (specifically the forward slash and any special characters used in URL-encoding). We would also need to think about the implications of that, specifically with regard to TRS clients and services in general.

Two options:

  1. We can get away with clients dealing with TRS URIs doing URL decoding and then using the decoded tool/version identifiers in any calls to actual TRS endpoints.
  2. We need to prescribe that (a) a TRS implementation MUST assume that tool/version identifiers themselves, not only TRS URIs, are URL-encoded, and (b) TRS clients MUST encode tool/version identifiers in this way, if they contain ambiguous characters, for any call to a TRS service.

First option is easier (no changes to specs required), second option is potentially more consistent and less ambiguous for certain use cases. E.g., what if the tool is called abcde/versions/12345 - how to make sure this works as a tool idea in calls to the GET /tools/{id}/versions/{version_id} and descendant endpoints? But it needs changes in existing implementations and the specs themselves.

To highlight the problem:

  1. Assume that the actual identifier of tool X in a TRS instance is: my%2Ftool
  2. After URL-encoding, this becomes: my%252Ftool
  3. If decoded without prior URL-encoding it becomes: my/tool

Now, if a client gets the ID 1., URL-encodes it (to result in ID 2.), then sends a request to a TRS endpoint with the encoded identifier, but the implementation uses it raw, without decoding, it either would return a 404 or the wrong tool (or version). Conversely, if the client would not encode it and use ID 1. raw for its calls to TRS, but the TRS would assume it is URL-encoded and decodes it, it would look for ID 3., which - again - would result in either a 404 or a wrong tool (or version).

For DRS this is addressed here:

Each implementation of DRS can choose its own id scheme, as long as it follows these guidelines:

  • DRS IDs are strings made up of uppercase and lowercase letters, decimal digits, hypen, period, underscore and tilde [A-Za-z0-9.-_~]. See RFC 3986 § 2.3.
  • DRS IDs can contain other characters, but they MUST be encoded into valid DRS IDs whenever they are used in API calls. This is because non-encoded IDs may interfere with the interpretation of the objects/{id}/access endpoint. To overcome this limitation use percent-encoding of the ID, see RFC 3986 § 2.4
  • One DRS ID MUST always return the same object data (or, in the case of a collection, the same set of objects). This constraint aids with reproducibility.
  • DRS implementations MAY have more than one ID that maps to the same object.
  • DRS version 1.x does NOT support semantics around multiple versions of an object. (For example, there’s no notion of “get latest version” or “list all versions”.) Individual implementations MAY choose an ID scheme that includes version hints.

So this would rather suggest solution 2., even if it's rather implied than specifically stated (it appears that service implementations MUST assume IDs are URL-encoded)...

Would like to get some feedback on this before filing a PR.

denis-yuen commented 3 years ago

Solution 2 (assume that identifiers are encoded for calls) sounds reasonable and I think is what we implemented (I'll need to double-check, but I'm pretty sure).

I'm not entirely sure I understand solution 1 actually. If only clients are doing encoding and decoding when needed, wouldn't we run into two characters mapping to one? (e.g. the server needs to understand that my%2Ftool could be either my%2Ftool or my/tool)

uniqueg commented 3 years ago

I guess I meant the difference is that either the entire TRS URI is encoded (version 1) OR tool ID and version ID are encoded individually, then the corresponding TRS URI is constructed of the two parts according to trs://<server>/<encoded-id>/<encoded-version-id> (version 2). Guess we will need version 2.

I will prepare a PR in the docs.

patmagee commented 2 years ago

FWIW, we percent encode the ID prior to submitting a request to the actual trs endpoint. The choice to allow all characters in the ID has posed interesting challenges for us specifically sending requests or routing using dockstore TRS identifiers. We have had to disable certain firewall rules in our servers to allow for percent encoded urls to be embedded directly in the path. IMO ids should never be allowed to include things like / or %2F in them, as this is a security risk opening you up to potential path traversal attacks.

uniqueg commented 2 years ago

Good point and I agree with that. The reason why I did not exclude these from my proposal is simply that I wanted to remain consistent with how DRS URIs are defined. But I guess consistency shouldn't be worth more than security and ease of use.

Would you be willing to update the PR, @patmagee? Or is there anyone speaking out for the necessity for having all ASCII characters available for TRS (version) IDs?

denis-yuen commented 2 years ago

We would probably prefer requiring percent encoding since we do use/permit forward slashes in the Dockstore TRS implementation. But otherwise, makes a lot of sense to me.

Hmmm, we still use / but percent encoding it makes sense to us. FWIW, we do have code to sanitize ids to prevent path traversal attacks, but I don't believe we needed to do anything special from a firewall POV, although I can double-check.

patmagee commented 2 years ago

@denis-yuen OWASP has a pretty good description of the attack here. While we likely would not be subject to it, clients using WAFs or other network protection tools may prevent a request with a percent encoded slash in the path to get past it's firewall. Its unclear to me whether this would be the case since it depends on their own rule set.

By Default spring has this behavaiour disabled, with the following motivation:

/**
     * <p>
     * Determines if a slash "/" that is URL encoded "%2F" should be allowed in the path
     * or not. The default is to not allow this behavior because it is a common way to
     * bypass URL based security.
     * </p>
     * <p>
     * For example, due to ambiguities in the servlet specification, the value is not
     * parsed consistently which results in different values in {@code HttpServletRequest}
     * path related values which allow bypassing certain security constraints.
     * </p>
     *
     * @param allowUrlEncodedSlash should a slash "/" that is URL encoded "%2F" be allowed
     * in the path or not. Default is false.
     */
    public void setAllowUrlEncodedSlash(boolean allowUrlEncodedSlash) {
        if (allowUrlEncodedSlash) {
            urlBlacklistsRemoveAll(FORBIDDEN_FORWARDSLASH);
        } else {
            urlBlacklistsAddAll(FORBIDDEN_FORWARDSLASH);
        }
    }
denis-yuen commented 2 years ago

Yeah, I wouldn't object to a recommendation to avoid but at this juncture, I don't think we're likely to change either. I think we use a scanner to locate and change likely occurrences https://lgtm.com/rules/5970070/