Open uniqueg opened 4 years ago
We would probably prefer requiring percent encoding since we do use/permit forward slashes in the Dockstore TRS implementation. But otherwise, makes a lot of sense to me.
Thanks @denis-yuen, so would we.
No changes in the specs required, at least not urgently. In a next minor version bump one could think of adding a trs_uri
property to the tool and tool version schemas (unversioned and versioned, respectively) for easy reference (sorta like the self anchor URLs) and add a description that mentions that tool and version IDs should be percent-encoded (or at least the forward slashes).
As a first step though, I would propose to just change the TRS URI description in the Data Model docs. If others are okay with the proposal, I'd be happy to file a PR for that. That is, if someone could tell me where these docs are located. Didn't find them in the TRS spec repo...
The docs are in a different branch https://github.com/ga4gh/tool-registry-service-schemas/blob/gh-pages/2.0.0-Data-Model.md of the repo, based on gh-pages
Was thinking a little bit and figured that if we don't want any characters in tool/version identifiers to be restricted, as discussed (for backward compatibility with, e.g., Dockstore), we would need to prescribe that TRS URIs MUST be URL-encoded if they contain a certain set of special characters (specifically the forward slash and any special characters used in URL-encoding). We would also need to think about the implications of that, specifically with regard to TRS clients and services in general.
Two options:
First option is easier (no changes to specs required), second option is potentially more consistent and less ambiguous for certain use cases. E.g., what if the tool is called abcde/versions/12345
- how to make sure this works as a tool idea in calls to the GET /tools/{id}/versions/{version_id}
and descendant endpoints? But it needs changes in existing implementations and the specs themselves.
To highlight the problem:
my%2Ftool
my%252Ftool
my/tool
Now, if a client gets the ID 1., URL-encodes it (to result in ID 2.), then sends a request to a TRS endpoint with the encoded identifier, but the implementation uses it raw, without decoding, it either would return a 404
or the wrong tool (or version). Conversely, if the client would not encode it and use ID 1. raw for its calls to TRS, but the TRS would assume it is URL-encoded and decodes it, it would look for ID 3., which - again - would result in either a 404
or a wrong tool (or version).
For DRS this is addressed here:
Each implementation of DRS can choose its own id scheme, as long as it follows these guidelines:
- DRS IDs are strings made up of uppercase and lowercase letters, decimal digits, hypen, period, underscore and tilde [A-Za-z0-9.-_~]. See RFC 3986 § 2.3.
- DRS IDs can contain other characters, but they MUST be encoded into valid DRS IDs whenever they are used in API calls. This is because non-encoded IDs may interfere with the interpretation of the objects/{id}/access endpoint. To overcome this limitation use percent-encoding of the ID, see RFC 3986 § 2.4
- One DRS ID MUST always return the same object data (or, in the case of a collection, the same set of objects). This constraint aids with reproducibility.
- DRS implementations MAY have more than one ID that maps to the same object.
- DRS version 1.x does NOT support semantics around multiple versions of an object. (For example, there’s no notion of “get latest version” or “list all versions”.) Individual implementations MAY choose an ID scheme that includes version hints.
So this would rather suggest solution 2., even if it's rather implied than specifically stated (it appears that service implementations MUST assume IDs are URL-encoded)...
Would like to get some feedback on this before filing a PR.
Solution 2 (assume that identifiers are encoded for calls) sounds reasonable and I think is what we implemented (I'll need to double-check, but I'm pretty sure).
I'm not entirely sure I understand solution 1 actually. If only clients are doing encoding and decoding when needed, wouldn't we run into two characters mapping to one? (e.g. the server needs to understand that my%2Ftool
could be either my%2Ftool
or my/tool
)
I guess I meant the difference is that either the entire TRS URI is encoded (version 1) OR tool ID and version ID are encoded individually, then the corresponding TRS URI is constructed of the two parts according to trs://<server>/<encoded-id>/<encoded-version-id>
(version 2). Guess we will need version 2.
I will prepare a PR in the docs.
FWIW, we percent encode the ID prior to submitting a request to the actual trs endpoint. The choice to allow all characters in the ID has posed interesting challenges for us specifically sending requests or routing using dockstore TRS identifiers. We have had to disable certain firewall rules in our servers to allow for percent encoded urls to be embedded directly in the path. IMO ids should never be allowed to include things like /
or %2F
in them, as this is a security risk opening you up to potential path traversal attacks.
Good point and I agree with that. The reason why I did not exclude these from my proposal is simply that I wanted to remain consistent with how DRS URIs are defined. But I guess consistency shouldn't be worth more than security and ease of use.
Would you be willing to update the PR, @patmagee? Or is there anyone speaking out for the necessity for having all ASCII characters available for TRS (version) IDs?
We would probably prefer requiring percent encoding since we do use/permit forward slashes in the Dockstore TRS implementation. But otherwise, makes a lot of sense to me.
Hmmm, we still use /
but percent encoding it makes sense to us.
FWIW, we do have code to sanitize ids to prevent path traversal attacks, but I don't believe we needed to do anything special from a firewall POV, although I can double-check.
@denis-yuen OWASP has a pretty good description of the attack here. While we likely would not be subject to it, clients using WAF
s or other network protection tools may prevent a request with a percent encoded slash in the path to get past it's firewall. Its unclear to me whether this would be the case since it depends on their own rule set.
By Default spring has this behavaiour disabled, with the following motivation:
/**
* <p>
* Determines if a slash "/" that is URL encoded "%2F" should be allowed in the path
* or not. The default is to not allow this behavior because it is a common way to
* bypass URL based security.
* </p>
* <p>
* For example, due to ambiguities in the servlet specification, the value is not
* parsed consistently which results in different values in {@code HttpServletRequest}
* path related values which allow bypassing certain security constraints.
* </p>
*
* @param allowUrlEncodedSlash should a slash "/" that is URL encoded "%2F" be allowed
* in the path or not. Default is false.
*/
public void setAllowUrlEncodedSlash(boolean allowUrlEncodedSlash) {
if (allowUrlEncodedSlash) {
urlBlacklistsRemoveAll(FORBIDDEN_FORWARDSLASH);
} else {
urlBlacklistsAddAll(FORBIDDEN_FORWARDSLASH);
}
}
Yeah, I wouldn't object to a recommendation to avoid but at this juncture, I don't think we're likely to change either. I think we use a scanner to locate and change likely occurrences https://lgtm.com/rules/5970070/
Description
While we appreciate and strongly support the definition of TRS URIs, we think they will be much more useful if they'd contain the version identifier as well.
Some use cases:
fedora/httpd:version1.0
ormyregistryhost:5000/fedora/httpd:version1.0
) for each of my workflow steps will allow compute backends (e.g., Workflow Execution Service (WES) and/or Task Execution Service (TES) implementations) with TRS clients to choose their supported/favorite container technology from the available options.Proof of Concept
Within the ELIXIR Cloud & AAI ecosystem, we are already making experimental use of versioned TRS URIs of the form
trs://<server>/<id>/versions/<version-id>
to leverage the use cases described above:dockerPull
directives of a CWL demo workflow. The workflow engine we are using (cwl-tes, based on CWL's reference runner cwltool), does not complain about the versioned TRS URIs and neatly puts them in thetesTask.executors.$.image
fields of the TES requests it generates. Upon receiving such a request, our TES implementation TESK then calls the TRS instance encoded in the versioned TRS URI (BioContainers in our case) to resolve the actual container technology-specific image name (specifically, it uses the first Docker image it finds for that tool version), pulls that image from the actual container registry and executes the task.Proposed specification
While we have used TRS URIs of the form
trs://<server>/<id>/versions/<version-id>
for our PoC implementations, we would propose a more concise notation of the form:To ensure that versioned TRS URIs can be unambiguously parsed by clients, any
/
characters potentially occurring in tool and tool version identifiers would need to be percent-/URL-encoded before sharing a versioned TRS URI.Alternatively, but requiring a change that would potentially break existing TRS implementations, forward slashes could be explicitly forbidden in tool and version identifiers.
┆Issue is synchronized with this Jira Story ┆containerName: GA4GH tool-registry-service ┆Issue Number: TRS-48