IIIF / api

Source for API and model specifications documents (api and model)
http://iiif.io/api
105 stars 54 forks source link

URI encoding in Content State #2292

Open azaroth42 opened 2 months ago

azaroth42 commented 2 months ago

The current 1.0 spec is non-deterministic about embedded URI decoding in the plain URI case.

The encoding requirement is:

When the content state is a plain URI, rather than a JSON object, it MUST NOT be content-state-encoded.

However it doesn't give an algorithm as how to interpret the URIs embedded in the query parameter in terms of URI encoding.

The previous decision for the proposed spec was to encode them (cf discovery#90) and then was changed to URI encode them after TRC feedback, but that URI encoding requirement never seems to have made it into the spec.

Currently I believe it is non-deterministic as to how a client should interpret the payload: iiif-content=https://example.org/manifest/%2550 -- is it %50 (1 encoding) or P (2 encodings) or %2550 (0 encodings) on the end?

zimeon commented 2 months ago

I don't think there is any uncertainty about number of encoding/decoding actions. Per https://datatracker.ietf.org/doc/html/rfc3986#section-2.4 encoding SHOULD be done once when the URI is assembled from its component parts, and decoding MUST be done once when the URI is disassembled into its component parts (see also https://datatracker.ietf.org/doc/html/rfc3986#section-7.3). Attempts to do encoding/decoding separately from the URI assembly/disassembly steps will, in the general case, lead to errors.

Let's work though an example: If I build a URI from host1, path1 and query1 (leaving out fragment, assuming https scheme), I create a URI1 = https://ENC(host1)/ENC(path1)?ENC(query1) where ENC is my encoding function (omitting for now that ENC may be different for different components because they have different allowed characters).

Now say that I then build another URI that references URI1 in its query, with components host2, path2 and query iiif-content=URI1. I use the same procedure and get UR2 = https://ENC(host2)/ENC(path2)?iiif-content=ENC(https://ENC(host1)/ENC(path1)?ENC(query1)) -- there is double encoding of the embedded URI1 components and there must be double decoding when we reverse the two assembly processes.

Thus, to take Rob's example, if I am given a Content State URL:

https://example.com/viewer?iiif-content=https://example.org/manifest/%2550

this means (decode as we split) https://example.com/viewer (no change on decoding) should open https://example.org/manifest/%50 (%25 -> % on decode) which will then be interpreted as https//example.org/manifest/P (%50 -> P on decode) when split into its components.

(In this case we could just as well have used https://example.com/viewer?iiif-content=https://example.org/manifest/P or https://example.com/viewer?iiif-content=https://example.org/manifest/%50 because the P (like the / and . in the manifest URI) didn't need encoding)

zimeon commented 2 months ago

To put the above another way: We don't get a choice of specifying when something should or should not be URI encoded, it is already part of the URI spec (albeit with great latitude allowing much more than the bare minimum to be encoded if one chooses). We can add helpful reminders though.

azaroth42 commented 2 months ago

To work a more normal case (I chose %2550 as an easy one to read): a manifest and viewer that themselves have query params.

So, the manifest's URI is: https://example.org/builder?id=1&fmt=iiif The viewer's URI is https://example.com/viewer?fmt=iiif

We construct the parameter first: https://ENC(example.org)/ENC(builder)?ENC(id=1&fmt=iiif) --> https://example.org/builder?id=1%26fmt=iiif

(the & is part of the query part of the http URI syntax, so is encoded by the ? is part of the definition, and so is not, per https://datatracker.ietf.org/doc/html/rfc3986#section-3)

Then we add the query param iiif-content to the viewer URI to get https://example.com/viewer?fmt=iiif&iiif-content=

We then add the encoded param: https://example.com/viewer?fmt=iiif&iiif-content=https://example.org/builder?id=1%26fmt=iiif

And encode:

https://ENC(example.com)/ENC(viewer)?ENC(fmt=iiif&iiif-content=https://example.org/builder?id=1%26fmt=iiif)

with a result of

https://example.com/viewer?fmt=iiif%26iiif-content=https://example.org/builder%3Fid=1%2526fmt=iiif

Assuming that =, : and / are all safe, but ? and & are not.

If that's correct, and even if we're just following the rules, I think having that spelled out in no uncertain terms is necessary for folks to end up at the same conclusion.

zimeon commented 2 months ago

@azaroth42 - I agree with the above except the encoding of & as %26 when building the manifest URI. I was sloppy when I suggested ENC(query) above, sorry!

This is not part of RFC3986 URI syntax requirements -- they generally would allow & to be unencoded and usually encoded would also be OK (I copied the relevant BNF syntax rules in https://github.com/IIIF/trc/issues/125#issuecomment-2075872243, & is in sub-delims).

However, the key1=value1&key2=value2 format for the query makes a distinction between & and = that might be in keys and values, and instances where they are part of the structure. I think this is defined by https://url.spec.whatwg.org/#application/x-www-form-urlencoded, that is what Python3's urllib.parse references). The decoding algorithm has: split on & then for each chunk split on first = and then decode/unquote the resulting key and value strings.

>>> import urllib
>>> urllib.parse.parse_qsl('id=1%26fmt=iiif')    # broken
[('id', '1&fmt=iiif')]
>>> urllib.parse.parse_qsl('id=1&fmt=iiif')      # good
[('id', '1'), ('fmt', 'iiif')]
>>> urllib.parse.parse_qsl('ke%3Dy1=val%3Due1&ke%26y2=val%26ue2&key3=second=unencoded') # complex!
[('ke=y1', 'val=ue1'), ('ke&y2', 'val&ue2'), ('key3', 'second=unencoded')]

To rerun your example:

>>> import urllib
>>> p = urllib.parse.urlsplit('https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif')
>>> p
SplitResult(scheme='https', netloc='example.com', path='/viewer', query='fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif', fragment='')
>>> urllib.parse.parse_qsl(p.query)
[('fmt', 'iiif'), ('iiif-content', 'https://example.com/builder?id&1=fmt&iiif')]

(for fully minimal encoding, which I don't recommend, the additional = in the manifest URI don't need to be encoded because the split is on the first occurrence...

>>> urllib.parse.parse_qsl(urllib.parse.urlsplit('https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id=1%3Dfmt=iiif').query)
[('fmt', 'iiif'), ('iiif-content', 'https://example.com/builder?id=1=fmt=iiif')]

)

azaroth42 commented 2 months ago

I'm convinced :) I think the above recipe should be published, preferably in the next version of the spec, and before then as a cookbook recipe.