SWI-Prolog / packages-clib

Assorted external libraries: processes, sockets, MIME, CGI, etc.

Do not unnecessarily encode colons in URIs/IRIs #14

Open wouterbeek opened 7 years ago

wouterbeek commented 7 years ago

The URI library currently encodes colons in both the path and the query component.

Colons in query components

In Semantic Web services it is very common to include IRIs in the query component, e.g., to indicate a selection or query. uri_query_components/2 encodes colons in the query component, even though this is not necessary. In the following example, %3A should simply be :. The # is legitimately encoded as %23, because it would otherwise be confused with the fragment component separator.

uri_query_components(Query, [predicate('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')]).
Query = 'predicate=http%3A//www.w3.org/1999/02/22-rdf-syntax-ns%23type'.
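As a rough illustration of the minimal escaping asked for here, the sketch below percent-encodes a query value while leaving `:`, `/`, `?` and `@` alone. The predicate names (`minimal_query_value/2`, `keep_code/1`, `encode_code/2`) are hypothetical and not part of library(uri). It assumes ASCII input only (the predicate fails on other characters) and conservatively escapes `&`, `=`, `+`, `;`, `#` and `%`.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% keep_code(+Code): Code may appear unescaped in a query value.
% ASCII letters, digits, a few marks, plus ":" "/" "?" "@".
keep_code(C) :-
    (   between(0'a, 0'z, C)
    ;   between(0'A, 0'Z, C)
    ;   between(0'0, 0'9, C)
    ),
    !.
keep_code(C) :-
    string_codes("-._~:/?@!$'()*,", Keep),
    memberchk(C, Keep).

% encode_code(+Code, -Atom): Atom is the character itself or its %XX form.
encode_code(C, A) :-
    keep_code(C),
    !,
    char_code(A, C).
encode_code(C, A) :-
    C < 128,                              % assumption: ASCII only
    Hi is C >> 4,
    Lo is C /\ 15,
    sub_atom('0123456789ABCDEF', Hi, 1, _, H),
    sub_atom('0123456789ABCDEF', Lo, 1, _, L),
    atomic_list_concat(['%', H, L], A).

% minimal_query_value(+Value, -Encoded)
minimal_query_value(Value, Encoded) :-
    atom_codes(Value, Codes),
    maplist(encode_code, Codes, Parts),
    atomic_list_concat(Parts, Encoded).
```

With this sketch, `minimal_query_value('http://www.w3.org/1999/02/22-rdf-syntax-ns#type', V)` keeps the colon and the slashes and escapes only the `#`.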

Colons in path components

Colons are not very common in IRI paths, but some datasets (e.g., DBpedia) do use them. iri_normalized/2 unnecessarily encodes colons in paths, e.g., translating [1] to [2].

[1]   'http://dbpedia.org/resource/Category:Politics'
[2]   'http://dbpedia.org/resource/Category%3APolitics'

Reference

path = path-abempty    ; begins with "/" or is empty
     / path-absolute   ; begins with "/" but not "//"
     / path-noscheme   ; begins with a non-colon segment
     / path-rootless   ; begins with a segment
     / path-empty      ; zero characters
path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>
segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
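To make the grammar above concrete, here is a small DCG sketch of the segment productions. It is only an illustration: the non-terminal and predicate names (`segment_nz//0`, `segment_nz_nc//0`, `valid_segment/2`) are made up for this example, and it covers the ASCII (URI) case only, not the wider IRI character ranges.

```prolog
:- use_module(library(lists)).

% Character classes, ASCII only.
unreserved(C) :-
    (   between(0'a, 0'z, C)
    ;   between(0'A, 0'Z, C)
    ;   between(0'0, 0'9, C)
    ),
    !.
unreserved(C) :-
    string_codes("-._~", Cs),
    memberchk(C, Cs).

sub_delim(C) :-
    string_codes("!$&'()*+,;=", Cs),
    memberchk(C, Cs).

hexdig(C) :-
    code_type(C, xdigit(_)).

% pchar_nc: pchar without ":" (the alternatives of segment-nz-nc)
pchar_nc --> [C], { unreserved(C) ; sub_delim(C) ; C == 0'@ }.
pchar_nc --> "%", [H1, H2], { hexdig(H1), hexdig(H2) }.

% pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pchar --> pchar_nc.
pchar --> ":".

segment --> [].
segment --> pchar, segment.

segment_nz --> pchar, segment.

segment_nc --> [].
segment_nc --> pchar_nc, segment_nc.

segment_nz_nc --> pchar_nc, segment_nc.

% valid_segment(+NonTerminal, +Atom): Atom matches the named production.
valid_segment(NonTerminal, Atom) :-
    atom_codes(Atom, Codes),
    once(phrase(NonTerminal, Codes)).
```

Under these definitions `valid_segment(segment_nz, 'Category:Politics')` succeeds, while `valid_segment(segment_nz_nc, 'Category:Politics')` fails: the colon is legal in any path segment except the first segment of a path-noscheme.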

JanWielemaker commented 7 years ago

For query components you are probably right. For path components there is a problem: a relative URI can be mistaken for a fully qualified URI. That is what Samer discovered, and it is what caused the current behaviour (older versions did not escape :). Some git blame and a search of the mailing list will probably find the discussion. This seems consistent with JavaScript's encodeURIComponent(), which also escapes :.

I guess you want a canonical, minimally escaped URI? That is a different task that could be implemented in uri_normalized/2 (which now escapes : as it shares the code). Note that using a : in a segment is allowed, but complicates the translation of an absolute URI into a relative one.

I surely wouldn't call this a bug ...

wouterbeek commented 7 years ago

The use of an unescaped colon is actually not ambiguous. RFC 3986 took this into account:

A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.

(I did not know this last year, otherwise I would have given this pointer earlier.)
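The rule quoted from RFC 3986 could be applied when emitting a relative-path reference, roughly as in the sketch below. `safe_relative_path/2` is a hypothetical helper, not an existing library(uri) predicate, and it assumes its input is already a relative path as an atom, without query or fragment.

```prolog
% safe_relative_path(+Path, -Ref)
% If the first segment of a relative path contains a colon, prefix "./"
% so the segment cannot be mistaken for a scheme name.
safe_relative_path(Path, Ref) :-
    atomic_list_concat([First|_], '/', Path),   % split on "/"
    First \== '',                               % not an absolute path
    sub_atom(First, _, 1, _, ':'),              % first segment has a colon
    !,
    atom_concat('./', Path, Ref).
safe_relative_path(Path, Path).
```

For example, `safe_relative_path('this:that', R)` gives `R = './this:that'`, while `'a/this:that'` and `'/this:that'` are returned unchanged because the colon is not in the first segment of a relative path.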

JanWielemaker commented 7 years ago

Interesting. This probably does require a different set of URI encoding primitives than what is current practice, though. Notably, we need not only something to encode, but also something to create a relative URI. But who is going to call that, and where?