drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License

To what extent are IRIs parsed? #14

Closed wouterbeek closed 6 years ago

wouterbeek commented 6 years ago

Serd currently accepts IRI terms that do not follow the RFC 3987 grammar. E.g., the following input data contains a subject term that is not a valid IRI:

<_:x> <x:x> <x:x> .

The Turtle grammar seems to allow for almost arbitrary strings in the IRI production:

'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>' 

However, this is not the entire story, because the Turtle standard also requires that relative IRIs be made absolute relative to a base URI. So a Turtle parser should at least be able to distinguish relative from absolute IRIs, which implies that it should at least implement the IRI grammar to some extent.
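To make the point concrete, here is a minimal sketch of the IRIREF production quoted above as a Python regular expression. The names `IRIREF` and `is_iriref` are made up for this illustration and have nothing to do with Serd's API; the UCHAR alternatives follow the Turtle grammar (`\u` plus four hex digits, `\U` plus eight).

```python
import re

# Turtle IRIREF: '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
# UCHAR:         '\u' HEX{4} | '\U' HEX{8}
# Hypothetical helper names, for illustration only.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>\Z'
)


def is_iriref(token):
    """Return True if token matches the Turtle IRIREF terminal."""
    return IRIREF.match(token) is not None


if __name__ == "__main__":
    # The production only excludes a handful of characters, so "_:x" passes.
    for token in ["<_:x>", "<x:x>", "<http://example.org/a b>"]:
        print(token, is_iriref(token))
```

The terminal alone accepts `<_:x>`; anything stricter has to come from IRI parsing done after tokenization.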

So my question is: to what extent does Serd intend to parse IRIs? Is it feasible to require full RFC 3987 compliance, or should we aim for compliance WRT a well-defined subset of that grammar?

drobilla commented 6 years ago

Serd does contain an implementation of URI parsing, but how/when/if it is used depends on the options, and it is somewhat lax. This one is interpreted as the relative URI "_:x" (which I think is still correct, if strange; nearly everything that doesn't contain outright invalid unescaped characters is a syntactically valid URI as far as I can tell, just usually not what you expect).

If you run serdi with a base URI you can see this:

$ cat test3.ttl 
@prefix eg: <http://example.org/> .
<_:x> <eg:x> <eg:x> .

$ serdi ./test3.ttl http://base.org/
<http://base.org/_:x> <eg:x> <eg:x> .

A hand-wavey short/vague answer to your question would be something like "it does when it needs to, but also defers to what the input says when in doubt".

Not sure what should happen here if no explicit base URI is given. I do find the pass-through useful and tools that aggressively mangle URIs can be annoying at times (e.g. in this case you'll end up with an absolute path to the input file in your output, which is almost certainly not what you want). This functionality is really useful when building pipelines, for example (you can cat files through serdi and get what you expect, whereas the strict behaviour would require some fabricated base URI which might end up in your output by mistake). Maybe strict mode could/should aggressively resolve URIs, or another option could/should control this? (this has some cost in throughput, by the way, but not much)

drobilla commented 6 years ago

On second glance I don't think it's valid according to RFC3987 (IRI). This parsing is due to the grammar for RFC3986 (URI) being what's actually implemented.

wouterbeek commented 6 years ago

I agree that _:x is neither an absolute nor a relative IRI. Ideally speaking, this would result in a syntax error (at least in strict mode). When I add a base URI to the example:

base <b:b>
<_:x> <x:x> <x:x> .

I get different outcomes depending on the parser I use:

<b:_:x> <x:x> <x:x> .  # Serd
<b:b_:x> <x:x> <x:x> . # N3

Serd and N3 seem to perform relative resolution in different ways, which is understandable if they have different partial implementations of the IRI grammar.
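For comparison, here is a rough sketch of reference resolution as described in RFC 3986, section 5.2 (the URI algorithm, not a validator for the IRI grammar). It is not taken from Serd or N3; `split_ref`, `remove_dot_segments` and `resolve` are names invented for this example. One assumption worth flagging: a prefix before the first colon is only treated as a scheme if it is syntactically a scheme, so `_:x` is handled as a path.

```python
import re

# RFC 3986 scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")
# Authority, path, query, fragment (adapted from RFC 3986, Appendix B).
REST = re.compile(r"(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?\Z", re.S)


def split_ref(ref):
    """Split into (scheme, authority, path, query, fragment); None if absent."""
    scheme = None
    head, sep, tail = ref.partition(":")
    if sep and SCHEME.match(head):  # "_" is not a scheme, so "_:x" is a path
        scheme, ref = head, tail
    return (scheme,) + REST.match(ref).groups()


def remove_dot_segments(path):
    """RFC 3986, section 5.2.4, following the input/output buffer steps."""
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            end = path.find("/", 1)
            end = len(path) if end < 0 else end
            out.append(path[:end])
            path = path[end:]
    return "".join(out)


def resolve(base, ref):
    """Resolve ref against base (RFC 3986, section 5.2.2, non-strict parts omitted)."""
    bs, ba, bp, bq, _ = split_ref(base)
    s, a, p, q, f = split_ref(ref)
    if s is None:
        s = bs
        if a is None:
            a = ba
            if p == "":
                p, q = bp, (bq if q is None else q)
            elif not p.startswith("/"):
                # Merge: base path up to and including its last "/", then ref.
                if ba is not None and bp == "":
                    p = "/" + p
                else:
                    p = bp[: bp.rfind("/") + 1] + p
    if p is not bp:
        p = remove_dot_segments(p)
    out = "" if s is None else s + ":"
    out += "" if a is None else "//" + a
    out += p
    out += "" if q is None else "?" + q
    out += "" if f is None else "#" + f
    return out


if __name__ == "__main__":
    print(resolve("b:b", "_:x"))
    print(resolve("http://base.org/", "_:x"))
```

Under these assumptions the merge step drops the last segment of the base path (`b`) before appending the reference, which gives `b:_:x`, the same string Serd printed. Getting `b:b_:x` would require concatenating without dropping that segment.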

PS: In other parsers the behavior is different again; some parse the subject term as a blank node :-)

drobilla commented 6 years ago

Yeah, this aspect of Turtle is a bit weird, though in a sense I can see why: it would make sense for the grammar itself to be more general (indeed, I wish the blank node grammar was much more general).

Anyway, I think there's two core things here:

  1. Serd implements URIs, not IRIs, and doesn't reject some cases like this one as a result.
  2. Serd will happily pass through unresolved relative URIs as-is.

For 1 I agree, so I'll take a look at implementing the IRI grammar instead and see what consequences that has. If there's no trouble there, it can simply be a syntax error, which would fix this issue (with possible regressions if anything depends on the more lax URI grammar but that seems unlikely?)

For 2 (though I think this will be less of an issue when the above is fixed), I'm not sure. I guess the sensible behaviours boil down to two options: resolve and re-qualify all input URIs, or pass qnames through as-is (handy for pretty printing and syntax translation for example). Honestly I don't remember a concrete reason why I did this, it might just be a performance thing (fully parsing/resolving all URIs isn't free), but I suppose I'll try to implement the resolve-everything behaviour and see what crops up.

drobilla commented 6 years ago

Well, it's easy enough to raise an error when a colon is encountered in the path, but it turns out that IRI_with_all_punctuation.ttl in the Turtle test suite has a URI with a colon in the path (or maybe by definition the colon ends up in the fragment or something, but implementing backtracking to handle that sort of thing isn't really realistic).

I guess I could implement it so that things that start with a valid scheme are assumed to be valid IRIs and are not otherwise parsed...

drobilla commented 6 years ago

Ah, I didn't notice another (or perhaps the) issue here: the default output of serdi is NTriples, but the subject in the output above is not a valid IRI. If the output was Turtle, then it could be considered valid (depending on... philosophy), in which case it's interpreted as a strange relative IRI (with path "_:x"), but for NTriples it's definitely wrong.

wouterbeek commented 6 years ago

Under the assumption that I'm understanding your example correctly: _:x is not a relative IRI, because a colon is not allowed in the first path segment of a relative IRI. (In the RFC grammar this first segment is called isegment-nz-nc, where "nc" stands for "no colon".)
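The "no colon" rule can be checked without implementing the whole grammar. The sketch below is only an approximation: it looks at the scheme and the first path segment and does not validate which characters are allowed. The function name `classify` and its three labels are invented for this example.

```python
import re

# RFC 3986/3987 scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")


def classify(ref):
    """Return 'absolute', 'relative' or 'invalid' for an IRI reference.

    Hypothetical helper: 'absolute' here just means "starts with a
    syntactically valid scheme"; the rest of the reference is not checked.
    """
    head, sep, _ = ref.partition(":")
    if sep and SCHEME.match(head):
        return "absolute"
    # No valid scheme, so this can only be a relative reference.
    # Drop the query and fragment, then inspect the first path segment.
    path = re.split(r"[?#]", ref, maxsplit=1)[0]
    first_segment = path.split("/", 1)[0]
    if ":" in first_segment:
        return "invalid"  # violates (i)segment-nz-nc
    return "relative"


if __name__ == "__main__":
    for ref in ["x:x", "_:x", "./_:x", "a/b:c"]:
        print(ref, classify(ref))
```

Note that `./_:x` and `a/b:c` are fine, because the colon is no longer in the first segment; only a colon before the first slash is ruled out.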

drobilla commented 6 years ago

Right, but this is how the resolution algorithms from the spec would interpret it, even though the grammar in the IRI spec changes what is valid. Resolving and validating against the grammar are somewhat distinct issues/features.

You can see how always resolving against an absolute base URI works in https://github.com/drobilla/serd/tree/absolute-base-uri

This is consistent with what rapper does at least.

wouterbeek commented 6 years ago

@drobilla Thanks for the valuable feedback! I'm working on a small set of test cases that are parsed by most parsers, but that produce 'IRIs' that do not conform to the RFC IRI grammar.

There was also some discussion on the Semantic Web mailing list that may be of interest to someone reading this issue later: https://lists.w3.org/Archives/Public/semantic-web/2018Mar/0011.html