drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License
86 stars 15 forks source link

Unable to parse triple-quoted literal #33

Closed wouterbeek closed 2 years ago

wouterbeek commented 2 years ago

The following triple is valid according to rule 25 in the Turtle grammar:

[]a""""""".

But Serdi emits the following error when parsing:

_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "" .
error: /home/wouter/tmp/test.ttl:1:10: missing ';' or '.'

This was observed with the last commit on the master branch, but may appear in earlier commits/versions as well.

drobilla commented 2 years ago

Looks like it's been that way for a while, probably forever (rapper doesn't parse it either). I'll fix it, thanks.

drobilla commented 2 years ago

The following triple is valid according to rule 25 in the Turtle grammar:

Wait... is it?

We have:

Pat: '"""' (('"' | '""')? ([^"\] | ECHAR | UCHAR))* '"""'
Str: """""""

After the first three:

Pat: (('"' | '""')? ([^"\] | ECHAR | UCHAR))* '"""'
Str: """"

Match '""', leaving:

Pat: (([^"\] | ECHAR | UCHAR))* '"""'
Str: ""

But the first character ('"') doesn't match that (both ECHAR and UCHAR start with '\')

wouterbeek commented 2 years ago

Thanks for looking into this @drobilla !

In your example parse you choose the two double quotes ('"") over the single double quote ('"'), even though both are allowed by rule [25]:

[25] (('"' | '""')? ([^"\] | ECHAR | UCHAR))* '"""'

Is your motivation for preferring the two double quotes over the single double quote captured in the following note?

  1. When tokenizing the input and choosing grammar rules, the longest match is chosen.

If so, is my following conclusion correct?

Conclusion: The reported literal (""""""") does not conform to the Turtle grammar because in rule [25] the two double quotes ('""') comprise one token, and this one token is longer than the single double quote ('"') which is also one token.

drobilla commented 2 years ago

Yeah, something like that, but I'm not sure. The whole thing (STRING_LITERAL_LONG) is the token, and the spec is unclear to me about the tokenization. I assumed that bit you quoted there translates to being greedy. This stuff can get confusing sometimes because serd is a byte-wise parser that doesn't have a separate tokenization step. Before RDF 1.1 when the parser was originally written, Turtle was a nice clean grammar to parse this way, but RDF 1.1 added a bunch of lookahead and other confusing ambiguous-feeling cases from SPARQL. That said, this rule is reasonable/useful (I am very unimpressed with the mess the WG made of Turtle in RDF 1.1 in general, but that's a whole other discussion...).

Regardless, I'll try to make it work. Pedantry aside, it seems obvious that the only sensible interpretation of this is a string of a single ". rdfpipe does parse it. It's tricky because this requires even more lookahead (4 chars I think?) than all the messy . cases, though, and there's nothing else like this in the parser. Since all the relevant characters are the same one, though, I think I can come up with something decent...

Out of curiosity, where did you encounter this?

drobilla commented 2 years ago

Hm, so, I figured out a way to parse this, but...

[ FAILED ] test/TurtleTests/turtle-syntax-bad-string-06.ttl

That (negative) test is:

# Long literal with 4"
@prefix : <http://www.w3.org/2013/TurtleTests/> .
:s :p """abc""""@en .

So it seems that this is not valid after all? I don't see how this test could fail, but the original example could work. As it happens, rdfpipe successfully parses that test without an error, i.e., fails the Turtle test suite.

... except it allegedly passed according to https://w3c.github.io/rdf-tests/turtle/reports/#individual-test-results ¯\(ツ)

It seems like better behaviour to support this (I don't see any potential problems with it anyway), but I am not comfortable with making serd non-conformant to do it. I guess I could try to make it work in lax mode, but I suspect that will be pretty messy.

wouterbeek commented 2 years ago

If lexical form """abc"""" is part of a negative test then that definitively clears up this issue I think. The parser must choose the longest token conceptually (whether or not tokenization is implemented as a separate step does not matter), so the sequence """abc""" is parsed as a three-character string literal. The last double quote is not parsed as part of the literal and causes a syntax error.

In our example the same thing happens: the first three double quotes are the opening delimiter. Since the longest token is preferred, the next three double quotes are chosen, which forms the closing delimiter (effectively parsing the empty string literal). The last double quote is not parsed as part of the literal and causes a syntax error.

^ Notice that this parse is slightly different from your earlier explained parse. You chose the two double quotes since they formed a longer token than the single double quote. But three double quotes form an even longer token. So the parser still fails on the lexical form that consists of seven double quotes, but for a slightly different reason.

The whole thing (STRING_LITERAL_LONG) is the token,

No, this is not true: STRING_LITERAL_LONG is a terminal but not a token. If it would be a token then the negative test case should have been a positive test case instead (because """abc"""" (four trailing double quotes) is longer than """abc""" (three trailing double quotes) and the lexical form would be legal).

I guess I could try to make it work in lax mode

No, I don't think this is needed (or wanted). I now believe that my issue is invalid and that Serd is currently doing the right thing already.

Out of curiosity, where did you encounter this?

We host a free-to-use triple store over at https://triplydb.com. People upload all kinds of data files there :-)

Thank you for the good discussion in this issue. You helped me recognize that my 'bug' was not a bug at all.

drobilla commented 2 years ago

I see. Yep, makes sense to me. Thanks for digging in to it.