Add canonical parsing of lexical forms

wouterbeek commented 6 years ago

A datatype defines its own functions, mapping the lexical form to the value space, and mapping values from the value space to the lexical space. Oftentimes, there is a canonical variant of the latter, which allows values to be serialized in one, canonical way. It would be great if Serd would allow literals to be parsed in a way that guarantees that the lexical forms of some common datatypes are written down in a canonical way.

Here are some examples of non-canonical lexical forms and their canonical variants:

"01"^^xsd:int ⇒ "1"^^xsd:int
"0.0001"^^xsd:double ⇒ "0.1E-3"^^xsd:double
"1"^^xsd:boolean ⇒ "true"^^xsd:boolean

Canonical lexical forms are preferred over non-canonical ones, because (1) they allow value equality to be determined simply by comparing lexical form equality, and (2) they allow the relative ordering of many values (i.e., < and >) to be determined based on a simple comparison of their lexical forms.

There are many use cases in which these two properties have a profound impact. Two use cases of only the former property are (1) SPARQL filter expressions that check for literal value equality, and (2) graph navigation where the nodes "1"^^xsd:boolean and "true"^^xsd:boolean should be treated as the same node.

drobilla commented 6 years ago

Hm. I think this would be much easier to implement when writing rather than reading. I'll have to think about it.

Do you know of an existing RDF-based test suite for canonicalization, or do I need to wade through the xsd specs and write my own?

wouterbeek commented 6 years ago

I'm not sure about the test suite. There is one for XML Schema 1.0 (link), but I'm not sure whether it covers datatype canonicity, nor whether there is a test suite for XML Schema 1.1. However, I would be able to collaboratively create such a test suite.

The XSD specification is quite lengthy and -- at points -- difficult to process, but I can help out there as well.

drobilla commented 6 years ago

Whew, this is a can of nebulous enterprisey worms (like everything XML). I haven't managed to find anything all that comprehensive, or even existing implementations of just this part, but I suppose "comprehensive" isn't really in the cards anyway. Which raises the question: is having only some canonicalization even useful?

Architecturally, I'm not sure if serd as it is would be the place for this. For one thing there's no value space in serd whatsoever, it's purely syntactic and aims to do lossless round-tripping. That said, I've often wished for value support for things like numbers and bool, but I'm not sure I'd always want to pay the price either. It seems more or less necessary to go to value space first, then write values from this canonically. Some would be pretty easy, like boolean (there's only 4 syntactically valid values, though even this introduces the issue of actually caring about literal validity at all), but things like dates and even decimal seem like much more significant undertakings.

I'll think about how to architecturally add this ability without tossing all the advantages of serd out with the bathwater (being much smaller and faster than everything else is its raison d'être). Maybe support for simpler types can be done elegantly, but I suspect doing this right might be material for serd1 (i.e. an API/ABI break).

drobilla commented 6 years ago

I'm currently working on a major version overhaul, and one thing that stands out is that serd already knows about xsd:decimal, xsd:integer, and xsd:boolean, and has code to go between value and serialization in both directions.

Thing is... "value" is machine integers in this case. So, I think I could probably implement normalization of a sort by simply running through those steps (syntax => value => syntax), but it would be restricted by the range of machine numbers.

For xsd:float, xsd:double, and xsd:boolean, this is a non-issue (the current code to serialize these from values is probably already canonical but I'd have to verify that). I'm not sure about xsd:decimal or xsd:integer though.

I suppose that's enough to probably be useful, though. I could provide a serd_node_canonicalize function that apps like hdt can use while reading if they like, and also add an option to serdi to run nodes through it, without losing any of the speed and perfect round-tripping capabilities.

wouterbeek commented 6 years ago

Interesting stuff! Let's see whether I understand what you are proposing: add support for serializing literals that use a core set of datatype IRIs in a way that is unique to Serd, but that is not necessarily identical to the canonical mappings specified in XSD.

I agree that xsd:float and xsd:double should be fine. C99 seems to introduce support for IEEE 754 floating point support, which is also what XSD is based on.

The xsd:decimal grammar is reused in a large part by several other numeric datatype IRIs. Specifically, xsd:integer is merely a restriction on xsd;decimal.

Another larger collection of datatype IRIs are restrictions of the xsd:dateTime grammar.

If needed, I can help out with testing this.

drobilla commented 6 years ago

Not quite, what's supported would be xsd-canonical, just limited to a pretty small set of types. bool is trivial, and yeah, IEEE 754 has effectively been (more or less) what machines really have for a long time now. So these ones are low-hanging fruit, since a value type already exists.

This is not true for decimal, dateTime, and so on, so I'm not sure about those. Maybe a serialization-to-serialization normalization algorithm exists for those somewhere that isn't too complicated, but I'd have to look into this more. The value space for dateTime does seem pretty simple, basically just a tuple of integers, but the time zone stuff might be a nightmare...

drobilla commented 5 years ago

So I finally got around to doing some real work towards this.

It turns out that, for this approach to work, you (obviously) need a lossless conversion to and from value space, which serd's floating point routines really weren't. So, I've been working on that part. It also turns out that this is a notoriously difficult problem (and unfortunately I can't just use the stdlib here for various reasons), but after many more full days working on that than I'd like, I have a float->string and string->float routine that always uses canonical xsd formats and is lossless (exhaustively tested for all possible floats) .

So, once that's landed, the guts are there for doing float and double properly at least. The other primitive types (bool, int, and so on) should also be easy. As mentioned earlier, a different approach will be necessary for unbounded things like decimal, but after some tinkering around I think some relatively straightforward string hackery can make sure those are canonicalized. So that should cover all the numbers.

Dates and times I'm still not sure about, but I'm really hoping that it's not lossy (or too lossy) to go through the standard C types for that, because I'd really like to not invent my own... conveniently, c18n format for dateTime is always GMT, so I think this might be possible without the time zone support, but we'll see.

The seemingly total absence of existing reusable implementation or even test suites or anything for this stuff is still really weird to me, but so it goes. The XML era was a weird time.

wouterbeek commented 5 years ago

Are you aware of the following test suite: https://www.w3.org/XML/2004/xml-schema-test-suite/index.html, or does it not include the type of tests that you want to perform?

drobilla commented 4 years ago

Nothing useful there as far as I can tell, but (as with so many things xsd) maybe I just can't find it.

Anyway, it turns out that lossless floating point decimal conversion is even harder than I thought, and it was wrong for doubles back then. Several months later, I've aggressively tested that stuff and am pretty confident it actually round-trips for all doubles. I'm not a huge fan of how much additional code that took, but c'est la vie...

So, finally having that, I added some basic c14n for xsd types. I couldn't find anything particularly useful out there, so I just wrote it all myself. I'm not super confident in this compared to all the other battle-hardened and standards-based stuff, but it works for the cases I thought up anyway.

Just numeric and boolean types for now: https://github.com/drobilla/serd/blob/serd1/tests/normalise/test-normalise.ttl goes to https://github.com/drobilla/serd/blob/serd1/tests/normalise/test-normalise.nt

You can use serdi -n on the command line to normalise things, but I also made a branch of HDT here: https://github.com/rdfhdt/hdt-cpp/tree/serd1

I tried to write it conservatively so that invalid things are just passed through, but it's not a validating parser by any stretch of the imagination. This seems like the only viable approach, I do not like the idea of serd having sloppy heuristics that might mangle data. Potentially lossiness makes me a bit nervous, and I'm not sure how to go about attacking that, but if you want to run a bunch of data through it and see what comes out that could be useful.

drobilla commented 4 years ago

I also added https://github.com/drobilla/serd/commit/02125692bc70000ac49623249c187b8fd1849e93 to remove rdf:langString. It's unclear to me if this is really an okay thing to do (it's "invalid" in some sense anyway), but I encountered it in dbpedia dumps where this makes for a lot of useless noise.

drobilla / serd

Add canonical parsing of lexical forms #8