drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License

Write canonical NTriples 1.1 by default #35

Open plasticfist opened 2 years ago

plasticfist commented 2 years ago

(Edited) The output does not appear to be UTF-8. Is this a bug? I thought UTF-8 would be the default, given that there is an option to "Write ASCII output if possible".

Example:

source triple from dbpedia/article-templates_lang\=en_nested.ttl:

<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang\=en_nested.ttl
article-templates_lang=en_nested.ttl: UTF-8 Unicode text

serdi output:

<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang\=en_nested-serdi.nt
article-templates_lang=en_nested-serdi.nt: ASCII text, with very long lines

apache jena riot output:

<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang\=en_nested.ttl.bz2-riot.nt
article-templates_lang=en_nested.ttl.bz2-riot.nt: UTF-8 Unicode text

Spec Reference: https://www.w3.org/TR/n-triples/#canonical-ntriples

Note: At first I thought this might be a BOM-related rendering/display issue, but file would have revealed a BOM if there were one, and the same tools were used to find and display the examples above.
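For anyone who wants to repeat the check above without `file`, here is a minimal sketch in Python. The function name `describe_encoding` is made up for illustration, and this only roughly approximates what `file` reports (it is not `file`'s actual detection logic): it looks for a UTF-8 BOM, then tries ASCII, then UTF-8.

```python
def describe_encoding(data: bytes) -> str:
    """Roughly classify bytes as BOM-prefixed UTF-8, ASCII, UTF-8, or unknown.

    Hypothetical helper for illustration; not how `file` is implemented.
    """
    # A UTF-8 byte order mark is the three bytes EF BB BF at the start.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with BOM"
    # Pure ASCII means every byte is below 0x80.
    try:
        data.decode("ascii")
        return "ASCII"
    except UnicodeDecodeError:
        pass
    # Anything else is accepted only if it is well-formed UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "unknown"


if __name__ == "__main__":
    # The escaped serdi-style IRI is pure ASCII; the raw one is UTF-8.
    print(describe_encoding(rb"<http://dbpedia.org/resource/Andr\u00E9>"))
    print(describe_encoding("<http://dbpedia.org/resource/André>".encode("utf-8")))
```

On the examples above, the escaped serdi output should classify as ASCII and the riot output as UTF-8, which matches what `file` printed.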

drobilla commented 2 years ago

This is a holdover from back in the day when NTriples was ASCII. serd now supports RDF 1.1 NTriples, which is UTF-8, but the command-line tool behaviour is still the same. The upcoming major version is more precise about this and lets you mix and match all kinds of options to get what you want.

I'm not sure if the default could be changed without breaking things for people in the current version. Maybe? I agree that the option existing (it's meant for Turtle) makes this confusing, but I'm hesitant to change it and potentially break people's existing scripts/workflows/whatever...

drobilla commented 2 years ago

For reference, this is how the new command-line tool interfaces look: https://drobilla.net/files/serd_man_pages/ where serd-pipe is the closest thing to serdi. So the default will be UTF-8 everywhere, but you can use -O ascii to ASCIIfy any syntax. This also lets you do nice things like write a "flat Turtle" file, like NTriples but with namespace prefixes, and so on.

joelduerksen commented 2 years ago

I understand and can empathize with backwards compatibility, but the (current) specs seemed to be clear on this question, or I thought so on first read.

Quote: "The content encoding of N-Triples is always UTF-8." Reference: 6. Media Type and Content Encoding

That said, I have to say they seem to walk back the clear directive in section 6.1 (if the document is served as text/plain, it would be ASCII with escapes, etc.). I guess this gets into the nuances of "web document types" as opposed to files, so when working outside that framework it is left up to individual interpretation. sigh.

drobilla commented 2 years ago

ASCII is a subset of UTF-8. In other words, the output of serdi is UTF-8, and valid N-Triples.

It's not canonical RDF 1.1 N-Triples though, because escaping like this is not allowed there (see link in OP).
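As a stopgap for existing serdi output, the escapes can be undone in a post-processing step. Below is a minimal sketch in Python, not part of serd; the name `unescape_non_ascii` is hypothetical. It assumes one triple per line and only rewrites `\uXXXX` and `\UXXXXXXXX` escapes for code points above ASCII. ASCII escapes are deliberately left alone, since some of them (a space inside an IRI, for example) cannot be written raw. It does not apply any of the other canonical rules.

```python
import re

# Match any backslash escape as a unit, so that an escaped backslash followed
# by "u00E9" (as can appear in a literal) is consumed as a pair and is not
# mistaken for a \u escape.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.)")


def unescape_non_ascii(line: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes of non-ASCII characters.

    Hypothetical helper for illustration; all other escapes are kept as-is.
    """

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body[0] in "uU" and len(body) > 1:
            code_point = int(body[1:], 16)
            is_surrogate = 0xD800 <= code_point <= 0xDFFF
            if 0x80 <= code_point <= 0x10FFFF and not is_surrogate:
                return chr(code_point)
        # ASCII escapes, surrogates, out-of-range values, and other escapes
        # such as \" or \n are returned unchanged.
        return match.group(0)

    return _ESCAPE.sub(replace, line)


if __name__ == "__main__":
    escaped = r"<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau>"
    print(unescape_non_ascii(escaped))
```

Write the result out as UTF-8 (for example, `open(path, "w", encoding="utf-8")`) so the bytes on disk match what riot produces.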

plasticfist commented 2 years ago

Ok, I'll rephrase the ticket request: I would like a command-line tool that outputs canonical N-Triples (no escaped characters). Whether you make it the default or not is up to you, as long as it is possible. I wouldn't mind adding --canonical to the command line if required. No worries here.

drobilla commented 2 years ago

Sure, I was just responding to the above comment. If you want this right now, I suggest building the serd1 branch from git and using serd-pipe. My top priority is getting the new major version out, so there will probably not be any more non-trivial releases of 0.x.x.

I'll make a note to double-check the other canonical rules and make sure that the default output adheres to them, but I think it does.