Open plasticfist opened 2 years ago
This is a holdover from back in the day when NTriples was ASCII. serd now supports RDF 1.1 NTriples, which is UTF-8, but the command-line tool behaviour is still the same. The upcoming major version is more precise about this and lets you mix and match all kinds of options to get what you want.
I'm not sure if the default could be changed without breaking things for people in the current version. Maybe? I agree that the option existing (it's meant for Turtle) makes this confusing, but I'm hesitant to change it and potentially break people's existing scripts/workflows/whatever...
For reference, this is how the new command-line tool interfaces look: https://drobilla.net/files/serd_man_pages/ where serd-pipe is the closest thing to serdi. So the default will be UTF-8 everywhere, but you can -O ascii to ASCIIfy any syntax. This also lets you do nice things like write a "flat Turtle" file, like NTriples but with namespace prefixes, and so on.
I understand and can empathize with backwards compatibility, but the (current) specs seemed to be clear on this question, or I thought so on first read.
Quote: "The content encoding of N-Triples is always UTF-8." Reference: 6. Media Type and Content Encoding
That said, I have to say they seem to walk back the clear directive in section 6.1 (if the document is served as text/plain, it would be ASCII and escaped, etc.). I guess this gets into the nuances of "web document types" as opposed to files, so when working outside that framework it is left up to individual interpretation. Sigh.
ASCII is a subset of UTF-8. In other words, the output of serdi is UTF-8, and valid N-Triples.
It's not canonical RDF 1.1 N-Triples though, because escaping like this is not allowed there (see link in OP).
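To make the distinction in the two comments above concrete, here is a minimal Python sketch using the two lines quoted in this thread. The helper names `is_ascii` and `is_utf8` are made up for illustration and are not part of serd. It only shows that the escaped output is pure ASCII and therefore also decodes as UTF-8, while the unescaped form is UTF-8 but not ASCII.

```python
# IRIs copied from the serdi and riot examples in this thread.
escaped = r"<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau>"
raw = "<http://dbpedia.org/resource/André_Éric_Létourneau>"


def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def is_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


escaped_bytes = escaped.encode("utf-8")
raw_bytes = raw.encode("utf-8")

print(is_ascii(escaped_bytes), is_utf8(escaped_bytes))  # True True
print(is_ascii(raw_bytes), is_utf8(raw_bytes))          # False True
```

So both files are valid UTF-8 at the byte level; the difference is only whether non-ASCII characters are written directly or as \u escapes.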
OK, I'll rephrase the ticket request: I would like a command-line tool that outputs canonical N-Triples (no escaped characters). Whether you make it the default or not is up to you, as long as it is possible. I wouldn't mind adding --canonical to the command line if required. No worries here.
Sure, I was just responding to the above comment. If you want this right now, I suggest building the serd1 branch from git and using serd-pipe. My top priority is getting the new major version out, so there will probably not be any more non-trivial releases of 0.x.x.
I'll make a note to double-check the other canonical rules and make sure that the default output adheres to them, but I think it does.
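For anyone staying on 0.x in the meantime, one possible stopgap is to post-process the escaped output. The following is a rough Python sketch, not part of serd; the function name `unescape_non_ascii` is hypothetical. It assumes the input is otherwise well-formed N-Triples, rewrites only \uXXXX and \UXXXXXXXX escapes for code points above U+007F, and leaves every other escape (including escaped backslashes and ASCII-range escapes) untouched. It does not check the other canonical-form rules.

```python
import re

# A backslash followed by either a \u / \U escape or any other single
# character. Matching escape pairs as a unit keeps "\\u00E9" (an escaped
# backslash followed by plain text) from being misread as a \u escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|.)")


def _replace(match: "re.Match[str]") -> str:
    digits = match.group(1) or match.group(2)
    if digits is None:
        return match.group(0)  # not a \u or \U escape, keep as written
    code_point = int(digits, 16)
    # Keep ASCII-range escapes, surrogates, and out-of-range values as-is.
    if code_point < 0x80 or 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        return match.group(0)
    return chr(code_point)


def unescape_non_ascii(line: str) -> str:
    """Replace non-ASCII \\u and \\U escapes in one line with the characters."""
    return _ESCAPE.sub(_replace, line)


escaped = r"<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau>"
print(unescape_non_ascii(escaped))
```

Applied line by line to the serdi output and written back out as UTF-8, this should give the unescaped form shown in the riot example, but it is only a workaround under the assumptions stated above.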
(Edited) The output does not appear to be UTF-8. Is this a bug? I thought UTF-8 would be the default, given there is an option to "Write ASCII output if possible".
Example:
source triple from dbpedia/article-templates_lang\=en_nested.ttl
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .
$ file article-templates_lang\=en_nested.ttl
article-templates_lang=en_nested.ttl: UTF-8 Unicode text
serdi output:
<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .
$ file article-templates_lang\=en_nested-serdi.nt
article-templates_lang=en_nested-serdi.nt: ASCII text, with very long lines
apache jena riot output:
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .
$ file article-templates_lang\=en_nested.ttl.bz2-riot.nt
article-templates_lang=en_nested.ttl.bz2-riot.nt: UTF-8 Unicode text
Spec Reference: https://www.w3.org/TR/n-triples/#canonical-ntriples
Note: At first I thought maybe this was a BOM related rendering/display issue, but file would reveal if there is a BOM, and the same tools were used to find and display the examples above...
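As a cross-check that does not rely on file, a BOM can also be ruled out by looking at the first three bytes directly. This is a small Python sketch; the helper name `has_utf8_bom` is made up for illustration, and in practice the bytes would come from reading the start of the file in binary mode.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# In-memory stand-ins for the first bytes of a file.
without_bom = "<http://dbpedia.org/resource/André_Éric_Létourneau>".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(has_utf8_bom(without_bom))  # False
print(has_utf8_bom(with_bom))     # True
```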