ashleysommer opened 3 months ago
Sounds similar to what the folks in the JavaScript RDF ecosystem have standardised on.
There's a bunch of specs listed at https://rdf.js.org/, including quad stream, source, sink, and store interfaces.
Thanks for the link @edmondchuc. I wasn't aware of the work at https://rdf.js.org/ but it makes sense they'd be up to date with the latest new implementation patterns.
Do you envision this overhaul would give access to a streaming interface to graphs stored in files (`.ttl`, `.jsonld`, etc.)? I.e., not needing to load an entire input file into memory, but instead having a generator as triples become completed (something loosely like `xml.etree.ElementTree.iterparse`)?
Also, if I missed that such an interface is currently available, I welcome a pointer.
Do you envision this overhaul would give access to a streaming interface to graphs stored in files ... having a generator as triples become completed
Yes, that is one of the benefits it would provide for parsers like `.ttl`, `.nt`, `.n3`, `.nquads`, and hextuples. (Except not for JSON-LD: the nature of JSON decoding using Python's `json.loads` or `ujson.loads` requires the whole document to be loaded into memory and decoded at once.)
And you're correct, none of the parser implementations currently in RDFLib provide an interface like that.
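To make the idea concrete, here is a minimal sketch (plain Python, not any actual RDFLib API) of the kind of generator-based parsing being discussed, handling only an IRI-only subset of N-Triples. Triples are yielded one at a time as lines are read, so the whole file never has to sit in memory:

```python
import re
from typing import Iterable, Iterator, Tuple

# Very simplified N-Triples-style line pattern: <s> <p> <o> .
TRIPLE_RE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.')

def stream_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield one triple per matching line, lazily."""
    for line in lines:
        m = TRIPLE_RE.match(line.strip())
        if m:
            yield m.groups()

# Works on any iterable of lines, e.g. an open file handle:
triples = stream_triples([
    "<http://ex.org/s> <http://ex.org/p> <http://ex.org/o> .",
])
```

A real implementation would of course need the full grammar (literals, blank nodes, escapes), but the consumer-facing shape -- an iterator of triples -- is the point.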
Yes, that is one of the benefits it would provide for parsers like `.ttl`, `.nt`, etc. [snip]
Thank you for clarifying.
On the JSON bits you mentioned: theoretically, if context dictionaries could be guaranteed/agreed upon to always come first in JSON objects, could a streaming interface be provided for JSON-LD too, using an iterative JSON parser? Is there a name for such a constrained-JSON version of JSON-LD?
Yes, that's right: if we switched to using an iterative JSON parser, and the context is expected/guaranteed to be the first object in the document, then it would be possible. However, I believe it would still be restricted to JSON-LD v1.0, because v1.1 allows additional embedded contexts deeper in the document.
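As an aside, the "context first" constraint can at least be *validated* with the standard library today (though `json.loads` itself still reads the whole document, so this is a check on the constraint, not a streaming parse). A sketch using `object_pairs_hook`, which receives keys in document order for every nested object:

```python
import json

def context_first(pairs):
    # object_pairs_hook is invoked with the key/value pairs of every
    # JSON object, in document order, so this enforces the
    # "@context comes first" constraint at every nesting level.
    keys = [k for k, _ in pairs]
    if "@context" in keys and keys.index("@context") != 0:
        raise ValueError("@context must be the first key")
    return dict(pairs)

ok = json.loads(
    '{"@context": {"name": "http://schema.org/name"}, "name": "Alice"}',
    object_pairs_hook=context_first,
)
```

A document with `"@context"` after other keys would raise `ValueError` under the same hook.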
I did have the additional-embedded-contexts twist in mind when I asked whether the context dictionary comes first in JSON objects -- I meant that at whatever nesting level.
I suppose that specific sort-constrained variant of JSON-LD doesn't have a name, though? Or a callout in some canonicalization process?
RFC 8785, Section 3.2.3, which canonicalizes JSON (not JSON-LD), prescribes a sort order for object keys. Unfortunately, it would sort keys starting with a digit before any `@`-prefixed key (per ASCII code order), which could put a non-zero portion of a graph before a context dictionary if any properties end up serialized with leading digits.
{
"0d-boundary-points": [{"@id": "60a0ba8a-0fd0-44bb-8d74-fe926e5d7b0b", "@type": "Point"}],
"1d-boundary-lines": [{"@id": "c2af7e4b-bcc4-4d28-842e-fb864374b90a", "@type": "Line"}],
"2d-boundary-surfaces": [{"@id": "fd592932-3f4e-4c2f-8a1e-1bb335228666", "@type": "Surface"}],
"@context": {
"@base": "http://example.org/kb/",
"0d-boundary-points": "http://example.org/ontology/0d-boundary-points",
"1d-boundary-lines": "http://example.org/ontology/1d-boundary-lines",
"2d-boundary-surfaces": "http://example.org/ontology/2d-boundary-surfaces",
"Line": "http://example.org/ontology/Line",
"Point": "http://example.org/ontology/Point",
"Surface": "http://example.org/ontology/Surface",
"label": "http://www.w3.org/2000/01/rdf-schema#label"
},
"@id": "4190cd0b-0cee-4b72-a5f5-8a247a76d428",
"label": "A spatial thing"
}
(That graph does parse like I expect it to - see below for N-Triples form.)
(I don't recall whether there is a restriction on IRI local names preventing leading digits, but Turtle's grammar, particularly the `PN_LOCAL` terminal via `PrefixedName` via `iri`, suggests to me it's fine.)
So, it seems to me that even if JSON-LD were passed as canonicalized JSON, there would still need to be some buffering logic in a "streaming" JSON-LD parser to hold graph portions that need to wait for context dictionaries at their object nesting level. And, this streaming design faces some unfortunate cases if properties can be serialized with leading digits.
It seems most, maybe all, of my prior comment is already covered by a specification that is part of JSON-LD 1.1: Streaming JSON-LD.
How could this be exposed in RDFLib? From Python-space, a flag on the JSON-LD serializer? From the command line, `rdfpipe --format='application/ld+json;profile=http://www.w3.org/ns/json-ld#streaming'` (per Section 3.5) looks ... technically correct, though inviting a simpler form.
It seems most, maybe all, of my prior comment is already covered by a specification that is part of JSON-LD 1.1: Streaming JSON-LD.
[snip]
On specifically JSON-LD streaming, a possible blocker: whether the JSON-LD version RDFLib implements is 1.1 or not should first be confirmed (Issue 2996).
A new interface would break existing parsers, right? I would still appreciate a new interface; the old one is not fun to work with. Also, in the current one, AFAIK there is no autodocumentation for the current implementation, so optional parameters are very obscure by default.
It has been on my mind for a couple of years that the existing Parser and Serializer interface is a mess.
The current standard Parser interface predates multigraph support (from before `Dataset` and even before `ConjunctiveGraph`), so there is a terrible workaround that involves building a `Graph` instance using the `store` backend of a multigraph and passing that to the parser; the parser will then create a new `ConjunctiveGraph` instance internally using the same `store` backend as the `Graph`, then parse into that.

The current interface also predates Unicode support in Python. All parsers and serializers originally consumed or emitted a byte stream. Unicode support was later hacked into some, but not all, parsers and serializers, for use in Python 2.6. When Python 3 changed `unicode` to `str`, and old-`str` to `bytes`, this complicated things a whole bunch, and the parser/serializer interface still hasn't recovered.

There has been some recent work done to rectify some of these problems (and fix some very long-standing bugs) that will be released as part of RDFLib v7.1.0, but it's hard to make many drastic changes without introducing breaking changes to the Parser/Serializer interface.
Recently I had the privilege of reading some of the Oxigraph source code (it's a very well-written and extremely high-performance Rust-based SPARQL engine with RDF parser and serializer support). I noticed a pattern in the Oxigraph source code that inspired some further thought about completely redesigning the RDFLib parser/serializer interface.
Oxigraph implements parsers as a Quad-source, and serializers as a Quad-sink. Constructing a parser (giving it a file to parse) returns a Generator object. Iterating the generator causes the parser to yield quads as it parses the file. Similarly, invoking a serializer requires passing in an iterator over a set of quads. The fun byproduct of this pattern is that you can implement a format converter simply by piping a parser into a serializer. No Graph needed.
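In Python terms, the source/sink pattern might look something like the following sketch (hypothetical names, a toy whitespace-separated "format" standing in for a real one):

```python
from typing import Iterable, Iterator, Tuple

Quad = Tuple[str, str, str, str]

def parse_quads(lines: Iterable[str]) -> Iterator[Quad]:
    # Quad-source: lazily yield one quad per whitespace-separated line.
    for line in lines:
        s, p, o, g = line.split()
        yield (s, p, o, g)

def serialize_quads(quads: Iterable[Quad]) -> Iterator[str]:
    # Quad-sink: consume any iterable of quads, emit output lines.
    for q in quads:
        yield " ".join(q)

# Format conversion by piping a parser into a serializer -- no Graph needed:
lines = ["<s1> <p1> <o1> <g1>", "<s2> <p2> <o2> <g1>"]
out = list(serialize_quads(parse_quads(lines)))
```

Because both ends speak "iterable of quads", a Graph (or any store) becomes just one possible source or sink among many, rather than a mandatory intermediary.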
Python supports everything required to make this pattern work in RDFLib. Python even has Async Generators, so we could use this interface to support concurrent async parsing and serializing.
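The async variant follows the same shape; a minimal sketch using async generators (hypothetical names, with `asyncio.sleep(0)` standing in for awaited I/O):

```python
import asyncio
from typing import AsyncIterator, Iterable, Tuple

Quad = Tuple[str, str, str, str]

async def aparse(lines: Iterable[str]) -> AsyncIterator[Quad]:
    # Hypothetical async quad-source.
    for line in lines:
        await asyncio.sleep(0)  # stand-in for awaiting real I/O
        s, p, o, g = line.split()
        yield (s, p, o, g)

async def aserialize(quads: AsyncIterator[Quad]) -> list:
    # Hypothetical async quad-sink, collecting serialized lines.
    return [" ".join(q) async for q in quads]

result = asyncio.run(aserialize(aparse(["<s> <p> <o> <g>"])))
```

The pipe composition is unchanged; only the iteration protocol differs (`async for` instead of `for`).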
This post is just the introduction to this idea. I want to write up some example code and see what is required to put together a new "RDFLib Serializer Interface Standard" and "RDFLib Parser Interface Standard" that allow all parsers and serializers to be implemented in a common manner. I want to get feedback and suggestions from contributors and interested parties, so we can make this as useful as possible for everyone.