RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

[PROPOSAL] New Parser and Serializer interface #2897

Open ashleysommer opened 3 weeks ago

ashleysommer commented 3 weeks ago

It has been on my mind for a couple of years that the existing Parser and Serializer interface is a mess.

The current standard Parser interface predates multigraph support (it is from before Dataset, and even before ConjunctiveGraph), so there is a terrible workaround: the caller builds a Graph instance backed by the multigraph's store and passes that to the parser; the parser then internally creates a new ConjunctiveGraph instance over the same store backend and parses into that.
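A rough sketch of that indirection, just to show the shape of the workaround (the `parse_into` function and its body are hypothetical illustrations, not RDFLib's actual parser code):

```python
# Hypothetical illustration of the workaround described above; real parsers differ in detail.
from rdflib import Dataset, Graph
from rdflib.graph import ConjunctiveGraph

ds = Dataset()  # the multigraph the caller actually wants populated

# Step 1: wrap the multigraph's store in a plain Graph, because the Parser
# interface only knows how to accept a Graph as its sink.
shim = Graph(store=ds.store)

def parse_into(sink: Graph, data: str) -> None:
    # Step 2: a context-aware parser quietly rebuilds a multigraph view over
    # the same store backend, and parses into that instead of into `sink`.
    multigraph = ConjunctiveGraph(store=sink.store)
    ...  # tokenize `data` and add triples/quads via `multigraph`
```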

The current interface also predates Unicode support in Python. All parsers and serializers originally consumed or emitted a byte stream. Unicode support was later hacked into some, but not all, parsers and serializers for use on Python 2.6. When Python 3 renamed unicode to str and the old str to bytes, this complicated things considerably, and the parser/serializer interface still hasn't recovered.

There has been some recent work to rectify some of these problems (and fix some very long-standing bugs) that will be released as part of RDFLib v7.1.0, but it's hard to make many drastic changes without introducing breaking changes to the Parser/Serializer interface.

Recently I had the privilege of reading some of the Oxigraph source code (it's a very well-written and extremely high-performance Rust-based SPARQL engine with RDF Parser and Serializer support). I noticed a pattern in the Oxigraph source code that inspired some further thought about completely redesigning the RDFLib parser/serializer interface.

Oxigraph implements parsers as a quad-source, and serializers as a quad-sink. Constructing a parser (giving it a file to parse) returns a generator object. Iterating the generator causes the parser to yield quads as it parses the file. Similarly, invoking a serializer requires passing in an iterator over a set of quads. The fun byproduct of this pattern is that you can implement a format converter simply by piping a parser into a serializer; no Graph needed.
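To make that concrete, here is a minimal Python sketch of a quad-source parser and quad-sink serializer. The `QuadParser`/`QuadSerializer` names, the plain-tuple quad representation, and the trivial line format are all made up for illustration; this is not a proposed API.

```python
from typing import IO, Iterator, Tuple

# Quads represented as plain string tuples, purely for the sketch.
Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)

class QuadParser:
    """A quad-source: iterate it to pull quads out lazily while parsing."""

    def __init__(self, source: IO[str]):
        self.source = source

    def __iter__(self) -> Iterator[Quad]:
        for line in self.source:
            line = line.strip()
            if line:  # yield each quad as soon as its line is parsed
                s, p, o, g = line.rstrip(" .").split(" ", 3)
                yield (s, p, o, g)

class QuadSerializer:
    """A quad-sink: feed it any iterable of quads."""

    def __init__(self, sink: IO[str]):
        self.sink = sink

    def serialize(self, quads) -> None:
        for s, p, o, g in quads:
            self.sink.write(f"{s} {p} {o} {g} .\n")

# Format conversion is then just piping a parser into a serializer; no Graph needed.
with open("in.nq") as infile, open("out.nq", "w") as outfile:
    QuadSerializer(outfile).serialize(QuadParser(infile))
```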

Python supports everything required to make this pattern work in RDFLib. Python even has Async Generators, so we could use this interface to support concurrent async parsing and serializing.
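As a sketch only, an async flavour of the same idea might look like this; the `areadline`/`write` methods on `source` and `sink` are assumed async file-like objects, and nothing here exists in RDFLib today:

```python
from typing import AsyncIterator, Tuple

Quad = Tuple[str, str, str, str]

async def aparse(source) -> AsyncIterator[Quad]:
    # Yield quads as lines arrive, without blocking the event loop between reads.
    while line := await source.areadline():
        s, p, o, g = line.rstrip(" .\n").split(" ", 3)
        yield (s, p, o, g)

async def aserialize(quads: AsyncIterator[Quad], sink) -> None:
    async for s, p, o, g in quads:
        await sink.write(f"{s} {p} {o} {g} .\n")
```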

This post is just the introduction to this idea. I want to write up some example code and see what is required to put together a new "RDFLib Serializer Interface Standard" and a "RDFLib Parser Interface Standard" that allow all parsers and serializers to be implemented in a common manner. I want to get feedback and suggestions from contributors and interested parties, so we can make this as useful as possible for everyone.

edmondchuc commented 2 weeks ago

Sounds similar to what the folks in the JavaScript RDF ecosystem have standardised on.

There's a bunch of specs listed at https://rdf.js.org/, including quad stream, source, sink, and store interfaces.

ashleysommer commented 2 weeks ago

Thanks for the link @edmondchuc. I wasn't aware of the work at https://rdf.js.org/ but it makes sense they'd be up to date with the latest new implementation patterns.

ajnelson-nist commented 2 weeks ago

Do you envision this overhaul would give access to a streaming interface to graphs stored in files (.ttl, .jsonld, etc.)? I.e., not needing to load an entire input file into memory, but instead having a generator as triples become completed (something loosely like xml.etree.ElementTree.iterparse)?
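For concreteness, the sort of usage I have in mind is roughly this (`iter_triples` is purely hypothetical):

```python
# Hypothetical streaming usage, loosely analogous to xml.etree.ElementTree.iterparse:
# triples are yielded one at a time, so the whole .ttl file never sits in memory.
for s, p, o in iter_triples("large_input.ttl"):
    if str(p) == "http://www.w3.org/2000/01/rdf-schema#label":
        print(o)
```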

Also, if I missed that such an interface is currently available, I welcome a pointer.

ashleysommer commented 2 weeks ago

Do you envision this overhaul would give access to a streaming interface to graphs stored in files ... having a generator as triples become completed

Yes, that is one of the benefits it would provide for parsers of formats like .ttl, .nt, .n3, .nquads, and hextuples. (Except not for JSON-LD: the nature of JSON decoding with Python's json.loads or ujson.loads requires the whole document to be loaded into memory and decoded at once.)
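Roughly why line-oriented formats can stream while json.loads-based JSON-LD can't (sketch only; the real parsers do much more than this):

```python
import json

def stream_ntriples(fileobj):
    # One complete statement per line, so each can be yielded as soon as its
    # line has been read; memory use stays flat regardless of file size.
    for line in fileobj:
        line = line.strip()
        if line and not line.startswith("#"):
            yield line  # a real parser would decode the terms here

# By contrast, json.loads (or ujson.loads) only returns once the entire
# document has been read and decoded into a single in-memory object.
with open("data.jsonld") as f:
    doc = json.loads(f.read())
```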

And you're correct, none of the parser implementations currently in RDFLib provide an interface like that.

ajnelson-nist commented 2 weeks ago

Do you envision this overhaul would give access to a streaming interface to graphs stored in files ... having a generator as triples become completed

Yes, that is one of the benefits it would provide for parsers of formats like .ttl, .nt, .n3, .nquads, and hextuples. (Except not for JSON-LD: the nature of JSON decoding with Python's json.loads or ujson.loads requires the whole document to be loaded into memory and decoded at once.)

Thank you for clarifying.

On the JSON bits you mentioned: theoretically, if context dictionaries could be guaranteed/agreed upon to always come first in JSON objects, could a streaming interface be provided for JSON-LD too, using an iterative JSON parser? Is there a name for such a constrained-JSON version of JSON-LD?

ashleysommer commented 2 weeks ago

Yes, that's right: if we switched to using an iterative JSON parser, and the context were expected/guaranteed to be the first member of the document, then it would be possible. However, I believe it would still be restricted to JSON-LD v1.0, because v1.1 allows additional embedded contexts deeper in the document.
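As a sketch of what that could look like with an incremental JSON parser (here the third-party ijson library, which is not an RDFLib dependency), assuming @context always appears before any data keys at the top level:

```python
import ijson  # third-party incremental JSON parser, used here only for illustration

def stream_jsonld_top_level(path):
    context = None
    with open(path, "rb") as f:
        # kvitems("") walks the top-level object's key/value pairs in document
        # order, without first materialising the whole document.
        for key, value in ijson.kvitems(f, ""):
            if key == "@context":
                context = value  # contexts are small; load them eagerly
            else:
                # expand key/value against `context` and emit triples here;
                # this only works if the context was guaranteed to come first
                yield key, value, context
```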

ajnelson-nist commented 2 weeks ago

I did have the additional-embedded-contexts twist in mind when I asked whether the context dictionary comes first in JSON objects -- I meant that to apply at whatever nesting level.

I suppose that specific sort-constrained variant of JSON-LD doesn't have a name, though? Or, a callout in some canonicalization process?

RFC 8785, Section 3.2.3, which canonicalizes JSON (not JSON-LD), prescribes a sort order for object keys. Unfortunately, it would sort keys starting with a digit before any key starting with @ (per ASCII ordering), which could put a non-zero portion of a graph before a context dictionary if any properties end up serialized with leading numeric digits.

```json
{
    "0d-boundary-points": [{"@id": "60a0ba8a-0fd0-44bb-8d74-fe926e5d7b0b", "@type": "Point"}],
    "1d-boundary-lines": [{"@id": "c2af7e4b-bcc4-4d28-842e-fb864374b90a", "@type": "Line"}],
    "2d-boundary-surfaces": [{"@id": "fd592932-3f4e-4c2f-8a1e-1bb335228666", "@type": "Surface"}],
    "@context": {
        "@base": "http://example.org/kb/",
        "0d-boundary-points": "http://example.org/ontology/0d-boundary-points",
        "1d-boundary-lines": "http://example.org/ontology/1d-boundary-lines",
        "2d-boundary-surfaces": "http://example.org/ontology/2d-boundary-surfaces",
        "Line": "http://example.org/ontology/Line",
        "Point": "http://example.org/ontology/Point",
        "Surface": "http://example.org/ontology/Surface",
        "label": "http://www.w3.org/2000/01/rdf-schema#label"
    },
    "@id": "4190cd0b-0cee-4b72-a5f5-8a247a76d428",
    "label": "A spatial thing"
}
```

(That graph does parse like I expect it to - see below for N-Triples form.)

N-Triples render of the JSON-LD snippet, rendered using[^1] the command `rdfpipe --output-format nt example.jsonld | sort`:

```turtle
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://example.org/ontology/0d-boundary-points> <http://example.org/kb/60a0ba8a-0fd0-44bb-8d74-fe926e5d7b0b> .
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://example.org/ontology/1d-boundary-lines> <http://example.org/kb/c2af7e4b-bcc4-4d28-842e-fb864374b90a> .
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://example.org/ontology/2d-boundary-surfaces> <http://example.org/kb/fd592932-3f4e-4c2f-8a1e-1bb335228666> .
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://www.w3.org/2000/01/rdf-schema#label> "A spatial thing" .
<http://example.org/kb/60a0ba8a-0fd0-44bb-8d74-fe926e5d7b0b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ontology/Point> .
<http://example.org/kb/c2af7e4b-bcc4-4d28-842e-fb864374b90a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ontology/Line> .
<http://example.org/kb/fd592932-3f4e-4c2f-8a1e-1bb335228666> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ontology/Surface> .
```

[^1]: Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose.

(I don't recall if there is a restriction on IRI local names preventing leading digits, but Turtle's grammar, particularly the PN_LOCAL terminal reached via PrefixedName via iri, suggests to me it's fine.)

So, it seems to me that even if JSON-LD were passed as canonicalized JSON, there would still need to be some buffering logic in a "streaming" JSON-LD parser to hold graph portions that need to wait for context dictionaries at their object nesting level. And, this streaming design faces some unfortunate cases if properties can be serialized with leading digits.
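A minimal sketch of that buffering, assuming the parser hands us the key/value pairs of one object in document order (the `emit` and `expand` helpers are hypothetical):

```python
def consume_object(pairs):
    # pairs: an iterator of (key, value) for a single JSON object, in document order.
    context = None
    pending = []
    for key, value in pairs:
        if key == "@context":
            context = value
            for k, v in pending:  # flush everything buffered before the context arrived
                emit(expand(k, v, context))
            pending.clear()
        elif context is None:
            pending.append((key, value))  # cannot expand yet: buffer it
        else:
            emit(expand(key, value, context))
    if context is None:  # object had no context of its own
        for k, v in pending:
            emit(expand(k, v, None))  # fall back to an inherited/outer context
```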