drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License

ShEx support #23

Open ericprud opened 5 years ago

ericprud commented 5 years ago

Feel like working with me on ShEx support?

drobilla commented 5 years ago

Sure, though I'm not really familiar with ShEx at all. I will read up and get a better idea of how this might fit into serd. Were you just thinking of syntax support, or...?

ericprud commented 5 years ago

I was thinking of a full implementation. It's not a ton of code and I already have the yacc grammar. Though I guess you don't already have yacc and bison in your build dependencies.

drobilla commented 5 years ago

Yeah, there are no dependencies at all. Some of the unique things about serd stem from it being a hand-hacked parser, which is pretty tedious to write and maintain but lets me control everything.

Are you thinking ShExC? That would probably be quite some work, but ShExJ would be easier since I already have JSON reading code lying around (even though JSON-LD isn't in master yet... herculean effort, that one, and serd could only ever support a subset since the spec doesn't allow streaming. Oh well).

drobilla commented 5 years ago

Although that makes me realize an important question: can ShEx be parsed as a stream (i.e. emitted as a sequence of statement(s, p, o) calls in the same order they are found in the document) without significant readahead? Serd is fundamentally based on this; things that can't stream don't really fit.

(Sorry if this is obvious, I haven't found the time to read the spec in detail yet)
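To make the streaming requirement concrete, here is a minimal sketch of the pattern in C. It is not serd's actual API: `StatementFunc`, `read_statements`, and the whitespace-separated "s p o" input format are all hypothetical, chosen only to show statements being handed to a sink in document order with nothing buffered beyond the current one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sink type (not serd's real API): called once per statement. */
typedef int (*StatementFunc)(void*       handle,
                             const char* s,
                             const char* p,
                             const char* o);

/* Toy reader: the input is whitespace-separated "s p o" terms.  Each
   statement is handed to the sink as soon as it is read, in document
   order, and only one statement is held in memory at a time. */
static int
read_statements(const char* text, StatementFunc sink, void* handle)
{
  char s[64];
  char p[64];
  char o[64];
  int  n = 0;

  assert(text && sink);
  while (sscanf(text, "%63s %63s %63s%n", s, p, o, &n) == 3) {
    const int st = sink(handle, s, p, o);
    if (st) {
      return st; /* The sink asked to stop */
    }
    text += n; /* Advance past the statement just emitted */
  }
  return 0;
}

/* Example sink that counts statements and remembers the last subject. */
typedef struct {
  int  count;
  char last_subject[64];
} Counter;

static int
count_statement(void* handle, const char* s, const char* p, const char* o)
{
  Counter* counter = (Counter*)handle;
  (void)p;
  (void)o;
  ++counter->count;
  strcpy(counter->last_subject, s);
  return 0;
}
```

A format fits this model if a reader like `read_statements` can call the sink without first seeing the rest of the document; a format that needs arbitrary readahead before the first call does not.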

ericprud commented 5 years ago

ShExJ makes a lot of sense. There are plenty of tools to convert between ShExC and ShExJ if folks want to work in ShExC.

Re streaming, I guess everything is stream-able if you are willing to buffer enough. I believe @iovka and Jérémie Dusart are working on something related to this. The challenge is that validation is typically top-down, e.g. you start by validating <Obs1> as <ObservationShape>. In the process of that, you must then validate <Patient2>@<PatientShape>. The big challenge is: at what point do you decide you've seen all of the triples related to <Obs1> or <Patient2>?

This is similar to the problem of serialization; at some point you decide that you're not waiting for more triples from some node and you go ahead and write a . or ]. (Making a bad call doesn't seem as dire in serialization because you can always write a node out again, but that's not true of an anonymous blank node.) While we can construct screw cases, I expect we can address a lot of bulk-validation use cases with some heuristics to say when we assume we have all arcs out of a node. Particularly easy would be nested anonymous BNodes such as what you see in FHIR/RDF.
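One such heuristic can be sketched in C, assuming the data arrives grouped by subject, as most Turtle writers produce it: treat a node as complete as soon as the subject changes. Everything here (`Grouper`, `NodeDoneFunc`, the fixed-size subject buffer) is hypothetical and only illustrates the idea; a document that returns to an earlier subject is exactly the kind of screw case it gets wrong.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical callback: called when a node is assumed to be complete. */
typedef void (*NodeDoneFunc)(void* handle, const char* subject, unsigned n_arcs);

typedef struct {
  char         subject[128]; /* Subject currently being collected */
  unsigned     n_arcs;       /* Arcs seen so far for that subject */
  NodeDoneFunc done;
  void*        handle;
} Grouper;

static void
grouper_init(Grouper* g, NodeDoneFunc done, void* handle)
{
  g->subject[0] = '\0';
  g->n_arcs     = 0;
  g->done       = done;
  g->handle     = handle;
}

/* Report the current node (if any) as complete, e.g. at end of input. */
static void
grouper_flush(Grouper* g)
{
  if (g->n_arcs) {
    g->done(g->handle, g->subject, g->n_arcs);
    g->n_arcs = 0;
  }
}

/* Feed the subject of each statement in document order.  Heuristic
   (an assumption, not a guarantee): a change of subject means we have
   seen every arc out of the previous one. */
static void
grouper_statement(Grouper* g, const char* subject)
{
  assert(strlen(subject) < sizeof(g->subject));
  if (g->n_arcs && strcmp(g->subject, subject)) {
    grouper_flush(g);
  }
  if (!g->n_arcs) {
    strcpy(g->subject, subject);
  }
  ++g->n_arcs;
}

/* Example callback that records how many nodes were completed. */
typedef struct {
  unsigned n_nodes;
  unsigned last_arcs;
} Log;

static void
log_done(void* handle, const char* subject, unsigned n_arcs)
{
  Log* log = (Log*)handle;
  (void)subject;
  ++log->n_nodes;
  log->last_arcs = n_arcs;
}
```

A real validator would collect the arcs themselves rather than a count, and would run the shape check inside the completion callback.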

drobilla commented 5 years ago

I think needing a model for validation itself is fine, and assumed that'd be the case, though streaming validation would be awesome if possible.

To support reading ShEx* and writing the corresponding Turtle (or building a model out of it), though, that would need to be streamable. Essentially, in order to parse a file, serd needs to be able to spit it out as triples as it goes. Seems like this should be possible here (maybe with some restrictions on key order, as it goes with JSON-LD, but I'm not sure in this case).

I imagine it would look something like parsing the ShEx file into a model, then having a function that takes that, and a data model, and validates one against the other (or, alternatively, just mash them all in the same model if that makes sense for ShEx).
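As a rough sketch of that shape of API, assuming both the schema and the data have already been loaded into simple in-memory models: the `Model` and `Triple` types, the `validate_node` function, and the `requires` predicate below are all invented for illustration. They are neither serd's API nor real ShEx semantics, which are considerably richer than "this predicate must be present".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory model: a flat array of string triples. */
typedef struct {
  const char* s;
  const char* p;
  const char* o;
} Triple;

typedef struct {
  const Triple* triples;
  size_t        n_triples;
} Model;

/* Return true if the model has at least one arc (node, predicate, *). */
static bool
model_has_arc(const Model* model, const char* node, const char* predicate)
{
  for (size_t i = 0; i < model->n_triples; ++i) {
    const Triple* t = &model->triples[i];
    if (!strcmp(t->s, node) && !strcmp(t->p, predicate)) {
      return true;
    }
  }
  return false;
}

/* Validate `node` in `data` against `shape` in `schema`.  The schema is
   itself a model, using a made-up vocabulary: (shape, "requires", p)
   means the node must have at least one arc with predicate p. */
static bool
validate_node(const Model* schema,
              const Model* data,
              const char*  node,
              const char*  shape)
{
  assert(schema && data && node && shape);
  for (size_t i = 0; i < schema->n_triples; ++i) {
    const Triple* t = &schema->triples[i];
    if (!strcmp(t->s, shape) && !strcmp(t->p, "requires") &&
        !model_has_arc(data, node, t->o)) {
      return false; /* A required predicate is missing */
    }
  }
  return true;
}
```

Keeping the schema and data in separate models keeps the two roles explicit; mashing them into one model would only change which `Model` pointer is passed for each argument.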