Sveino / Inst4CIM-KG

Instance of CIM Knowledge Graph
Apache License 2.0

which SHACL validators to try? #95

Open VladimirAlexiev opened 1 month ago

VladimirAlexiev commented 1 month ago

Requirements:

Let's define which validation engines to try. Here's a proposal:

How about:

After we agree on the list, we need to research and list the limitations of every implementation. This may eliminate some candidates.

@HarisVranaj please attach the presentation you showed two days ago (I hope it's not confidential). @griddigit-ci and @Sveino please comment on the proposal above, and I'll correct the list.

Sveino commented 1 month ago

I agree with the requirements. Ideally, the UML/information model would be expressive enough that we only need UML restrictions. However, the world is more complicated. To avoid a very technical UML/information model, we will use a logical description of the constraints. This does not need to be processed as is, but can be converted to a form suitable for execution. This should be the primary motivation for not including SPARQL. The secondary motivation is that we want engines that are optimised to execute well-known constraint patterns. So our primary test of the SHACL validation engines is to check that the rules in our SHACL are not based on a wrong understanding or biased towards a particular implementation. I agree with the priorities and the arguments for picking them. If we were to add anything, I would consider pySHACL. The reason is that a lot of TSOs are starting to use Python for Power Engineers. In addition, Nick Car is a core developer. They have also boasted that they have the most complete coverage of the SHACL rules.

HarisVranaj commented 1 month ago

I have some suggestions from Erik for benchmarks of SHACL/SPARQL validators.

https://github.com/oxigraph/oxigraph/blob/main/bench/README.md

https://github.com/ad-freiburg/qlever/wiki/QLever-performance-evaluation-and-comparison-to-other-SPARQL-engines

Oxigraph is now optimised for memory usage (it no longer uses the rocksdb engine when used in memory), which on Erik's machine makes it 4 times faster than earlier versions (since this includes unzipping the file, real performance will be even better).

VladimirAlexiev commented 4 weeks ago

@HarisVranaj but do Oxigraph and QLever have SHACL implementations? Please post links so I can include them in https://github.com/VladimirAlexiev/awesome-semantic-shapes#shacl-validators and from there in https://github.com/w3c-cg/awesome-semantic-shapes

VladimirAlexiev commented 4 weeks ago

Note to self: https://mail.google.com/mail/u/0/#sent/QgrcJHsTgsbXhdCJwNqzTbwQHVhdRXDHtBB asked Treehouse for access to maplib SHACL.

VladimirAlexiev commented 3 weeks ago

@Sveino points out that rdf4j 5.0.0 and 5.0.3 have some SHACL improvements:

And more are planned to be completed by the time 5.0.3 is released.

GraphDB will upgrade to rdf4j 5.0 at the end of the year.

griddigit-ci commented 3 weeks ago

When I tried pySHACL back in January and tried to package ModShape, I had trouble. I was having performance issues. I was in touch with Nick at that time and there might be solutions, but I didn't have time to clean that up.

HarisVranaj commented 3 weeks ago

https://github.com/ad-freiburg/qlever https://github.com/oxigraph/oxigraph/tree/main

VladimirAlexiev commented 2 weeks ago

@HarisVranaj Do QLever and Oxigraph support SHACL? Please post links to documentation.

hmottestad commented 2 weeks ago

I'm also working on supporting the last of the SHACL path expressions, and this should be included in RDF4J 5.1.0 or 5.2.0: https://github.com/eclipse-rdf4j/rdf4j/pull/5131

I can also advertise that the RDF4J SHACL implementation supports incremental validation. If you have a large database and want to make a small change to your data, then the RDF4J SHACL engine will analyse your changes and only validate the affected target nodes.
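To make the idea concrete, here is a toy Python sketch of incremental validation. It is not how RDF4J implements it: triples are plain tuples, the only constraint is a minimum count on a direct property, and every name in it (`validate_full`, `validate_incremental`, the `ex:` nodes) is hypothetical. For rules that span multiple objects, the set of affected target nodes would have to be derived from the shape's paths rather than just the subjects of the changed triples.

```python
# Toy illustration of incremental validation; NOT RDF4J's actual algorithm.
# A graph is a set of (subject, predicate, object) tuples.

RDF_TYPE = "rdf:type"


def violates_min_count(graph, node, path, min_count):
    """Return True if `node` has fewer than `min_count` values for `path`."""
    values = {o for (s, p, o) in graph if s == node and p == path}
    return len(values) < min_count


def validate_full(graph, target_class, path, min_count=1):
    """Check every instance of `target_class` and return the violating nodes."""
    targets = {s for (s, p, o) in graph if p == RDF_TYPE and o == target_class}
    return {n for n in targets if violates_min_count(graph, n, path, min_count)}


def validate_incremental(graph, added, removed, previous_violations,
                         target_class, path, min_count=1):
    """Apply a change set and re-check only the subjects it touches."""
    new_graph = (graph - removed) | added
    affected = {s for (s, _, _) in added | removed}
    # Results for untouched nodes are carried over unchanged.
    violations = previous_violations - affected
    for node in affected:
        is_target = (node, RDF_TYPE, target_class) in new_graph
        if is_target and violates_min_count(new_graph, node, path, min_count):
            violations.add(node)
    return new_graph, violations


graph = {
    ("ex:line1", RDF_TYPE, "cim:ACLineSegment"),
    ("ex:line1", "cim:IdentifiedObject.name", "Line 1"),
    ("ex:line2", RDF_TYPE, "cim:ACLineSegment"),
}
violations = validate_full(graph, "cim:ACLineSegment", "cim:IdentifiedObject.name")
print(violations)  # {'ex:line2'}

# A small change: only ex:line2 is re-validated.
added = {("ex:line2", "cim:IdentifiedObject.name", "Line 2")}
graph, violations = validate_incremental(
    graph, added, set(), violations,
    "cim:ACLineSegment", "cim:IdentifiedObject.name",
)
print(violations)  # set()
```

The full pass touches every target node, while the incremental pass only looks at the subjects named in the change set, which is what makes it attractive when the graph is large and the changes are small.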

Sveino commented 2 weeks ago

@hmottestad Very good. Incremental or difference validation is extremely relevant since we have a lot of SHACL rules that go across multiple objects. The full graph is getting very big, and the changes are very limited. We have included the possibility to exchange differences since 2005 using CIMXML/RDFXML. We are now looking into how we can use JSON-LD to exchange this. See #53

hmottestad commented 2 weeks ago

RDF4J 5 has support for JSON-LD 1.1 with a customised version of Titanium JSON-LD that is considerably faster than the stock implementation that Jena is using.

I saw you were talking about DCAT, is your project related to Datakatalogen by any chance?

Sveino commented 1 week ago

I like fast code :-) The use of DCAT has two purposes. One is providing the header information on the dataset/named graph. We expect the same information to be linked to a Catalog so that the dataset/named graph can be found. So the second purpose is to support a data catalog (Datakatalog).

HarisVranaj commented 1 week ago

Reply from Erik: "They do not support it out of the box, only SPARQL. For them, SHACL needs to be translated into SPARQL. This one does: https://github.com/DataTreehouse/maplib. They say Python, but it's actually written in Rust with a Python API, and it can be used as a basis to create a full app in Rust." @Sveino can you give him access?
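On Erik's point that SHACL has to be translated into SPARQL for engines that only speak SPARQL: a minimal Python sketch of what such a translation could look like for a single `sh:minCount` property constraint. The function name and the `example.org` IRIs are hypothetical, and this is not how maplib or any other tool does it. It also ignores subclass reasoning (`rdfs:subClassOf*`) that `sh:targetClass` requires.

```python
def min_count_to_sparql(target_class: str, path: str, min_count: int = 1) -> str:
    """Build a SPARQL query that returns the focus nodes violating sh:minCount.

    Hypothetical helper for illustration; IRIs are inserted without escaping.
    """
    return (
        "SELECT ?focus (COUNT(DISTINCT ?value) AS ?n)\n"
        "WHERE {\n"
        f"  ?focus a <{target_class}> .\n"
        # OPTIONAL keeps focus nodes that have no value at all (count 0).
        f"  OPTIONAL {{ ?focus <{path}> ?value }}\n"
        "}\n"
        "GROUP BY ?focus\n"
        f"HAVING (COUNT(DISTINCT ?value) < {min_count})"
    )


query = min_count_to_sparql(
    "http://example.org/ACLineSegment",  # hypothetical class IRI
    "http://example.org/name",           # hypothetical property IRI
)
print(query)
```

The resulting query string could then be sent to any SPARQL endpoint; each row it returns is one violating focus node together with its actual value count.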

Sveino commented 1 week ago

@HarisVranaj I am not able to give access to the DataTreehouse GitHub, if that is what you wanted.

HarisVranaj commented 1 week ago

No, no, to this repository.

VladimirAlexiev commented 1 week ago

The repo is public, so Erik can post and comment