RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

Use Oxigraph as default RDFlib backend/store #184

Closed daniel-dona closed 1 year ago

daniel-dona commented 1 year ago

As a possible speed improvement, it could be great to use Oxigraph as the backend for RDFLib. This could speed up graph traversals as well as the SPARQL queries used in SHACL.

https://pyoxigraph.readthedocs.io/en/stable/

https://github.com/oxigraph/oxrdflib

ashleysommer commented 1 year ago

Hi @daniel-dona, I've done some testing and, as expected, I found that the Python in-memory graph in RDFLib is faster than Oxigraph.

Built-in PySHACL benchmark:

RDFLib Memory:
With no inferencing: 0.0028500853500008816 seconds
With rdfs inferencing: 0.01648246115000802 seconds
With owl-rl inferencing: 0.04122918072003813 seconds
With both inferencing: 0.22119805684000313 seconds

Oxigraph:
With no inferencing: 0.005864868260032381 seconds
With rdfs inferencing: 0.01864628946997982 seconds
With owl-rl inferencing: 0.04554426107999461 seconds
With both inferencing: 0.23374986606002493 seconds

Runtime of W3C SHACL Test Suite (SHT):

RDFLib Memory: 2.945 seconds

Oxigraph: 3.344 seconds

Runtime of Datashapes Core Test Suite (DASH):

RDFLib Memory: 1.894 seconds

Oxigraph: 2.287 seconds

Tested using Python 3.10.10, rdflib v6.3.2, oxrdflib v0.3.4, pyoxigraph v0.3.16
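For anyone who wants to repeat this kind of comparison on their own data, here is a minimal timing sketch. It is not the built-in PySHACL benchmark used for the numbers above; `mean_runtime` and `make_validation_run` are hypothetical helper names, and the Turtle inputs are whatever you supply. It assumes `rdflib` and `pyshacl` are installed, plus `oxrdflib` if you pass `store="Oxigraph"`.

```python
import timeit
from statistics import mean


def mean_runtime(fn, repeat=5, number=10):
    """Return the mean seconds per call of fn over repeat * number calls."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return mean(total / number for total in totals)


def make_validation_run(store, data_ttl, shapes_ttl, inference="none"):
    """Build a zero-argument callable that runs one SHACL validation.

    Hypothetical helper. The imports are local so that mean_runtime can be
    used without rdflib or pyshacl installed. `store` is an rdflib store
    plugin name such as "Memory" or "Oxigraph" (the latter needs oxrdflib).
    """
    from rdflib import Graph
    from pyshacl import validate

    data = Graph(store=store).parse(data=data_ttl, format="turtle")
    shapes = Graph().parse(data=shapes_ttl, format="turtle")
    return lambda: validate(data, shacl_graph=shapes, inference=inference)
```

Usage would look something like `mean_runtime(make_validation_run("Oxigraph", data_ttl, shapes_ttl, inference="rdfs"))`, run once per store and inference setting.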

These results are consistent with those I have seen before.

Back in 2018, when I was deep in the Rust ecosystem, I made a bunch of simple RDF/triplestore libraries in Rust and used a very early version of PyO3 to make Python bindings. I wrote lots of tests and benchmarks, but found that none performed better as an RDFLib store than the Python Memory graph in RDFLib. The bottleneck is in the transfer of data objects between Rust and Python. RDF operations using RDFLib involve moving a lot of string objects back and forth between the application and the store. In my tests I found that the overhead of converting these objects through PyO3 to Rust objects, and then converting the Rust results back to Python objects, was greater than any performance benefit gained from a faster store.
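The conversion overhead described above can be illustrated with a toy model in pure Python. This is only a sketch and does not use PyO3 or Rust: the hypothetical `BoundaryStore` encodes every term to bytes on the way in and decodes it on the way out, as a stand-in for translating objects across a language boundary, while `DictStore` keeps native Python strings.

```python
import timeit


class DictStore:
    """Triples kept as tuples of Python str in a set; no conversion."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def triples(self):
        return iter(self._triples)


class BoundaryStore:
    """Same layout, but terms are encoded on write and decoded on read.

    The encode/decode steps are a stand-in (an assumption of this sketch)
    for converting objects across a language boundary.
    """

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s.encode("utf-8"), p.encode("utf-8"), o.encode("utf-8")))

    def triples(self):
        for s, p, o in self._triples:
            yield s.decode("utf-8"), p.decode("utf-8"), o.decode("utf-8")


def fill(store, n=1000):
    """Add n made-up triples to the store and return it."""
    for i in range(n):
        store.add(f"http://example.org/s{i}", "http://example.org/p", f"value {i}")
    return store


def scan(store):
    """Iterate over every triple, as a graph traversal would, and count them."""
    return sum(1 for _ in store.triples())


if __name__ == "__main__":
    for cls in (DictStore, BoundaryStore):
        store = fill(cls())
        secs = timeit.timeit(lambda: scan(store), number=200)
        print(f"{cls.__name__}: {secs:.4f} s for 200 full scans")
```

Both stores return identical triples; the second one simply does extra work per term, so any gap in the timings is conversion cost rather than a difference in the underlying data structure.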

Eventually, in my testing back then, I was able to write an extremely minimal in-memory store backend in Rust that mimics the way the Python Memory store works. I optimised it to avoid as many string copies as possible, optimised the PyO3 bindings to reduce object translation and string copying as much as possible, and managed to get it to perform on par with the Python Memory store in RDFLib. So I concluded it's simply not worth the effort.

Pyoxigraph is also slower because it uses RocksDB as its storage layer. RocksDB is a modern, high-performance key-value DB, but it is not faster than a bare Python dict in this application. So with oxrdflib you have four layers of indirection between the application and the store: oxrdflib->pyoxigraph->PyO3->Oxigraph->RocksDB