DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License

Initial implementation for RDF-based tests QueryEvaluationTest #249

Closed Mec-iS closed 2 years ago

Mec-iS commented 2 years ago

From #248

This is a first draft of automated RDF tests. The first batch of tests from RDF-tests (basic) can be run from the project directory with pytest tests/rdf_tests/test_rdf_basic.py -k test_rdf_runner -s; the tests are in tests/rdf_tests/dat. For the moment, the only assertion implemented is a check on the length of the returned results. This is already interesting, as some kg query outputs return more or fewer bindings than expected.

The script should also work on oxigraph-tests, but it doesn't. It seems that the read_manifest file provided with rdflib cannot parse the oxigraph manifest; maybe there is a discrepancy in the XML structure? @Tpt, please provide some feedback.

Currently the tests are copy-pasted.

Tests with anomalies:

Tests resulting in "False" have a length discrepancy between the expected and actual output; tests with errors raised an exception.
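
The length-only assertion and the two kinds of anomaly can be sketched roughly as follows. This is a minimal illustration, not the code in tests/rdf_tests: check_length, run_query and too_deep are hypothetical names, and a "query" here is just any callable that returns an iterable of bindings.

```python
from typing import Callable, Iterable, Union


def check_length(
    run_query: Callable[[], Iterable],
    expected_rows: int,
) -> Union[bool, str]:
    """Compare only the number of returned bindings with the expected count.

    Returns True or False for the length comparison, or an "error: ..."
    string if running the query raised an exception.
    """
    try:
        actual_rows = len(list(run_query()))
    except Exception as exc:  # RecursionError is also caught here
        return f"error: {type(exc).__name__}"
    return actual_rows == expected_rows


def too_deep():
    # Hypothetical stand-in for a query that exceeds the recursion limit.
    raise RecursionError("maximum recursion depth exceeded")


outcomes = {
    "matching": check_length(lambda: [{"s": "a"}, {"s": "b"}], 2),
    "extra_binding": check_length(lambda: [{"s": "a"}, {"s": "b"}], 1),
    "raises": check_length(too_deep, 1),
}
print(outcomes)
```

A length check like this cannot tell whether the bindings themselves are right, only whether there are too many or too few of them.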

Tpt commented 2 years ago

By the way, Oxigraph does not pass all of the official SPARQL tests. This is mostly because Oxigraph's storage normalizes literals such as numbers. For example, "01"^^xsd:integer and "1"^^xsd:integer are considered to be the same. This causes some tests that check for duplicates to fail. The list of failing tests is here.
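
To see why normalization changes the number of results, here is a small sketch. It is only an illustration of the effect, not Oxigraph's implementation: normalize is a hypothetical helper, literals are modelled as plain (lexical form, datatype) tuples, and only xsd:integer is handled.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def normalize(literal):
    """Return a (lexical form, datatype) pair in canonical form.

    Only xsd:integer is handled here; other datatypes pass through unchanged.
    """
    lexical, datatype = literal
    if datatype == XSD_INTEGER:
        return (str(int(lexical)), datatype)
    return literal


rows = [("01", XSD_INTEGER), ("1", XSD_INTEGER)]

# Distinct values when the lexical forms are kept as written.
distinct_as_written = set(rows)
# Distinct values once both literals are mapped to the same canonical form.
distinct_normalized = {normalize(row) for row in rows}

print(len(distinct_as_written), len(distinct_normalized))
```

A test that expects both literals to survive as separate results would see one row fewer from a store that normalizes on ingestion.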

Mec-iS commented 2 years ago

Is there any straightforward way of making an rdflib SPARQL query return serialised data (ttl or xml) instead of the row iterator?

Or what is the usual way of testing the result of an rdflib query against a given ttl file? @ceteri

Tpt commented 2 years ago

RDFlib provides a parser for SPARQL results encoded in RDF. It then uses this function to check whether result sets are compatible.

I tried in Oxigraph to compare result sets encoded in RDF using the graph isomorphism algorithm. It was very slow because result sets encoded in RDF contain a lot of blank nodes connected only to other blank nodes, which makes the hash-based algorithms not very efficient...
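
One alternative to running graph isomorphism over the RDF encoding is to compare the bindings directly as multisets. The sketch below is a simplified illustration, not rdflib's or Oxigraph's comparison function: same_solutions is a hypothetical helper, each row is a plain dict from variable name to a term string, and blank nodes are not handled (their labels would need to be matched up to renaming).

```python
from collections import Counter


def same_solutions(expected, actual):
    """Order-insensitive comparison of two lists of binding rows.

    Duplicates are significant: the same row must occur the same
    number of times on both sides.
    """
    def as_multiset(rows):
        return Counter(frozenset(row.items()) for row in rows)

    return as_multiset(expected) == as_multiset(actual)


expected = [{"s": "ex:a", "o": "1"}, {"s": "ex:b", "o": "2"}]
reordered = [{"s": "ex:b", "o": "2"}, {"s": "ex:a", "o": "1"}]
duplicated = expected + [{"s": "ex:a", "o": "1"}]

print(same_solutions(expected, reordered))
print(same_solutions(expected, duplicated))
```

Hashing whole rows avoids the blank-node-heavy graph structure of the RDF encoding, at the cost of needing separate handling when the results themselves contain blank nodes.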

Mec-iS commented 2 years ago

kglab fails in some tests; most of the failures are "Python recursion limit exceeded".

ceteri commented 2 years ago

@Mec-iS this is excellent, and @Tpt thank you kindly!

ceteri commented 2 years ago

In terms of graph isomorphism algorithms, I wish there were more open source available for graph sketch algorithms. That might help with the costs. I've only found one, https://github.com/kenkoooo/graph-sketch-fractality, although it's based on a CLI and doesn't offer quite the similarity measures that we'd need.