KonradHoeffner / hdt

Library for the Header Dictionary Triples (HDT) compression file format for RDF data.
https://crates.io/crates/hdt
MIT License

triple pattern queries #3

Closed KonradHoeffner closed 2 years ago

KonradHoeffner commented 2 years ago

Right now there is only an option to iterate over all triples, which is inefficient for large graphs. Implement triple pattern queries and add tests.
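To make the request concrete, here is a minimal sketch of what a triple pattern query means, written as the naive full-scan baseline this issue wants to improve on. The names `Triple`, `Pattern` and `query` are hypothetical and are not part of this crate's API; a `None` component stands for a wildcard.

```rust
/// A triple of borrowed strings: (subject, predicate, object).
/// Hypothetical illustration type, not taken from the hdt crate.
type Triple<'a> = (&'a str, &'a str, &'a str);

/// A triple pattern: `None` is a wildcard, `Some(x)` must match exactly.
#[derive(Clone, Copy)]
struct Pattern<'a> {
    s: Option<&'a str>,
    p: Option<&'a str>,
    o: Option<&'a str>,
}

impl<'a> Pattern<'a> {
    /// True if every bound component of the pattern equals the triple's component.
    fn matches(&self, t: &Triple<'_>) -> bool {
        self.s.map_or(true, |s| s == t.0)
            && self.p.map_or(true, |p| p == t.1)
            && self.o.map_or(true, |o| o == t.2)
    }
}

/// Naive baseline: visit every triple and keep the matches.
/// This is linear in the graph size, which is the inefficiency described above.
fn query<'a>(triples: &[Triple<'a>], pattern: Pattern<'_>) -> Vec<Triple<'a>> {
    triples.iter().copied().filter(|t| pattern.matches(t)).collect()
}
```

Calling `query(&graph, Pattern { s: Some("alice"), p: None, o: None })` would return every triple with subject `alice`, but only after touching the whole graph. A real implementation would answer the same pattern from the compressed index instead.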

chrysn commented 2 years ago

Once these are in, this might make a good backend for SPARQL engines as well, especially for data sets exceeding RAM, which might then be efficiently usable without the need to build a dedicated, SPARQL-engine-specific index on top.

KonradHoeffner commented 2 years ago

Good point! This is even mentioned in the RDF HDT file format internals documentation. Quote:

Nevertheless, it does not support all kind of triple patterns directly on that structure, it is restricted to SPO, SP?, S?? and ??? queries. Thus, once the HDT-encoded dataset is loaded into the memory hierarchy, we slightly enrich the representation with additional succinct data structures to support the remaining triple patterns to be solved efficiently. The final fully queryable representation is called HDT-FoQ: HDT Focused on Querying. In the following we briefly present the main extensions. The technical specification of the structures can be found in our publication (Martínez-Prieto, Arias, and Fernández 2012).

However, I think this specific use case of a backend for a SPARQL engine with data exceeding RAM is already covered by the triple store library terminus-store, whose format is based on HDT. I have not used terminusdb-store, but it seems to have functions for persisting data on disk, whereas with this library you would have to implement all of that yourself.
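As a rough illustration of why the quoted documentation singles out SPO, SP?, S?? and ??? as directly supported: if triples are kept sorted in subject-predicate-object order, each of those patterns binds a prefix of the sort key, so its matches form one contiguous range that binary search can locate. The sketch below assumes a plain sorted slice of integer ID triples rather than HDT's actual compressed structures, and `IdTriple` and `prefix_query` are hypothetical names, not this crate's API.

```rust
/// Triple of dictionary IDs (subject, predicate, object).
/// Assumption: terms have already been mapped to integer IDs.
type IdTriple = (u32, u32, u32);

/// Answers a pattern on a slice sorted in SPO order.
/// Returns `None` when the pattern is not a prefix pattern
/// (?P?, ??O, ?PO, S?O), since those would need extra indexes or a full scan.
fn prefix_query(
    sorted: &[IdTriple],
    s: Option<u32>,
    p: Option<u32>,
    o: Option<u32>,
) -> Option<&[IdTriple]> {
    // Inclusive lower and upper bounds in lexicographic tuple order.
    let (lo, hi) = match (s, p, o) {
        (None, None, None) => return Some(sorted),
        (Some(s), None, None) => ((s, 0, 0), (s, u32::MAX, u32::MAX)),
        (Some(s), Some(p), None) => ((s, p, 0), (s, p, u32::MAX)),
        (Some(s), Some(p), Some(o)) => ((s, p, o), (s, p, o)),
        _ => return None,
    };
    // Two binary searches delimit the contiguous block of matches.
    let start = sorted.partition_point(|t| *t < lo);
    let end = sorted.partition_point(|t| *t <= hi);
    Some(&sorted[start..end])
}
```

A pattern such as ?P? has no bound subject, so its matches are scattered across the SPO order. That is the gap the additional HDT-FoQ structures mentioned in the quote are meant to close.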