ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

High-level overview of indexing #1056

Open · turbolent opened this issue 1 year ago

turbolent commented 1 year ago

Hi there!

I came across qlever while researching the implementations of RDF graph databases.

Though I read the "Knowledge-Base Index" section in the CIKM'17 paper and the "Engines and indexing" section of the 2023 book chapter, and tried to read through src/index (especially IndexImpl), it is still a bit unclear to me what the high-level process is for indexing an input file of unsorted triples (e.g. a Turtle or N-Triples file).

Could you please share some high-level steps that are performed, ideally leaving out optimizations like parallel processing?

For example, after a triple is parsed from the input file, IDs are derived for its parts. How are the individual permutations then generated? Are all triples first written in ID representation to a temporary place (e.g. a temporary file or a memory-mapped data structure), and is that data then re-sorted and re-read for each permutation to build that permutation's index?

Such a high-level explanation might be useful as developer documentation and could help onboard new contributors.

turbolent commented 1 month ago

@joka921 Would you be able to answer these questions?

hannahbast commented 1 month ago

@turbolent We are in the process of writing a paper on QLever's architecture and everything that makes it special compared to other engines. On a high level, the indexing proceeds as follows (you can see all of these steps very transparently in the log):

  1. Parse the input triples in batches, creating quads of IDs (quad = triple + graph ID), where the mapping from ID to IRI/literal is stored in a local vocabulary for that batch. You can see these files on disk while indexing.
  2. Merge the local vocabularies into one global vocabulary and remember the mapping from local to global IDs.
  3. Rewrite the local IDs to global IDs; at this point, we have all quads in ID space.
  4. Now sort these quads for the six permutations, in this order: SPOG & SOPG, OPSG & OSPG, PSOG & POSG (a minimal sketch of all four steps follows below).
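
To make these four steps concrete, here is a minimal single-process sketch in C++. All names in it (ParsedBatch, parseBatch, mergeVocabs, rewriteToGlobal, sortPermutation) are made up for illustration and do not correspond to QLever's actual classes in src/index; the real implementation streams batches to disk, runs in parallel, and keeps the data compressed throughout.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

using Id = uint64_t;
using IdQuad = std::array<Id, 4>;  // components 0..3 = S, P, O, G

struct ParsedBatch {
  std::vector<IdQuad> quads;            // quads in batch-local ID space
  std::vector<std::string> localVocab;  // local ID -> IRI/literal
};

// Step 1: parse one batch of string quads, assigning batch-local IDs.
ParsedBatch parseBatch(const std::vector<std::array<std::string, 4>>& rows) {
  ParsedBatch batch;
  std::unordered_map<std::string, Id> idOf;
  for (const auto& row : rows) {
    IdQuad quad;
    for (size_t i = 0; i < 4; ++i) {
      auto [it, isNew] = idOf.try_emplace(row[i], batch.localVocab.size());
      if (isNew) batch.localVocab.push_back(row[i]);
      quad[i] = it->second;
    }
    batch.quads.push_back(quad);
  }
  return batch;
}

// Step 2: merge the local vocabularies into one sorted global vocabulary and
// compute, for each batch, the mapping from local IDs to global IDs.
std::vector<std::string> mergeVocabs(
    const std::vector<ParsedBatch>& batches,
    std::vector<std::vector<Id>>& localToGlobal) {
  std::vector<std::string> globalVocab;
  for (const auto& batch : batches)
    globalVocab.insert(globalVocab.end(), batch.localVocab.begin(),
                       batch.localVocab.end());
  std::sort(globalVocab.begin(), globalVocab.end());
  globalVocab.erase(std::unique(globalVocab.begin(), globalVocab.end()),
                    globalVocab.end());
  for (const auto& batch : batches) {
    std::vector<Id> map(batch.localVocab.size());
    for (size_t i = 0; i < batch.localVocab.size(); ++i)
      map[i] = std::lower_bound(globalVocab.begin(), globalVocab.end(),
                                batch.localVocab[i]) -
               globalVocab.begin();
    localToGlobal.push_back(std::move(map));
  }
  return globalVocab;
}

// Step 3: rewrite all quads from local to global IDs.
std::vector<IdQuad> rewriteToGlobal(
    const std::vector<ParsedBatch>& batches,
    const std::vector<std::vector<Id>>& localToGlobal) {
  std::vector<IdQuad> all;
  for (size_t b = 0; b < batches.size(); ++b)
    for (IdQuad quad : batches[b].quads) {
      for (Id& id : quad) id = localToGlobal[b][id];
      all.push_back(quad);
    }
  return all;
}

// Step 4: sort a copy of the quads for one permutation; `order` gives the
// component priority, e.g. {0, 1, 2, 3} = SPOG and {2, 1, 0, 3} = OPSG.
std::vector<IdQuad> sortPermutation(std::vector<IdQuad> quads,
                                    std::array<int, 4> order) {
  std::sort(quads.begin(), quads.end(),
            [&](const IdQuad& a, const IdQuad& b) {
              for (int i : order)
                if (a[i] != b[i]) return a[i] < b[i];
              return false;
            });
  return quads;
}

int main() {
  std::vector<ParsedBatch> batches;
  batches.push_back(parseBatch({{{"<s1>", "<p>", "<o2>", "<g>"}},
                                {{"<s2>", "<p>", "<o1>", "<g>"}}}));
  batches.push_back(parseBatch({{{"<s1>", "<p>", "<o1>", "<g>"}}}));

  std::vector<std::vector<Id>> localToGlobal;
  auto vocab = mergeVocabs(batches, localToGlobal);
  auto quads = rewriteToGlobal(batches, localToGlobal);
  std::printf("%zu distinct words, %zu quads\n", vocab.size(), quads.size());

  auto spog = sortPermutation(quads, {0, 1, 2, 3});
  auto opsg = sortPermutation(quads, {2, 1, 0, 3});
  // ... and likewise for SOPG, OSPG, PSOG, POSG; each sorted copy would then
  // be compressed and written to disk as one permutation.
}
```

Note that the two orders within each pair in step 4 share their primary sort component (S, O, or P, respectively), which presumably lets the implementation derive both orders of a pair from closely related sorting passes.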

The triples are stored and sorted in highly compressed form, which is very important to keep the RAM and disk consumption low even for very large knowledge graphs. Uncompressed, one ID takes 8 bytes, so one quad takes 32 bytes and 1B quads require 32 GB. For example, Wikidata has 30B quads (20B from the input data, 10B added internally by QLever for optimization purposes). So storing the quads once uncompressed would consume almost 1 TB, and storing six permutations would consume almost 6 TB. But QLever's total index size for Wikidata is only 0.5 TB.
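
As a side note on why sorted ID data compresses so well: within a sorted permutation, consecutive IDs tend to differ by small gaps, so even generic techniques like delta encoding with variable-length bytes shrink the data a lot. The following toy encoder is a hypothetical illustration of that effect, not QLever's actual on-disk format:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy illustration (not QLever's actual format): store each ID as the delta
// to its predecessor, encoded in 7-bit groups with the high bit meaning
// "more bytes follow". Sorted columns have small deltas, hence few bytes.
std::vector<uint8_t> compressSortedIds(const std::vector<uint64_t>& ids) {
  std::vector<uint8_t> out;
  uint64_t previous = 0;
  for (uint64_t id : ids) {
    uint64_t delta = id - previous;  // non-negative because input is sorted
    previous = id;
    do {
      uint8_t byte = delta & 0x7F;
      delta >>= 7;
      if (delta != 0) byte |= 0x80;
      out.push_back(byte);
    } while (delta != 0);
  }
  return out;
}

int main() {
  // One million sorted IDs with a gap of 3, as in a dense permutation column.
  std::vector<uint64_t> ids;
  for (uint64_t i = 0; i < 1'000'000; ++i)
    ids.push_back(1'000'000'000 + 3 * i);
  auto compressed = compressSortedIds(ids);
  // Prints roughly "uncompressed: 8000000 bytes, compressed: 1000004 bytes".
  std::printf("uncompressed: %zu bytes, compressed: %zu bytes\n",
              ids.size() * sizeof(uint64_t), compressed.size());
}
```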

One important general remark: on a high level, the various engines on the market -- commercial and open-source -- all do similar things. For example, all engines store permutations in one form or another. But on the mid and low level, there are dozens of small and big algorithmic problems that need to be solved, and any given algorithm can be implemented more or less efficiently. Finding the best algorithm and implementing it as efficiently as possible on contemporary hardware is what we are particularly good at, and QLever is full of gems in this respect. This, together with the right design choices, is what ultimately gives QLever its edge over other engines.

turbolent commented 1 month ago

Thank you for the great explanation, Hannah, I appreciate it! 👍

I'll have another look through the code with this context at hand. Best of luck with your paper; I'm looking forward to reading it!