turbolent opened this issue 1 year ago
@joka921 Would you be able to answer these questions?
@turbolent We are in the process of writing a paper on QLever's architecture and everything that makes it special compared to other engines. At a high level, the indexing proceeds as described below, and you can see all of these steps very transparently in the log:
The triples are stored and sorted in highly compressed form, which is very important to keep RAM and disk consumption low even for very large knowledge graphs. Uncompressed, one ID takes 8 bytes, so 1B quads (at 4 IDs each) would require 32 GB. For example, Wikidata has 30B quads (20B from the input data, plus 10B added internally by QLever for optimization purposes). So storing the quads once uncompressed would consume about 1 TB, and storing six permutations would consume about 6 TB. But QLever's total index size for Wikidata is only 0.5 TB.
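To spell out the arithmetic behind these numbers:

$$
30 \times 10^9 \ \text{quads} \times 4 \ \tfrac{\text{IDs}}{\text{quad}} \times 8 \ \tfrac{\text{bytes}}{\text{ID}} = 960 \ \text{GB} \approx 1 \ \text{TB},
\qquad
6 \times 960 \ \text{GB} \approx 6 \ \text{TB}.
$$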
One important general remark: at a high level, the various engines on the market, commercial and open-source alike, all do similar things. For example, all engines store permutations in one form or another. But at the mid and low levels, there are dozens of small and big algorithmic problems that need to be solved, and any given algorithm can be implemented more or less efficiently. Finding the best algorithm and implementing it as efficiently as possible on contemporary hardware is what we are particularly good at, and QLever is full of gems in this respect. This, plus the right design choices, is what ultimately gives QLever its edge over other engines.
Thank you for the great explanation, Hannah, I appreciate it! 👍
I'll have another look through the code with this context at hand. Best of luck with your paper; I'm looking forward to reading it!
Hi there!
I came across QLever while researching implementations of RDF graph databases.
Though I read the "Knowledge-Base Index" section of the CIKM'17 paper and the "Engines and indexing" section of the 2023 book chapter, and tried to read through `src/index` (especially `IndexImpl`), it is still a bit unclear to me what the high-level process is for indexing an input file of unsorted triples (e.g. a Turtle or N-Triples file). Could you please share the high-level steps that are performed, ideally leaving out optimizations like parallel processing?
For example, after parsing a triple from the input file, IDs are derived for the triple's parts. How are the individual permutations generated? Are all triples first written in ID representation to a temporary place (e.g. a temporary file or a memory-mapped data structure), and is that data then re-sorted and re-read for each permutation to create that permutation's index?
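To make my question more concrete, here is a rough sketch of the kind of pipeline I have in mind. All names here are made up, and this is only my guess at the approach, not QLever's actual implementation:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch; the names are mine, not QLever's.
using Id = std::uint64_t;
using IdTriple = std::array<Id, 3>;  // (subject, predicate, object) as IDs

// Step 1: dictionary-encode terms, i.e. map each distinct term to an ID.
class Vocabulary {
  std::unordered_map<std::string, Id> termToId_;

 public:
  Id getId(const std::string& term) {
    // Assigns the next sequential ID the first time a term is seen.
    return termToId_.try_emplace(term, termToId_.size()).first->second;
  }
};

// Step 2, per permutation: reorder the components of every ID triple
// (e.g. order {1, 0, 2} for PSO) and sort. In a real engine this would
// presumably be an external sort over a temporary file rather than an
// in-memory std::sort.
std::vector<IdTriple> makePermutation(std::vector<IdTriple> triples,
                                      const std::array<int, 3>& order) {
  for (auto& t : triples) {
    t = {t[order[0]], t[order[1]], t[order[2]]};
  }
  std::sort(triples.begin(), triples.end());
  return triples;  // would then be compressed and written to disk
}
```

Building all six permutations would then be six such sort passes over the temporarily stored ID triples. Is that roughly what happens?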
Such a high-level explanation might be useful as developer documentation and could help onboard new contributors.