MillenniumDB / MillenniumDB

Property Graph and RDF engine, still in development
GNU General Public License v2.0

Comparison of MillenniumDB against QLever and Virtuoso #10

Open hannahbast opened 2 years ago

hannahbast commented 2 years ago

Dear authors, I stumbled upon your paper last Friday and read it over the weekend. Here are some comments and questions:

  1. We have also developed an open-source engine for RDF data. It's called QLever and has been around for several years. You can find a demo (featuring live search on the complete Wikidata, with autocompletion) at https://qlever.cs.uni-freiburg.de. On that page, you will also find links to various publications (the first one from CIKM'17) and to the code on GitHub. Note that QLever has evolved a lot since the CIKM'17 paper.

  2. QLever is based on similar ideas as MillenniumDB. I think you should mention it in your paper and include it in your evaluation. QLever is very easy to set up, even for the complete Wikidata (indexing time < 24 hours). Instructions can be found on the GitHub page (a minimal sketch follows after this list) and we will be happy to help. And maybe we should talk about joining forces instead of developing two competing open-source engines based on similar ideas? Especially since developing such an engine to maturity is a lot of work, as we know from experience.

  3. In Section 5.1, you claim that your engine is 30 times faster than Virtuoso for very simple queries (consisting of a single triple pattern). We know Virtuoso very well and have compared it with QLever extensively. Virtuoso is a very mature and efficient engine and hard to beat, even on more complex queries. There are natural barriers to what can be achieved, and Virtuoso often (though not always) does the optimal thing. I think one of two things happened in your evaluation: either you did not configure Virtuoso optimally, or you stumbled upon the following artifact. Namely, Virtuoso is rather slow when it has to produce a very large output. That is not a weakness of its query processing engine, but of the way it translates its internal IDs to output IRIs and literals.

  4. We would like to try MillenniumDB ourselves on the complete Wikidata. Can you provide us with information on how to convert a dump of the complete Wikidata to the format your engine requires? Also see https://github.com/MillenniumDB/MillenniumDB/issues/9
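For reference, here is a minimal sketch of the docker-based QLever setup mentioned in point 2. The repository URL and the image name "qlever" are taken from commands that appear later in this thread; the authoritative steps are in the quickstart on GitHub.

# Sketch only: clone the repository and build the docker image named "qlever",
# i.e. the image used in the docker run commands further down in this thread.
git clone https://github.com/ad-freiburg/qlever.git
cd qlever
docker build -t qlever .

After that, an index is built from a Turtle/N-Triples file and the server is started with docker run, as shown in the comments below.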

cirojas commented 2 years ago

Hi, QLever looks interesting and we would like to run our benchmark on it. However, I ran into some trouble and some help would be appreciated.

I was not able to build the project on my personal computer (Linux Mint 20, based on Ubuntu 20.04) or on our server (Devuan 3, based on Debian 10), but the docker build seems to work. If there is a noticeable performance difference between the native and docker builds, I would like to work with the native build (maybe you could provide a generic x86_64 glibc build?).

Then I tried to import the filtered Wikidata version we used for our benchmark (~1,200 million triples). I did these steps:
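(Roughly the following; this is a reconstruction, not the literal command: the -F nt flag is confirmed later in this thread, but the other IndexBuilderMain options are assumptions based on the QLever quickstart of that time.)

# Reconstructed sketch: flags other than -F are assumptions, not a verbatim record.
docker run -it --rm -v $QLEVER_HOME/qlever-indices/wikidata:/index --entrypoint bash qlever \
    -c "cd /index && cat filtered_truthy.nt | IndexBuilderMain -F nt -f - -i wikidata -s wikidata.settings.json | tee wikidata.index-log.txt"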

After a few minutes the process ended without reporting any errors (I was amazed at how fast it was).

The wikidata.settings.json used is:

{
  "languages-internal": ["en"],
  "prefixes-external": [
    "<http://www.wikidata.org/entity/",
    "<http://www.wikidata.org/prop/direct/"
  ],
  "locale": {
    "language": "en",
    "country": "US",
    "ignore-punctuation": true
  },
  "ascii-prefixes-only": true,
  "num-triples-per-partial-vocab": 50000000
}

The output of the import is:

2021-12-14 15:22:24.239 - INFO:  IndexBuilderMain, version Dec 13 2021 22:16:51
2021-12-14 15:22:24.241 - INFO:  Parsing from stdin.
2021-12-14 15:22:24.241 - INFO:  Reading from uncompressed NTriples file/dev/stdin
2021-12-14 15:22:24.243 - INFO:  Using Locale en US with ignore-punctuation: 1
2021-12-14 15:22:24.256 - WARN:  You specified the ascii-prefixes-only but a parser that is not the Turtle stream parser. This means that this setting is ignored.
2021-12-14 15:22:24.256 - INFO:  Overriding setting num-triples-per-partial-vocab to 50,000,000 This might influence performance / memory usage during index build.
FOXXLL v1.4.99 (prerelease/Release)
foxxll: Disk 'wikidata-stxxl.disk' is allocated, space: 1000000 MiB, I/O implementation: syscall queue=0 devid=0
2021-12-14 15:22:44.974 - INFO:  Lines (from KB-file) processed: 10,000,000
2021-12-14 15:22:56.410 - INFO:  Lines (from KB-file) processed: 20,000,000
2021-12-14 15:23:09.329 - INFO:  Lines (from KB-file) processed: 30,000,000
2021-12-14 15:23:21.408 - INFO:  Lines (from KB-file) processed: 40,000,000
2021-12-14 15:23:34.455 - INFO:  Lines (from KB-file) processed: 50,000,000
2021-12-14 15:23:34.477 - INFO:  Lines (from KB-file) processed: 50,000,000
2021-12-14 15:23:34.477 - INFO:  Actual number of Triples in this section (include langfilter triples): 52,082,814
2021-12-14 15:30:18.285 - INFO:  Lines (from KB-file) processed: 60,000,000

To see the files created, I ran ls -lhs:

total 147G
146G -rw-rw-r-- 1 crojas  data 146G Dec 14 10:28 filtered_truthy.nt
4.0K -rw-r--r-- 1 icuevas data 1.4K Dec 14 12:30 wikidata.index-log.txt
4.0K -rw-rw-r-- 1 crojas  data  316 Dec 14 12:21 wikidata.settings.json
1.5G -rw-r----- 1 icuevas data 977G Dec 14 12:30 wikidata-stxxl.disk

Finally, I tried to start the engine:

sudo docker run --rm -v $QLEVER_HOME/qlever-indices/wikidata:/index -p 7001:7001 -e INDEX_PREFIX=wikidata --name qlever.wikidata qlever

But I get the following error:

ServerMain, version Dec 13 2021 22:16:57

Set locale LC_CTYPE to: C.UTF-8
2021-12-14 15:58:04.722 - INFO:  Initializing server...
2021-12-14 15:58:04.722 - ERROR: ASSERT FAILED (f.is_open(); in ../src/index/Index.cpp, line 1185, function void Index::readConfiguration())
jeremiahpslewis commented 2 years ago

FYI, I think that one way of making it easier to try out and experiment with both projects is having consistent binary builds. I have PRs which, once complete, will make it possible to install QLever and MillenniumDB binaries using the Julia packaging system on most platforms. (If the MillenniumDB team has a chance to look at the Cxx blockers, it would be much appreciated.) :)

hannahbast commented 2 years ago

@cirojas Thanks for asking, Carlos.

  1. We run all of our experiments with docker. It is faster without using docker, but we haven't benchmarked the difference in a while. Maybe something like 20%.
  2. We are using some new C++ features (in particular: co-routines), which are very useful, but not yet standard (but they will be soon). Hence the FROM ubuntu:21.10 as base at the top of the Dockerfile. When compiling on Ubuntu 20.04, you should see error messages like unrecognized command line option -fcoroutines-ts
  3. If the index builder crashes without any error message, you probably ran out of RAM. We do catch all exceptions in our code, but in a C++ binary failed allocations sometimes lead to a segmentation fault without throwing a std::bad_alloc. Without docker, you would at least see the seg fault, but docker often swallows that.
  4. How much RAM does your machine have? In any case, just reduce "num-triples-per-partial-vocab" to 20 million or even 10 million (see the snippet after this list). Indexing will be a bit slower, but not much. Let us know whether it worked.
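That is, change the last entry of wikidata.settings.json to something like the following (everything else can stay as it is):

  "num-triples-per-partial-vocab": 10000000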
hannahbast commented 2 years ago

@jeremiahpslewis We are big fans of docker because it makes running code on machines with different setups so much easier. Can you convince us why binary builds are a good idea on top of that? For the current version of the code, there is a slight complication in that it does not compile on a standard Ubuntu 20.04 because it uses some newer C++ features; see above.

jeremiahpslewis commented 2 years ago

Hey @hannahbast, I have absolutely nothing against docker; I think there are many use cases in which it is an absolute lifesaver. But if you have a platform like https://github.com/JuliaPackaging/Yggdrasil, which makes it possible to reliably cross-compile to a variety of platforms, you can cut out docker as an added layer of complexity and get a performance benefit. Especially if you are developing locally on a Mac, Docker can be a bit of a drag. And the OSS story with the new Docker for Mac licenses is another drag, though not relevant for academic research. I don't mean to claim particular expertise in driving this decision, but my practical experience going from using docker for most projects to using reproducible binaries (via Julia's package system) suggests there's a lot of potential. Ironically, the Yggdrasil project uses docker extensively in BinaryBuilder.jl.

tl;dr: no unambiguous story here, but for local development/testing/experimentation it would be lovely to run queries against QLever / MillenniumDB on an Apple Silicon Mac ;)

joka921 commented 2 years ago

@cirojas In addition to what @hannahbast said: if you want to run QLever locally on a machine with an older Ubuntu, try the commands from .github/workflows/cmake.yml. GitHub Actions currently only provides Ubuntu 20.04, so those commands should also work on your machine.

(The versions of Boost and g++ provided in 20.04 are too old; we use PPAs to install newer ones.)

And as @hannahbast already said: the IndexBuilder currently uses a lot of RAM, so I would rather suggest 5 or 10 million for num-triples-per-partial-vocab.

hannahbast commented 2 years ago

Thanks a lot @joka921! I executed the first two install lines from the Dockerfile and the five install lines from the cmake.yml, and then it worked on Ubuntu 20.04 when telling cmake to use g++-11 as the compiler. Here is the complete sequence:

sudo apt-get install -y build-essential cmake libicu-dev tzdata pkg-config uuid-runtime uuid-dev git
sudo apt-get install -y libjemalloc-dev ninja-build libzstd-dev
sudo apt-get install -y libicu-dev tzdata gcc-10 libzstd-dev libjemalloc-dev
sudo add-apt-repository -y ppa:mhier/libboost-latest && sudo apt update && sudo apt install -y libboost1.74-dev
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test && sudo apt update && sudo apt install -y gcc-11 g++-11
wget https://apt.llvm.org/llvm.sh && sudo chmod +x llvm.sh && sudo ./llvm.sh 13
sudo apt install -y libunwind-13-dev libc++abi-13-dev libc++-13-dev
cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=DEBUG -DUSE_PARALLEL=true -DCMAKE_CXX_COMPILER=g++-11 -GNinja ..
ninja
hannahbast commented 2 years ago

I just built an index with the latest version of QLever for a small knowledge graph with 362 million triples, without docker (as explained above) and with docker (as explained in https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md), both with num-triples-per-partial-vocab = 10M. The result, on an AMD Ryzen 9 5900X (12 cores / 24 threads, 3.7-4.8 GHz) with 128 GB RAM and an HDD (not SSD) RAID:

Without docker: 18.5 minutes
With docker: 20 minutes

So 1B triples should take about one hour either way (362 million triples in 18.5-20 minutes is roughly 1.1-1.2 billion triples per hour). There is potential for further speed-up, but 1B triples per hour is already pretty good.

jeremiahpslewis commented 2 years ago

Random question...any chance this can be sped up with a GPU?

hannahbast commented 2 years ago

Random question...any chance this can be sped up with a GPU?

It's more important to interleave IO and computation, which is tricky. If you look at QLever's index building code, you will see that it's pretty sophisticated.

cirojas commented 2 years ago

I tried modifying the parameter num-triples-per-partial-vocab, but I was not able to import our benchmark data. Using 50M and 40M ran out of RAM (we have 119 GB RAM + 32 GB swap). Using 10M, 20M and 30M gives the same error:

...
2021-12-16 02:36:22.342 - INFO:  Lines (from KB-file) processed: 1,100,000,000
2021-12-16 02:36:30.568 - INFO:  Writing vocabulary to binary file wikidata.partial-vocabulary29
2021-12-16 02:36:35.540 - INFO:  Done writing vocabulary to file.
2021-12-16 02:37:04.007 - INFO:  Writing vocabulary to binary file wikidata.tmp.compression_index.partial-vocabulary29
2021-12-16 02:37:09.021 - INFO:  Done writing vocabulary to file.
2021-12-16 02:37:32.256 - INFO:  Writing vocabulary to binary file wikidata.partial-vocabulary30
2021-12-16 02:37:37.542 - INFO:  Done writing vocabulary to file.
2021-12-16 02:38:05.699 - INFO:  Writing vocabulary to binary file wikidata.tmp.compression_index.partial-vocabulary30
2021-12-16 02:38:11.053 - INFO:  Done writing vocabulary to file.
2021-12-16 02:38:17.814 - ERROR: basic_string::_M_replace_aux

I tried loading the first 100M triples of the dataset and it worked fine, but in order to run our benchmark we need to load the whole dataset (~1,200M triples).

joka921 commented 2 years ago

@cirojas QLever currently uses a Turtle parser (the ascii-prefixes-only option) that has some stricter requirements than standard Turtle and performs very fast but often unchecked parsing. This typically works, because most reasonable datasets fulfill the requirements of this parser (prefixes only consist of ASCII characters, triples always start on a new line, etc.).

In your case, parsing the first 1.1 billion triples seems to work fine, but then some error occurs. It might be that our parser has a bug, that your input is valid .ttl but uses a corner case that makes our parser crash, or that your input violates the .ttl grammar.

With the crash being so close to the end of the file, can you check whether there is something strange at the end of your input?

And can you make your input file accessible to me so I can reproduce this error?

cirojas commented 2 years ago

@joka921 The file I used is available here: https://drive.google.com/file/d/1oDkrHT68_v7wfzTxjaRg40F7itb7tVEZ/view?usp=sharing

joka921 commented 2 years ago

@cirojas: Thanks for sending the file. The reason it was not working is that it was our fault that you were using our software wrongly :)

You specified -F nt, which triggers our really outdated N-Triples parser that is wrong in many corner cases. Please try again with -F ttl (the .nt format is a subset of the .ttl format); our Turtle parser is much closer to the specification. I have an IndexBuilder running with your file, and the parsing, which is typically the most critical step, is already done. So you should try again with this different parser. Sorry for wasting your time, and thanks for finding this issue.
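Concretely, that means re-running the index build with only the format flag changed, for example (as before, the options other than -F are assumptions, not the verbatim command):

# Sketch: same IndexBuilderMain invocation as before (inside or outside docker), only -F nt becomes -F ttl.
cat filtered_truthy.nt | IndexBuilderMain -F ttl -f - -i wikidata -s wikidata.settings.json | tee wikidata.index-log.txt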

Another note: your settings for "prefixes-external" are strange. All the IRIs that start with these prefixes are externalized.

"prefixes-external": [
    "<http://www.wikidata.org/entity/",
    "<http://www.wikidata.org/prop/direct/"
  ],

would externalize almost everything, and would harm QLever's performance when large results have to be exported. Since you also run your benchmark with the ID -> String table in RAM, you should not externalize anything on Wikidata truthy. (If your queries never export these entities, because they only retrieve human-readable labels via rdfs:label or schema:name etc., then externalizing these entities is fine, but it still shouldn't be necessary.)
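For the truthy benchmark, not externalizing anything would simply mean something like

"prefixes-external": [],

in the settings file.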

As a comparison, we externalize the following entities on Wikidata (complete):

"prefixes-external": [
       "<http://www.wikidata.org/entity/statement",
       "<http://www.wikidata.org/value",
       "<http://www.wikidata.org/reference"
     ],

(These are only abstract IDs, which one seldom uses in an output, and they do not occur in the "truthy" flavor of Wikidata.)

@hannahbast I will throw this old parser out soon; we now have proof that it does harm.

hannahbast commented 2 years ago

@joka921 Thanks! I always use -F ttl, even when parsing .nt files; that's why I never encountered this bug. I agree that we should throw out the old parser. I have not needed it in a long time.

cirojas commented 2 years ago

@joka921 thanks for your help, I was able to load the dataset successfully. I'll do some comparisons soon.