ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

qlever -wi fails #563

Closed WolfgangFahl closed 2 years ago

WolfgangFahl commented 2 years ago

For this issue, I ran the steps of the script as outlined in #562.

./qlever -e

operating system
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:    20.04
Codename:   focal
memory
              total        used        free      shared  buff/cache   available
Mem:           62Gi       1,4Gi       440Mi        30Mi        60Gi        60Gi
Swap:          18Gi       2,0Mi        18Gi
diskspace

/dev/sdd1      7751408932 2944647700 4416043588  41% ...

The steps

./qlever --clone
./qlever --clone
./qlever --wikidata_download 

have been successful as outlined in https://wiki.bitplan.com/index.php/WikiData_Import_2022-01-29.

./qlever -wi

fails with the message:

IndexBuilderMain: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by IndexBuilderMain)
IndexBuilderMain: /lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.13' not found (required by IndexBuilderMain)

I assume the docker build might need to use the older build file - might that be the culprit?
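For reference, one can check which symbol versions the system libstdc++ actually provides (a generic diagnostic, not specific to qlever):

strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep -E 'GLIBCXX_3\.4\.29|CXXABI_1\.3\.13'
# If this prints nothing, the libstdc++ in the environment is older than the one
# IndexBuilderMain was compiled against.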

WolfgangFahl commented 2 years ago

Same result with the other docker file I tried:

docker rmi qlever
wf@merkur:/hd/jurob/qlever$ rcsdiff qlever
===================================================================
RCS file: RCS/qlever,v
retrieving revision 1.4
diff -r1.4 qlever
82c82,83
<      date;docker build -t qlever .;date
---
>      # date;docker build -t qlever .;date
>      date;docker build --file Dockerfiles/Dockerfile.Ubuntu20.04 -t qlever .;date
./qlever -b
# finished after 1 min
./qlever -wi
IndexBuilderMain: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by IndexBuilderMain)
IndexBuilderMain: /lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.13' not found (required by IndexBuilderMain)
WolfgangFahl commented 2 years ago
docker --version
Docker version 19.03.13, build 4484c46d9d  

The docker version might have to become part of the qlever -e command's output.

hannahbast commented 2 years ago

Can you please try your index build with the image from https://hub.docker.com/r/adfreiburg/qlever?

Just do docker pull adfreiburg/qlever and in your function wikidata_index() replace --entrypoint bash qlever by --entrypoint bash adfreiburg/qlever
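In other words, something roughly like this (the volume mount and the inner command are placeholders, since the exact contents of the wikidata_index() function are not shown here):

docker pull adfreiburg/qlever
# Placeholder sketch of the changed run line inside wikidata_index():
docker run --rm -v "$PWD:/index" --entrypoint bash adfreiburg/qlever -c "cd /index && <index build command as before>"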

PS: Your script works just fine for me when I try it on one of our machines.

WolfgangFahl commented 2 years ago
./qlever -p
pulling qlever docker image started at So 30. Jan 08:22:59 CET 2022
Using default tag: latest
latest: Pulling from adfreiburg/qlever
ae6e1b672be6: Pull complete 
122115daf864: Pull complete 
772f7cd99382: Pull complete 
6e8537821fa1: Pull complete 
4b0f12be123f: Pull complete 
782a941aa682: Pull complete 
1358d001544d: Pull complete 
9ca319d76ec5: Pull complete 
Digest: sha256:036bde7c6e675c3cf34cfa3f87513af8a43dbd83be736ca83f21c244dfee91f2
Status: Downloaded newer image for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at So 30. Jan 08:23:59 CET 2022 after 60 seconds
WolfgangFahl commented 2 years ago
./qlever -aw
operating system
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:    20.04
Codename:   focal
docker version
Docker version 19.03.13, build 4484c46d9d
memory
              total        used        free      shared  buff/cache   available
Mem:           62Gi       1,3Gi        51Gi        30Mi        10Gi        60Gi
Swap:          18Gi          0B        18Gi
diskspace
/dev/sdb1        92G   54G   33G  63% /
tmpfs            32G     0   32G   0% /dev/shm
/dev/sdc1       1,8T  1,7T   96G  95% /hd/menki
/dev/sda1       5,5T  3,0T  2,2T  59% /hd/xatu
/dev/sdd1       7,3T  2,8T  4,2T  41% /hd/jurob
pulling qlever docker image started at So 30. Jan 08:26:32 CET 2022
Using default tag: latest
latest: Pulling from adfreiburg/qlever
Digest: sha256:036bde7c6e675c3cf34cfa3f87513af8a43dbd83be736ca83f21c244dfee91f2
Status: Image is up to date for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at So 30. Jan 08:26:33 CET 2022 after 1 seconds
wikidata dump already downloaded
creating wikidata index started at So 30. Jan 08:26:33 CET 2022
2022-01-30 07:26:34.594 - INFO:  QLever IndexBuilder, compiled on Jan 29 2022 20:19:12
2022-01-30 07:26:34.596 - INFO:  You specified the input format: TTL
2022-01-30 07:26:34.597 - INFO:  You specified "locale = en_US" and "ignore-punctuation = 1"
2022-01-30 07:26:34.597 - INFO:  You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files (see qlever/docs on GitHub)
2022-01-30 07:26:34.597 - INFO:  You specified "num-triples-per-partial-vocab = 50,000,000", choose a lower value if the index builder runs out of memory
2022-01-30 07:26:34.597 - INFO:  Processing input triples from /dev/stdin ..
..

I assume I'll have to wait ~20 hours now. Hopefully the job finishes before the machine switches itself off tomorrow at 5am - this machine usually only runs for its backup for some 30 minutes every night.

WolfgangFahl commented 2 years ago

The server barely responds any more. An ssh login takes about a minute ... The load average is at 24. The full 64 GB plus some swap is in use. Is there any chance at all with 64 GB RAM?

WolfgangFahl commented 2 years ago

I get

2022-01-30 07:26:34.597 - INFO:  Processing input triples from /dev/stdin ...
creating wikidata index finished at So 30. Jan 08:39:19 CET 2022 after 766 seconds

but no success or failure message. A .disk file was created; according to du, only 2.7 GB of it is actually in use:

-rw-r----- 1 wf docker 1048576000000 Jan 30 08:37 wikidata-stxxl.disk
du -sm wikidata-stxxl.disk 
2673    wikidata-stxxl.disk
WolfgangFahl commented 2 years ago

I now restarted with a 5x lower num-triples-per-partial-vocab setting:

You specified "num-triples-per-partial-vocab = 10,000,000", choose a lower value if the index builder runs out of memory
WolfgangFahl commented 2 years ago

This time things look better:

2022-01-30 07:51:06.170 - INFO:  Processing input triples from /dev/stdin ...
2022-01-30 07:56:38.806 - INFO:  Input triples processed: 100,000,000
2022-01-30 08:01:38.844 - INFO:  Input triples processed: 200,000,000
2022-01-30 08:06:15.734 - INFO:  Input triples processed: 300,000,000
2022-01-30 08:11:00.233 - INFO:  Input triples processed: 400,000,000
WolfgangFahl commented 2 years ago

Memory usage is at 12-17 GB at 2 billion indexed triples.

hannahbast commented 2 years ago

There is one more (theoretical) problem with your script, since your input stream consists of several concatenated files, each with prefix declarations at the beginning.

Namely, QLever currently parses the input stream in parallel and assumes that all relevant PREFIX declarations come at the beginning. For the two Wikidata files this works, because they both contain the complete set of Wikidata PREFIXes (30 of them) at the beginning. For the general case, you need something like the following. Also note the use of lbzcat: it's like bzcat, but works in parallel and is hence faster.

( for TTL in *.ttl.bz2; do lbzcat $TTL | head -1000 | grep ^@prefix; done | sort -u && for TTL in *.ttl.bz2; do lbzcat $TTL | grep -v ^@prefix; done )
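The same idea, expanded with comments (the && of the one-liner is replaced by plain sequencing):

(
  # First pass: collect the distinct @prefix declarations from the head of each file.
  for TTL in *.ttl.bz2; do
    lbzcat "$TTL" | head -1000 | grep '^@prefix'
  done | sort -u
  # Second pass: stream all files again with their @prefix lines stripped, so that
  # all prefixes appear exactly once at the very beginning of the combined stream.
  for TTL in *.ttl.bz2; do
    lbzcat "$TTL" | grep -v '^@prefix'
  done
)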
WolfgangFahl commented 2 years ago

@hannahbast thx for the hint. Besides Wikidata there are some datasets that I also imported into Jena and might need for my work, such as https://data.dnb.de/opendata/ - I'll try those as soon as I get the Wikidata import to succeed.

hannahbast commented 2 years ago

> @hannahbast thx for the hint. Besides Wikidata there are some datasets that I also imported into Jena and might need for my work, such as https://data.dnb.de/opendata/ - I'll try those as soon as I get the Wikidata import to succeed.

As far as I can see, they are all small. Do you want to set up separate instances for those or do you want to index them together with Wikidata, in one large instance?

By the way, it looks to me like you are also interested in text from Wikipedia. QLever offers several possibilities for that. One is to simply add one triple per Wikipedia article, linking it to the corresponding Wikidata entity (or several triples if you want to break up the Wikipedia articles into paragraphs). Another is a deeper link between the two, which allows you to search for co-occurrence of entities from Wikidata (constrained by an arbitrary SPARQL query) with certain words in Wikipedia. We call this SPARQL+Text search. Here is an example query: https://qlever.cs.uni-freiburg.de/wikidata-test/dYoK2J

WolfgangFahl commented 2 years ago

Unfortunately I did not read the "stay with the config" advice in time and started an attempt with 30,000,000 as the setting on the native Ubuntu machine, which failed after a few hours with no further notice - I assume memory errors are not caught. The memory need was at some 50 GB with that config. So I am going back to the 10,000,000 setting and retrying.

2022-01-30 19:42:20.560 - INFO:  Input triples processed: 6,200,000,000
WolfgangFahl commented 2 years ago

Similar outcome on the other (macOS) machine:

./qlever -e
operating system
ProductName:    Mac OS X
ProductVersion: 10.13.6
BuildVersion:   17G14033
docker version
Docker version 19.03.13, build 4484c46d9d
memory
PhysMem: 64G
diskspace
/dev/disk1s2   5.5Ti  4.0Ti  1.4Ti    74%   28267 4294939012    0%   /Volumes/Quaxo

./qlever -wa
...
2022-01-30 22:52:04.121 - INFO:  Input triples processed: 4,200,000,000
creating wikidata index finished at Mon Jan 31 00:13:57 CET 2022 after 35038 seconds

Some kind of abort without a hint at what the problem is - I assume we should see more than 13,000,000,000 triples being processed, in 5 phases.

WolfgangFahl commented 2 years ago

ulimit wasn't set on this machine yet. I assume that on the macOS machine it's better to retry with a VMware Ubuntu VM and a physical disk connection. I'll need some time to set such an environment up.

WolfgangFahl commented 2 years ago

Version 1.20 of the script shows that the ulimit command doesn't work as expected in the native Docker environment. In the VMware environment there are performance issues with disk access. I'll report on this in the context of the hardware requirements discussion.

qlever version : 1.20 $ : 2022/02/02 06:13:51 $
needed software
docker → /usr/local/bin/docker ✅
top → /usr/bin/top ✅
df → /bin/df ✅
jq → /opt/local/bin/jq ✅
sw_vers → /usr/bin/sw_vers ✅
operating system
ProductName:    Mac OS X
ProductVersion: 10.13.6
BuildVersion:   17G14033
docker version
Docker version 19.03.13, build 4484c46d9d
memory
PhysMem: 64G
diskspace
/dev/disk0s2   446Gi  374Gi   72Gi    84% 2981978 4291985301    0%   /
/dev/disk2s2   3.6Ti  2.1Ti  1.5Ti    59%     360 4294966919    0%   /Volumes/Owei
/dev/disk1s2   5.5Ti  5.0Ti  494Gi    92%   30591 4294936688    0%   /Volumes/Quaxo
./qlever: line 159: ulimit: open files: cannot modify limit: Invalid argument
soft ulimit for files
256
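For reference, the relevant commands look like this; the failure above is presumably because the requested value exceeds what macOS allows, and 1048576 is the value that comes up again later in this thread:

# Check the current soft and hard limits for open files ...
ulimit -Sn
ulimit -Hn
# ... and raise the soft limit in the shell that starts the index build
# (the value must not exceed the hard limit).
ulimit -Sn 1048576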
WolfgangFahl commented 2 years ago

On a second native Ubuntu machine with only 32 GB RAM the script fails with an error message:

qlever version : 1.20 $ : 2022/02/02 06:13:51 $
needed software
docker → /usr/bin/docker ✅
top → /usr/bin/top ✅
df → /usr/bin/df ✅
jq → /usr/bin/jq ✅
lsb_release → /usr/bin/lsb_release ✅
free → /usr/bin/free ✅
operating system
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:    20.04
Codename:   focal
docker version
Docker version 20.10.7, build 20.10.7-0ubuntu5~20.04.2
memory
              total        used        free      shared  buff/cache   available
Mem:           31Gi       534Mi        29Gi        40Mi       1,2Gi        30Gi
Swap:         2,0Gi          0B       2,0Gi
diskspace
/dev/sda2       228G   34G  184G  16% /
tmpfs            16G     0   16G   0% /dev/shm
/dev/sda1       511M  5,3M  506M   2% /boot/efi
/dev/sdd1       3,6T  2,0T  1,5T  59% /hd/riakob
/dev/sdc1       1,8T  1,8T   23G  99% /hd/wendi
soft ulimit for files
1048576
clone of clever-code already available
pulling qlever docker image started at Mi 2. Feb 07:42:32 CET 2022
Using default tag: latest
latest: Pulling from adfreiburg/qlever
Digest: sha256:036bde7c6e675c3cf34cfa3f87513af8a43dbd83be736ca83f21c244dfee91f2
Status: Image is up to date for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at Mi 2. Feb 07:42:34 CET 2022 after 2 seconds
wikidata dump already downloaded
creating wikidata index started at Mi 2. Feb 07:42:34 CET 2022
2022-02-02 06:42:35.645 - INFO:  QLever IndexBuilder, compiled on Jan 29 2022 20:19:12
2022-02-02 06:42:35.664 - INFO:  You specified the input format: TTL
2022-02-02 06:42:35.666 - ERROR: ASSERT FAILED (f.is_open(); in ../src/index/Index.cpp, line 1358, function void Index::initializeVocabularySettingsBuild() [with Parser = TurtleParserAuto])
creating wikidata index finished at Mi 2. Feb 07:42:35 CET 2022 after 1 seconds
joka921 commented 2 years ago

Ok,

  1. This error message is bad; improving it is somewhere on our long list of TODOs.
  2. This error means that QLever could not open the settings file that was passed to the IndexBuilderMain via the -s option. It might not be present, or it might not be readable by the user running the IndexBuilderMain. A quick check for this is sketched right below.
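A minimal check along these lines (the file name is the one that appears later in this thread; adjust it to whatever is actually passed via -s):

SETTINGS=wikidata.settings.json   # adjust to the file actually passed via -s
if [ ! -r "$SETTINGS" ]; then
  echo "settings file $SETTINGS is missing or not readable" >&2
  exit 1
fi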
WolfgangFahl commented 2 years ago

Indeed, I had to fix the script to check for the availability of the JSON settings file separately from the checkout to avoid this error. This is also the reason for my intention to use jq, so that the settings can be adapted via a command-line parameter in the future.
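For example, lowering the batch size could then look like this (a sketch; the key name is taken from the index builder log above and may differ in other settings files):

# Sketch: adjust the batch size in the settings file via jq.
jq '."num-triples-per-partial-vocab" = 10000000' wikidata.settings.json > wikidata.settings.json.tmp \
  && mv wikidata.settings.json.tmp wikidata.settings.json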

WolfgangFahl commented 2 years ago

Currently there are two index processes running on two of my machines that have not failed yet.

  1. on a 32 GB Ubuntu machine with a 4 TB SSD the indexer is at 14.3 billion triples in the first phase after running for 10 hours.
  2. on a 64 GB Ubuntu machine the "Triple conversion" phase is at 3.3 billion triples after some 49 hours.

I intend to move the 4 TB SSD to the 64 GB macOS machine, and if 32 GB RAM should suffice I'd have 3 machines to work with. If the base hardware and software needs are higher, I'll have to get a new mainboard and RAM for my personal environment and ask for better hardware at my lab to later do the runs as part of my research.

WolfgangFahl commented 2 years ago

https://wiki.bitplan.com/index.php/WikiData_Import_2022-01-29 shows the final success after switching to a 128 GB RAM server

WolfgangFahl commented 2 years ago

https://wiki.bitplan.com/index.php/WikiData_Import_2022-05-22 ended with the prefix tree generation. The script ended with the message:

creating wikidata index finished at Di 24. Mai 03:16:02 CEST 2022 after 44625 seconds

and the files:

ls -l 
total 3630709816
-rwx------  1 wf  staff            911 22 Mai 17:47 Qleverfile
drwx------  1 wf  staff          16384 23 Mai 08:07 RCS
-rwx------  1 wf  staff    94653250500 19 Mai 07:35 latest-all.ttl.bz2
-rwx------  1 wf  staff      327629685 21 Mai 01:28 latest-lexemes.ttl.bz2
-rwx------  1 wf  staff  1772814860288 24 Mai 03:01 wikidata-stxxl.disk
-rwx------  1 wf  staff          38003 24 Mai 03:15 wikidata.index-log.txt
-rwx------  1 wf  staff             40 23 Mai 14:52 wikidata.settings.json
-rwx------  1 wf  staff   197180847112 23 Mai 22:25 wikidata.tmp.for-prefix-compression..vocabulary.internal
-rwx------  1 wf  staff   219462801911 24 Mai 00:59 wikidata.vocabulary.internal

It seems no prefix compression has been done - is that a separate step now?

qlever index-stats

This is the "qlever" script, call without argument for help

Executing "index-stats":

readarray -t T < <(sed -En '/INFO:  (Processing|Done, total|Converting triples|Creating|Index build|Text index build)/p' wikidata.index-log.txt | cut -c1-19)

Missing key lines in "wikidata.index-log.txt"
hannahbast commented 2 years ago

@WolfgangFahl How much main memory did that machine have? When the indexer ends prematurely, it's almost always because it runs out of memory (if the operating system's out-of-memory-manager killed the process, that can happen without an error message) or because of a too low number for ulimit -Sn (then you get an error message which says something in that vein).

Note that the qlever script now supports setting both num-triples-per-batch and STXXL_MEMORY_GB in the Qleverfile. However, QLever's defaults (10000000 = 10 million for the former and 10 GB for the latter) are fine for Wikidata on a machine with at least 64 GB of RAM (which isn't much by today's standards).
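For illustration, the corresponding Qleverfile entries might look roughly like this; the exact variable names and syntax are an assumption and should be checked against a Qleverfile generated by . qlever wikidata, and the full SETTINGS_JSON (shown further below) also sets the important prefixes-external:

STXXL_MEMORY_GB = 10
SETTINGS_JSON = '{ ..., "num-triples-per-batch": 10000000 }'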

WolfgangFahl commented 2 years ago

The machine had 128 GB RAM and was the same one that successfully ran the January attempt. In fact, the January attempt indexed on a 64 GB machine. IMHO it is a bug if the indexer silently crashes without notice when running out of memory, and we'll not know the reason for crashes if the software is not robust enough to monitor its own state and know why a failure is imminent or has happened.

After 5 attempts in 5 months, of which 4 were unsuccessful, I don't have a running qlever Wikidata index, since the successful index from January is unusable with the current state of the software and was already deleted. For me this is all a bit frustrating given the effort each attempt needs. I am hoping for a version of the software and instructions that will allow for a successful attempt again.

joka921 commented 2 years ago

Hi, there is only 1 element in the external vocabulary, so the prefix compression runs out of memory. I agree that we could try to handle this more gently. The problem in your concrete case seems to be that the wikidata.settings.json has the wrong contents. The "official" one sets appropriate externalization and also a default locale (the locale is not super important, but it helped me confirm my theory). Can you send us the wikidata.settings.json you used?

@hannahbast Is this file altered by the Qleverfile, and how is it possible to lose the default externalization there?

hannahbast commented 2 years ago

@WolfgangFahl @joka921 The 'qlever' script also supports a variable for the settings.json. If you do . qlever wikidata in a directory without a Qleverfile, you get a Qleverfile with, in particular, the following line:

SETTINGS_JSON = '{ "languages-internal": ["en"], "prefixes-external": [ "<http://www.wikidata.org/entity/statement", "<http://www.wikidata.org/value", "<http://www.wikidata.org/reference" ], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 25000000 }'

With the qlever script, building a Wikidata index from a fresh Wikidata dump is really easy now. All you have to do is something like this (which doesn't need any supervision - you just type it and wait for a day; I have done it many times already):

mkdir wikidata-latest && cd wikidata-latest
. qlever wikidata
qlever download-data index
hannahbast commented 2 years ago

@joka921 Can you briefly explain again why it's hard for a C++ program to avoid being killed by the out-of-memory manager? I think that many people imagine that it's just a matter of asking "how much memory is left on the machine" and if not enough memory is left, adjusting accordingly.

Is the problem, in a nutshell, that there is no reliable way to get a truthful answer to the question "how much memory is left on the machine"?

hannahbast commented 2 years ago

Maybe Wolfgang was still using the old instructions on https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md? They do include a correct wikidata.settings.json, but with a rather large batch size of 50M. I just created a PR which changes that to 10M and includes the ulimit -Sn 1048576 in the command line for building the index. Please check and approve.

joka921 commented 2 years ago

Okay,

  1. The problem, especially on Linux, is the so-called "overcommit" of memory, aka "malloc never fails". If you allocate too much memory, the OS will not complain but will give you a valid pointer to some (virtual!) address. Only once you actually write to too many of the memory pages you have allocated does the OS at some point say "this is too much memory that is ACTUALLY used (not only allocated), I will kill you now" (alternatively, the system starts swapping and becomes unusably slow if the OOM killer is not aggressive enough). This behavior can be deactivated somewhere in the kernel (you then should get a nullptr from malloc or std::bad_alloc from C++-style allocations), but this is not typical, for performance reasons. A sketch of how to inspect the relevant kernel setting follows after this list.

  2. Other software also has this issue; Virtuoso also silently crashes if you allow it to use more memory than you have.

  3. Wolfgang somewhat has a point here: the reporting of the errors could be much better in the IndexBuilder.

  4. The index with Wolfgang's settings would have been unusable anyway, because the internal vocabulary is too large (216 GB if I read that correctly) due to the missing externalization.
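A sketch of how to inspect this on Linux (standard procfs and kernel-log interfaces, nothing QLever-specific):

# 0 = heuristic overcommit (the default), 1 = always overcommit, 2 = strict accounting.
cat /proc/sys/vm/overcommit_memory
# After a silent crash of the index builder, check whether the OOM killer was involved.
dmesg | grep -iE 'out of memory|oom'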

WolfgangFahl commented 2 years ago

Thank you for the lively discussion. Please reply at https://github.com/ad-freiburg/qlever/discussions/668 for how I should proceed.

WolfgangFahl commented 2 years ago

https://wiki.bitplan.com/index.php/WikiData_Import_2022-06-24 reports the latest attempt.

The log file ends with

2022-06-25 09:42:57.517 - INFO: Creating a pair of index permutations ...

https://wiki.bitplan.com/index.php/WikiData_Import_2022-06-24#Resulting_files shows the results.

I fear this is a failed attempt again.

What should be in the log on success, and what should the resulting files look like?

joka921 commented 2 years ago

Thanks for the info and the log. You could use this index with the "only load PSO and POS" option; then you can't formulate queries where the predicate is a variable. Please try this out.

Can you quickly check how much space is left on your hard disk/SSD? We have to figure out why this attempt failed (RAM should not be the issue in this last phase).

It seems like the stxxl disk is already at 1.7 TB, if I read that correctly. In general, QLever requires quite some space on the hard disk while building the index.

But on the bright side: it was almost finished and only took a single day and not 4 :)

WolfgangFahl commented 2 years ago

Indeed the hard disk is full:

df
/dev/sda1      3844660232 3649293016         0 100% /hd/seel

So I assume a check for minimum disk space is due before starting the procedure. I wonder why the February attempt was successful - was the increase of triples in Wikidata itself from February to May the cause, or do the files created by qlever need more space by now? I could happily run the indexer on a 10 TB rotating disk if that is the better option...

joka921 commented 2 years ago
  1. The 10 TB rotating disk sounds good; that is also what we do.
  2. The memory consumption of the IndexBuilder went up because we traded it for time (1 day vs. 4 days in your case). Note that the additional space is only temporary space that is not needed anymore after the IndexBuilder has finished.
  3. I understand your wish for a precheck; I gave you something like that in person ("only 2 TB of free disk space, that might become an issue"). This is harder than it seems, though, because the space requirement depends (mostly) on the number of triples, which is not known in advance. What would work is that you say "I know that Wikidata has about 20B triples" and THEN the system might say "2 TB is probably too low". A possible sketch of such a precheck follows after this list.
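Something along these lines, for example (a sketch assuming GNU df; the required amount is a placeholder that has to come from such a triple-count estimate):

REQUIRED_GB=4000   # placeholder; 4 TB is mentioned below as more than enough for Wikidata
INDEX_DIR=.        # directory where the index is being built
AVAIL_GB=$(( $(df --output=avail -k "$INDEX_DIR" | tail -1) / 1024 / 1024 ))
if [ "$AVAIL_GB" -lt "$REQUIRED_GB" ]; then
  echo "only ${AVAIL_GB} GB free in $INDEX_DIR, at least ${REQUIRED_GB} GB recommended" >&2
fi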
hannahbast commented 2 years ago

4 TB is more than enough to build Wikidata using QLever. The question is: how many TB were free on your machine, to begin with?

A general remark: It is certainly interesting to have your experiences with building Wikidata on machines with very tight resources. And QLever is indeed the engine of choice for that. But to put things into perspective, in all the discussions about the new backend for the Wikidata Query Service, folks are considering huge servers and even server farms to be able to cope with the sheer size of the data. You are on the opposite side of that spectrum.

hannahbast commented 2 years ago

Moved this to https://github.com/ad-freiburg/qlever/discussions/668, since a large part of the problems are specific to the setup of a particular user.