Closed by WolfgangFahl 2 years ago
Same result with the other Dockerfile. Here is what I did:
docker rmi qlever
wf@merkur:/hd/jurob/qlever$ rcsdiff qlever
===================================================================
RCS file: RCS/qlever,v
retrieving revision 1.4
diff -r1.4 qlever
82c82,83
< date;docker build -t qlever .;date
---
> # date;docker build -t qlever .;date
> date;docker build --file Dockerfiles/Dockerfile.Ubuntu20.04 -t qlever .;date
./qlever -b
# finished after 1 min
./qlever -wi
IndexBuilderMain: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by IndexBuilderMain)
IndexBuilderMain: /lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.13' not found (required by IndexBuilderMain)
docker --version
Docker version 19.03.13, build 4484c46d9d
The docker version check might have to be part of the qlever -e command.
Can you please try your index build with the image from https://hub.docker.com/r/adfreiburg/qlever? Just do docker pull adfreiburg/qlever and, in your function wikidata_index(), replace --entrypoint bash qlever by --entrypoint bash adfreiburg/qlever.
PS: Your script works just fine for me when I try it on one of our machines.
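A hypothetical sketch of what the changed call inside wikidata_index() could look like (the mounts and the command passed to bash are placeholders, not taken from the actual script):
# hypothetical sketch: verify the pulled image works with --entrypoint bash;
# a real call would run the index-build command instead of just printing the bash version
docker run -it --rm -v "$(pwd)":/index -w /index --entrypoint bash adfreiburg/qlever -c "bash --version | head -1"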
./qlever -p
pulling qlever docker image started at So 30. Jan 08:22:59 CET 2022
Using default tag: latest
latest: Pulling from adfreiburg/qlever
ae6e1b672be6: Pull complete
122115daf864: Pull complete
772f7cd99382: Pull complete
6e8537821fa1: Pull complete
4b0f12be123f: Pull complete
782a941aa682: Pull complete
1358d001544d: Pull complete
9ca319d76ec5: Pull complete
Digest: sha256:036bde7c6e675c3cf34cfa3f87513af8a43dbd83be736ca83f21c244dfee91f2
Status: Downloaded newer image for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at So 30. Jan 08:23:59 CET 2022 after 60 seconds
./qlever -aw
operating system
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
docker version
Docker version 19.03.13, build 4484c46d9d
memory
total used free shared buff/cache available
Mem: 62Gi 1,3Gi 51Gi 30Mi 10Gi 60Gi
Swap: 18Gi 0B 18Gi
diskspace
/dev/sdb1 92G 54G 33G 63% /
tmpfs 32G 0 32G 0% /dev/shm
/dev/sdc1 1,8T 1,7T 96G 95% /hd/menki
/dev/sda1 5,5T 3,0T 2,2T 59% /hd/xatu
/dev/sdd1 7,3T 2,8T 4,2T 41% /hd/jurob
pulling qlever docker image started at So 30. Jan 08:26:32 CET 2022
Using default tag: latest
latest: Pulling from adfreiburg/qlever
Digest: sha256:036bde7c6e675c3cf34cfa3f87513af8a43dbd83be736ca83f21c244dfee91f2
Status: Image is up to date for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at So 30. Jan 08:26:33 CET 2022 after 1 seconds
wikidata dump already downloaded
creating wikidata index started at So 30. Jan 08:26:33 CET 2022
2022-01-30 07:26:34.594 - INFO: QLever IndexBuilder, compiled on Jan 29 2022 20:19:12
2022-01-30 07:26:34.596 - INFO: You specified the input format: TTL
2022-01-30 07:26:34.597 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2022-01-30 07:26:34.597 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files (see qlever/docs on GitHub)
2022-01-30 07:26:34.597 - INFO: You specified "num-triples-per-partial-vocab = 50,000,000", choose a lower value if the index builder runs out of memory
2022-01-30 07:26:34.597 - INFO: Processing input triples from /dev/stdin ..
..
I assume I'll have to wait ~20 hours now. Hopefully the job finishes before the machine switches itself off tomorrow at 5am - this machine usually only runs for backups for some 30 minutes every night.
The server barely responds any more. An SSH login takes about a minute ... Load average is at 24. The full 64 GB plus some swap is in use. Is there any chance at all with 64 GB of RAM?
I get
2022-01-30 07:26:34.597 - INFO: Processing input triples from /dev/stdin ...
creating wikidata index finished at So 30. Jan 08:39:19 CET 2022 after 766 seconds
but no success or failure message. A .disk file with an on-disk size of 2.7 GB was created:
-rw-r----- 1 wf docker 1048576000000 Jan 30 08:37 wikidata-stxxl.disk
du -sm wikidata-stxxl.disk
2673 wikidata-stxxl.disk
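The two numbers presumably differ because the stxxl disk file is sparse: ls -l reports the apparent size, while du reports the blocks that were actually allocated. Both views can be compared with standard coreutils options:
ls -l wikidata-stxxl.disk                  # apparent size (~1 TB here)
du -h --apparent-size wikidata-stxxl.disk  # same apparent size, human-readable
du -h wikidata-stxxl.disk                  # blocks actually written (~2.7 GB here)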
I now restarted with a 5× lower setting:
You specified "num-triples-per-partial-vocab = 10,000,000", choose a lower value if the index builder runs out of memory
This time things look better:
2022-01-30 07:51:06.170 - INFO: Processing input triples from /dev/stdin ...
2022-01-30 07:56:38.806 - INFO: Input triples processed: 100,000,000
2022-01-30 08:01:38.844 - INFO: Input triples processed: 200,000,000
2022-01-30 08:06:15.734 - INFO: Input triples processed: 300,000,000
2022-01-30 08:11:00.233 - INFO: Input triples processed: 400,000,000
Memory usage is at 12-17 GB after 2 billion triples indexed.
There is one more (theoretical) problem with your script, since your input stream consists of several files concatenated, each with prefixes at the beginning.
Namely, QLever currently parses the input stream in parallel and assumes that all relevant PREFIX declarations come at the beginning. For the two Wikidata files, this works because they both contain the complete set of Wikidata PREFIXes (30 of them) at the beginning. For the general case, you need something like the command below. Also note the use of lbzcat: it's like bzcat, but works in parallel and is hence faster.
( for TTL in *.ttl.bz2; do lbzcat $TTL | head -1000 | grep ^@prefix; done | sort -u && for TTL in *.ttl.bz2; do lbzcat $TTL | grep -v ^@prefix; done )
@hannahbast thx for the hint. Besides Wikidata there are some datasets that I also imported into Jena which I might need for my work, such as https://data.dnb.de/opendata/ - I'll try those as soon as I get the Wikidata import to succeed.
As far as I can see, they are all small. Do you want to set up separate instances for those or do you want to index them together with Wikidata, in one large instance?
By the way, it looks to me like you are also interested in text from Wikipedia. QLever offers several possibilities for that. One is to simply add one triple per Wikipedia article, linking it to the corresponding Wikidata entity (or several triples if you want to break up the Wikipedia articles into paragraphs). Another is a deeper link between the two, which allows you to search for co-occurrence of entities from Wikidata (constrained by an arbitrary SPARQL query) with certain words in Wikipedia. We call this SPARQL+Text search. Here is an example query: https://qlever.cs.uni-freiburg.de/wikidata-test/dYoK2J
Unfortunately I did not read about the "stay with the config" advice in time and started an attempt with 30,000,000 as the setting on the native Ubuntu machine, which failed after a few hours with no further notice - I assume memory errors are not caught. The memory need was at some 50 GB with that config. So I am going back to the 10,000,000 setting and will retry.
2022-01-30 19:42:20.560 - INFO: Input triples processed: 6,200,000,000
Similar outcome on the other (MacOS) machine:
./qlever -e
operating system
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G14033
docker version
Docker version 19.03.13, build 4484c46d9d
memory
PhysMem: 64G
diskspace
/dev/disk1s2 5.5Ti 4.0Ti 1.4Ti 74% 28267 4294939012 0% /Volumes/Quaxo
./qlever -wa
...
2022-01-30 22:52:04.121 - INFO: Input triples processed: 4,200,000,000
creating wikidata index finished at Mon Jan 31 00:13:57 CET 2022 after 35038 seconds
Some kind of abort without a hint on what the problem is - I assume we should see more than 13,000,000,000 triples processed, in 5 phases.
ulimit wasn't set on this machine yet. I assume on the MacOS machine it's better to retry with a VMware Ubuntu VM and a physical disk connection. I'll need some time to set such an environment up.
Version 1.20 of the script shows that the ulimit command doesn't work as expected in a native Docker environment. In the VMware environment there are performance issues with disk access. I'll report on this in the context of the hardware requirements discussion.
qlever version : 1.20 $ : 2022/02/02 06:13:51 $
needed software
docker → /usr/local/bin/docker ✅
top → /usr/bin/top ✅
df → /bin/df ✅
jq → /opt/local/bin/jq ✅
sw_vers → /usr/bin/sw_vers ✅
operating system
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G14033
docker version
Docker version 19.03.13, build 4484c46d9d
memory
PhysMem: 64G
diskspace
/dev/disk0s2 446Gi 374Gi 72Gi 84% 2981978 4291985301 0% /
/dev/disk2s2 3.6Ti 2.1Ti 1.5Ti 59% 360 4294966919 0% /Volumes/Owei
/dev/disk1s2 5.5Ti 5.0Ti 494Gi 92% 30591 4294936688 0% /Volumes/Quaxo
./qlever: line 159: ulimit: open files: cannot modify limit: Invalid argument
soft ulimit for files
256
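One possible workaround (an assumption on my part, not something from the script) would be to set the limit for the container itself via Docker's --ulimit option instead of calling ulimit in the host shell:
# hedged sketch: raise the open-files limit inside the container and print it to verify;
# a real call would run the index build instead of "ulimit -Sn"
docker run -it --rm --ulimit nofile=1048576:1048576 -v "$(pwd)":/index -w /index --entrypoint bash adfreiburg/qlever -c "ulimit -Sn"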
On a second native Ubuntu machine with only 32 GB RAM the script fails with an error message:
qlever version : 1.20 $ : 2022/02/02 06:13:51 $
needed software
docker → /usr/bin/docker ✅
top → /usr/bin/top ✅
df → /usr/bin/df ✅
jq → /usr/bin/jq ✅
lsb_release → /usr/bin/lsb_release ✅
free → /usr/bin/free ✅
operating system
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
docker version
Docker version 20.10.7, build 20.10.7-0ubuntu5~20.04.2
memory
total used free shared buff/cache available
Mem: 31Gi 534Mi 29Gi 40Mi 1,2Gi 30Gi
Swap: 2,0Gi 0B 2,0Gi
diskspace
/dev/sda2 228G 34G 184G 16% /
tmpfs 16G 0 16G 0% /dev/shm
/dev/sda1 511M 5,3M 506M 2% /boot/efi
/dev/sdd1 3,6T 2,0T 1,5T 59% /hd/riakob
/dev/sdc1 1,8T 1,8T 23G 99% /hd/wendi
soft ulimit for files
1048576
clone of qlever-code already available
pulling qlever docker image started at Mi 2. Feb 07:42:32 CET 2022
Using default tag: latest
latest: Pulling from adfreiburg/qlever
Digest: sha256:036bde7c6e675c3cf34cfa3f87513af8a43dbd83be736ca83f21c244dfee91f2
Status: Image is up to date for adfreiburg/qlever:latest
docker.io/adfreiburg/qlever:latest
pulling qlever docker image finished at Mi 2. Feb 07:42:34 CET 2022 after 2 seconds
wikidata dump already downloaded
creating wikidata index started at Mi 2. Feb 07:42:34 CET 2022
2022-02-02 06:42:35.645 - INFO: QLever IndexBuilder, compiled on Jan 29 2022 20:19:12
2022-02-02 06:42:35.664 - INFO: You specified the input format: TTL
2022-02-02 06:42:35.666 - ERROR: ASSERT FAILED (f.is_open(); in ../src/index/Index.cpp, line 1358, function void Index::initializeVocabularySettingsBuild() [with Parser = TurtleParserAuto])
creating wikidata index finished at Mi 2. Feb 07:42:35 CET 2022 after 1 seconds
Ok, this assertion refers to the settings JSON file that is passed to IndexBuilderMain via the -s option. It might not be present, or it might not be readable by the user running IndexBuilderMain.
Indeed I had to fix the script to do the check for the availability of the JSON settings file separately from the checkout to avoid this error. It is also the reason for my intention to use jq, to be able to adapt the settings via a command line parameter in the future.
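A minimal sketch of such a separate check and of a jq-based tweak (the file name and the key are taken from the logs above; everything else is illustrative):
# check the settings file separately before starting the index build
SETTINGS=wikidata.settings.json
if [ ! -r "$SETTINGS" ]; then
  echo "settings file $SETTINGS is missing or not readable" >&2
  exit 1
fi
# adapt a single setting from the command line via jq
jq '."num-triples-per-partial-vocab" = 10000000' "$SETTINGS" > "$SETTINGS.tmp" && mv "$SETTINGS.tmp" "$SETTINGS"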
Currently there are two index processes running on two of my machines that have not failed yet.
I intend to move the 4 TB SSD to the 64 GB MacOS machine, and if 32 GB RAM should suffice I'd have 3 machines to work with. If the base hardware and software needs are higher, I'll have to get a new main board and RAM for my personal environment and ask for better hardware at my lab to later do the runs as part of my research.
https://wiki.bitplan.com/index.php/WikiData_Import_2022-01-29 shows the final success after switching to a 128 GB RAM server
https://wiki.bitplan.com/index.php/WikiData_Import_2022-05-22 ended with the prefix tree generation. The script ended with the message:
creating wikidata index finished at Di 24. Mai 03:16:02 CEST 2022 after 44625 seconds
and the files:
ls -l
total 3630709816
-rwx------ 1 wf staff 911 22 Mai 17:47 Qleverfile
drwx------ 1 wf staff 16384 23 Mai 08:07 RCS
-rwx------ 1 wf staff 94653250500 19 Mai 07:35 latest-all.ttl.bz2
-rwx------ 1 wf staff 327629685 21 Mai 01:28 latest-lexemes.ttl.bz2
-rwx------ 1 wf staff 1772814860288 24 Mai 03:01 wikidata-stxxl.disk
-rwx------ 1 wf staff 38003 24 Mai 03:15 wikidata.index-log.txt
-rwx------ 1 wf staff 40 23 Mai 14:52 wikidata.settings.json
-rwx------ 1 wf staff 197180847112 23 Mai 22:25 wikidata.tmp.for-prefix-compression..vocabulary.internal
-rwx------ 1 wf staff 219462801911 24 Mai 00:59 wikidata.vocabulary.internal
It seems no prefix compression has been done - is that a separate step now?
qlever index-stats
This is the "qlever" script, call without argument for help
Executing "index-stats":
readarray -t T < <(sed -En '/INFO: (Processing|Done, total|Converting triples|Creating|Index build|Text index build)/p' wikidata.index-log.txt | cut -c1-19)
Missing key lines in "wikidata.index-log.txt"
@WolfgangFahl How much main memory did that machine have? When the indexer ends prematurely, it's almost always because it runs out of memory (if the operating system's out-of-memory manager killed the process, that can happen without an error message) or because of a too low number for ulimit -Sn (then you get an error message which says something in that vein).
Note that the qlever script now supports setting both num-triples-per-batch and STXXL_MEMORY_GB in the Qleverfile. However, QLever's defaults (10000000 = 10 million for the former and 10 GB for the latter) are fine for Wikidata on a machine with at least 64 GB of RAM (which isn't much by today's standards).
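As an illustration only (the variable names are the ones mentioned in this thread; the exact Qleverfile syntax may differ in the current version of the script), the relevant lines could look roughly like this:
# hypothetical Qleverfile excerpt; the full SETTINGS_JSON value is shown later in this thread
STXXL_MEMORY_GB = 10
SETTINGS_JSON   = '{ "num-triples-per-batch": 10000000, "ascii-prefixes-only": true }'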
The machine had 128 GB RAM and was the same one that successfully ran the attempt in January. In fact, the January attempt indexed on a 64 GB machine. IMHO it is a bug if the indexer silently crashes without notice when running out of memory, and we will not know the reason for crashes if the software is not robust enough to monitor its own state and know why a failure is imminent or has happened.
After 5 attempts in 5 months, of which 4 were unsuccessful, I don't have a running QLever Wikidata index, since the successful index from January is unusable with the current state of the software and was already deleted. For me this is all a bit frustrating given the effort each attempt needs. I am hoping for a version of the software and instructions that will allow for a successful attempt again.
Hi,
There is only 1 element in the external vocabulary, so the prefix compression runs out of memory. I agree that we could try to handle this more gently.
The problem in your concrete case seems to be that the wikidata.settings.json has the wrong contents. The "official" one sets appropriate externalization and also a default locale (the locale is not super important, but it helped me confirm my theory). Can you send us the wikidata.settings.json you used?
@hannahbast Is this file altered by the Qleverfile, and how is it possible to lose the default externalization there?
@WolfgangFahl @joka921 The qlever script also supports a variable for the settings.json. If you do . qlever wikidata in a directory without a Qleverfile, you get a Qleverfile with, in particular, the following line:
SETTINGS_JSON = '{ "languages-internal": ["en"], "prefixes-external": [ "<http://www.wikidata.org/entity/statement", "<http://www.wikidata.org/value", "<http://www.wikidata.org/reference" ], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 25000000 }'
With the qlever script, building a Wikidata index from a fresh Wikidata dump is really easy now. All you have to do is something like this (which doesn't need any supervision, you just type it and wait for a day; I have done it many times already):
mkdir wikidata-latest && cd wikidata-latest
. qlever wikidata
qlever download-data index
@joka921 Can you briefly explain again why it's hard for a C++ program to avoid being killed by the out-of-memory manager? I think that many people imagine that it's just a matter of asking "how much memory is left on the machine" and if not enough memory is left, adjusting accordingly.
Is the problem, in a nutshell, that there is no reliable way to get a truthful answer to the question "how much memory is left on the machine"?
Maybe Wolfgang was still using the old instructions on https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md? They do include a correct wikidata.settings.json, but with a rather large batch size of 50M. I just created a PR which changes that to 10M and includes the ulimit -Sn 1048576 in the command line for building the index. Please check and approve.
Okay, the problem, especially on Linux, is the so-called "overcommit" of memory, aka "malloc never fails". If you allocate too much memory, the OS will not complain but give you a valid pointer to some (virtual!) address. Only once you actually write to too many of the memory pages you have allocated does the OS at some point say "this is too much memory that is ACTUALLY used (not only allocated), I will kill you now" (alternatively, the system starts swapping and gets unusably slow, if the OOM killer is not aggressive enough). This behavior can be deactivated somewhere in the kernel (you then get nullptr from malloc or std::bad_alloc from C++-style allocations), but this is not typically done, for performance reasons.
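For reference, the kernel setting alluded to above is vm.overcommit_memory; a hedged sketch of inspecting it and (at your own risk) switching to strict accounting on Linux:
cat /proc/sys/vm/overcommit_memory       # 0 = heuristic overcommit (the usual default)
sudo sysctl -w vm.overcommit_memory=2    # 2 = strict accounting; allocations can then actually fail
sudo sysctl -w vm.overcommit_ratio=90    # with mode 2: commit limit = swap + 90% of RAM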
Other software also has this issue, Virtuoso also silently crashes if you allow it to use more memory than you have.
Wolfgang somewhat has a point here, the reporting of the errors could be much better in the IndexBuilder.
The index with Wolfgang's settings would have been unusable anyway, because the internal vocabulary is too large (216 GB if I read that correctly) due to the missing externalization.
Thank you for the lively discussion. Please reply at https://github.com/ad-freiburg/qlever/discussions/668 for how I should proceed.
https://wiki.bitplan.com/index.php/WikiData_Import_2022-06-24 reports the latest attempt.
The log file ends with
2022-06-25 09:42:57.517 - INFO: Creating a pair of index permutations ...
https://wiki.bitplan.com/index.php/WikiData_Import_2022-06-24#Resulting_files shows the results.
I fear this is a failed attempt again.
What should be in the log on success, and what should the resulting files look like?
Thanks for the info and the log. You could use this index with the "only load PSO and POS" option; then you can't formulate queries where the predicate is a variable. Please try this out.
Can you quickly check how much space is left on your hard disk/SSD? We have to figure out why this attempt failed (RAM should not be the issue in this last phase).
It seems like the stxxl disk already has 1.7 TB if I read that correctly. In general, QLever requires quite some space on the hard disk while building the index.
But on the bright side: it was almost finished and only took a single day and not 4 :)
Indeed the hard disk is full:
df
/dev/sda1 3844660232 3649293016 0 100% /hd/seel
So I assume a check for minimum disk space is due before starting the procedure. I wonder why the February attempt was successful - was the increase of triples in Wikidata itself from February to May the cause, or do the files created by QLever need more space by now? I could happily run the indexer on a 10 TB rotating disk if that is the better option...
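A minimal sketch of such a pre-flight check (the threshold is a guess, not an official QLever requirement):
REQUIRED_GB=2000   # guessed lower bound for a full Wikidata build
AVAIL_GB=$(df --output=avail -BG . | tail -1 | tr -dc '0-9')
if [ "$AVAIL_GB" -lt "$REQUIRED_GB" ]; then
  echo "only ${AVAIL_GB} GB free, need at least ${REQUIRED_GB} GB" >&2
  exit 1
fi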
4 TB is more than enough to build Wikidata using QLever. The question is: how many TB were free on your machine, to begin with?
A general remark: It is certainly interesting to have your experiences with building Wikidata on machines with very tight resources. And QLever is indeed the engine of choice for that. But to put things into perspective, in all the discussions about the new backend for the Wikidata Query Service, folks are considering huge servers and even server farms to be able to cope with the sheer size of the data. You are on the opposite side of that spectrum.
Moved this to https://github.com/ad-freiburg/qlever/discussions/668, since a large part of the problems are specific to the setup of a particular user.
This issue is about the steps done by the script as outlined in #562. The steps have been successful as outlined in https://wiki.bitplan.com/index.php/WikiData_Import_2022-01-29.
./qlever -e
fails with the message:
I assume the docker build might need to use the older build file - might that be the culprit?