The import seems to have a linear triples-over-time behavior: see Progress2022-01-30.xlsx
Time | Δ s | ∑ s | Triples | Δ Mtriples/s | Ø Mtriples/s | Todo (Mtriples) | To go (s) | ETA |
---|---|---|---|---|---|---|---|---|
13602102648 (total) | ||||||||
30.01.22 07:51 (start) | ||||||||
30.01.22 07:56 | 332 | 332 | 100000000 | 0,301 | 0,301 | 13502 | 44827 | 30.01.22 20:23 |
30.01.22 08:01 | 300 | 632 | 200000000 | 0,333 | 0,316 | 13402 | 42351 | 30.01.22 19:47 |
30.01.22 08:06 | 277 | 909 | 300000000 | 0,361 | 0,330 | 13302 | 40305 | 30.01.22 19:18 |
30.01.22 08:11 | 285 | 1194 | 400000000 | 0,351 | 0,335 | 13202 | 39408 | 30.01.22 19:07 |
30.01.22 08:15 | 294 | 1488 | 500000000 | 0,340 | 0,336 | 13102 | 38992 | 30.01.22 19:05 |
30.01.22 08:20 | 296 | 1784 | 600000000 | 0,338 | 0,336 | 13002 | 38660 | 30.01.22 19:05 |
30.01.22 08:25 | 302 | 2086 | 700000000 | 0,331 | 0,336 | 12902 | 38448 | 30.01.22 19:06 |
30.01.22 08:30 | 298 | 2384 | 800000000 | 0,336 | 0,336 | 12802 | 38150 | 30.01.22 19:06 |
30.01.22 08:35 | 295 | 2679 | 900000000 | 0,339 | 0,336 | 12702 | 37810 | 30.01.22 19:05 |
30.01.22 08:40 | 296 | 2975 | 1000000000 | 0,338 | 0,336 | 12602 | 37491 | 30.01.22 19:05 |
30.01.22 08:45 | 292 | 3267 | 1100000000 | 0,342 | 0,337 | 12502 | 37131 | 30.01.22 19:04 |
30.01.22 08:50 | 296 | 3563 | 1200000000 | 0,338 | 0,337 | 12402 | 36824 | 30.01.22 19:04 |
30.01.22 08:55 | 310 | 3873 | 1300000000 | 0,323 | 0,336 | 12302 | 36651 | 30.01.22 19:06 |
30.01.22 09:00 | 305 | 4178 | 1400000000 | 0,328 | 0,335 | 12202 | 36415 | 30.01.22 19:07 |
30.01.22 09:05 | 310 | 4488 | 1500000000 | 0,323 | 0,334 | 12102 | 36209 | 30.01.22 19:09 |
30.01.22 09:11 | 325 | 4813 | 1600000000 | 0,308 | 0,332 | 12002 | 36104 | 30.01.22 19:13 |
30.01.22 09:16 | 329 | 5142 | 1700000000 | 0,304 | 0,331 | 11902 | 36000 | 30.01.22 19:16 |
30.01.22 09:21 | 308 | 5450 | 1800000000 | 0,325 | 0,330 | 11802 | 35734 | 30.01.22 19:17 |
30.01.22 09:27 | 330 | 5780 | 1900000000 | 0,303 | 0,329 | 11702 | 35599 | 30.01.22 19:20 |
30.01.22 09:32 | 309 | 6089 | 2000000000 | 0,324 | 0,328 | 11602 | 35323 | 30.01.22 19:21 |
30.01.22 09:37 | 313 | 6402 | 2100000000 | 0,319 | 0,328 | 11502 | 35065 | 30.01.22 19:22 |
30.01.22 09:42 | 292 | 6694 | 2200000000 | 0,342 | 0,329 | 11402 | 34693 | 30.01.22 19:20 |
30.01.22 09:47 | 299 | 6993 | 2300000000 | 0,334 | 0,329 | 11302 | 34363 | 30.01.22 19:20 |
30.01.22 09:52 | 279 | 7272 | 2400000000 | 0,358 | 0,330 | 11202 | 33942 | 30.01.22 19:18 |
30.01.22 09:57 | 295 | 7567 | 2500000000 | 0,339 | 0,330 | 11102 | 33604 | 30.01.22 19:17 |
Yes, indexing time is more or less linear in the number of input triples. A rule of thumb is 1 billion triples per hour on a (modern) standard PC.
Note that what you are measuring above is just the first phase of the indexing (the parsing of the input triples). That phase should run at a speed of around 5 billion triples per hour. There are more phases after that; the whole indexing takes five times as long as the first phase alone.
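For scale: at 5 billion triples per hour, parsing the ~13.6 billion Wikidata triples would take just under 3 hours; the ~0.33 million triples/s measured in the table above corresponds to only ~1.2 billion per hour, i.e. about a fifth of that speed.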
Sounds reasonable given that I reduced the settings by a factor of 5 to avoid hitting a memory limit. So it seems there is a tradeoff between speed and memory. Memory usage is currently at 12-15 GB, so I could increase the settings again to make better use of the available 64 GB. I think I'll start another try on the macOS machine with the updated script, where 12 cores and 64 GB of RAM are available.
There is no need to restart; the batch size only has a relatively small influence on the total indexing time. A batch size of 10M is just fine. However, you then need to check `ulimit -Sn` on your machine (the number of files that can be opened simultaneously). The default setting on some machines is just 1024; that's not enough for Wikidata (QLever writes and reads two temporary files per batch). Set it to something like 1048576.
I tried to find out how to modify the `ulimit -Sn` setting - it looks like it is a kernel parameter and would need a reboot. How do I modify that setting? https://serverfault.com/questions/216656/how-to-set-systemwide-ulimit-on-ubuntu was the article I found, and I assume the nofile setting is the necessary change. https://unix.stackexchange.com/questions/8945/how-can-i-increase-open-files-limit-for-all-processes mentions that two files need to be changed.
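Following those articles, a persistent system-wide change would presumably be `nofile` entries in /etc/security/limits.conf (with the pam_limits module enabled, effective after re-login rather than reboot); a sketch, assuming all users should get the higher limit:

```
# /etc/security/limits.conf - assumed entries, following the linked articles
*    soft    nofile    1048576
*    hard    nofile    1048576
```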
In the script itself I am going to add:

```bash
ulimit -Sn 1048576
```
@WolfgangFahl It's good enough if you change it for the shell in which you are running the index builder, is it not? So adding it to your script is just fine, as long as you change it in the right spot.
@hannahbast the script with the modification seems to work - an attempt yesterday already created more than 1024 files but was stopped by an unwanted server shutdown that had nothing to do with the QLever environment. This morning I started another try.
For the https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-11 attempt I am using a script to trace the progress with a spreadsheet. On my Mac, Apple Numbers will happily display the formulas:
```bash
#!/bin/bash
# WF 2022-03-12
# get the relevant log information for the indexer

# get the relevant lines from the log
log() {
  local l_logfile="$1"
  egrep "Processing input triples|Input triples processed" "$l_logfile"
}

# expected total number of triples in millions
expected=16979
echo 'day;time;mill triples;duration;mill triples/sec;todo;ETA' > stats.csv
echo ";;$expected;;" >> stats.csv
log qlever-indices/wikidata/wikidata.index-log.txt \
  | cut -f1,2,8 -d' ' \
  | sed 's/ /;/g' \
  | sed 's/,//g' \
  | sed 's/\.[[:digit:]]\+//g' \
  | awk '
    BEGIN { FS=";" }
    {
      date=$1
      time=$2
      triples=$3
      # the "Processing input triples from ..." line marks the start of the import
      if (triples=="from")
        printf("%s;%s;;;;\n",date,time)
      else {
        # spreadsheet row number: header and expected-count row come first
        row=NR+2
        printf("%s;%s;%s;=B%d-B$3;=C%d/D%d/3600;=C$2-C%d;=F%d/E%d\n",date,time,triples/1000000,row,row,row,row,row,row)
      }
    }
  ' >> stats.csv
cat stats.csv
# open in spreadsheet
open stats.csv
```
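For illustration, the first lines of the generated stats.csv would look roughly like this (timestamps hypothetical; the formula cells are evaluated by the spreadsheet, with row 2 holding the expected total and row 3 the start time):

```
day;time;mill triples;duration;mill triples/sec;todo;ETA
;;16979;;
2022-03-12;08:00:00;;;;
2022-03-12;08:05:32;100;=B4-B$3;=C4/D4/3600;=C$2-C4;=F4/E4
2022-03-12;08:11:05;200;=B5-B$3;=C5/D5/3600;=C$2-C5;=F5/E5
```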
When the other phases start I intend to adapt the script:
```bash
#!/bin/bash
# WF 2022-03-12
# get the relevant log information for the indexer
# $Header: /hd/seel/qlever/RCS/logstats,v 1.5 2022/06/28 05:43:00 wf Exp wf $
logfile=wikidata/wikidata.index-log.txt
stats=stats.csv
echo 'day;time;phase;mill triples;duration;mill triples/h;todo;ETA h' > /tmp/$stats
# decimal separator used by the target spreadsheet locale
decimalsep=","
cat "$logfile" \
  | sed 's/ /;/g' \
  | sed 's/,//g' \
  | sed 's/\.[[:digit:]]\+//g' \
  | awk -v expectedTriples=17500 -v expectedBoM=800 -v expectedUoM=3200 -v expectedConversion=28300 -v expectedWords=800 '
BEGIN {
  # field separator
  FS=";"
  # double quote
  quote="\x22"
  result="Index building in progress ..."
}
# default extraction from every line
# 2022-05-22 17:48:22.564 ...
{
  date=$1
  time=$2
}
# start of Processing phase
# 2022-05-22 17:48:22.564 - INFO: Processing input triples from /dev/stdin ...
/Processing;input;triples;from/ {
  phase="Processing"
  printStartPhase(date,time,phase,expectedTriples)
  row=3
  next
}
# while processing
# 2022-05-23 00:09:50.846 - INFO: Input triples processed: 17,400,000,000
/Input;triples;processed:;/ {
  triples=$8
  next
}
# start of byte order merging
# 2022-05-23 00:10:52.614 - INFO: Merging partial vocabularies in byte order (internal only) ...
/Merging;partial;vocabularies;in;byte;order/ {
  printrow(date,time,triples,row,phase)
  phase="Byte order merging"
  printStartPhase(date,time,phase,expectedBoM)
  row=5
  next
}
/Words;merged:;/ {
  triples=$7
  next
}
/Words;processed:;/ {
  triples=$7
  next
}
/Merging;partial;vocabularies;in;Unicode;order/ {
  printrow(date,time,triples,row,phase)
  phase="Unicode order merging"
  printStartPhase(date,time,phase,expectedUoM)
  row=7
  next
}
/Converting;triples;from;local;/ {
  printrow(date,time,triples,row,phase)
  phase="Triple conversion"
  printStartPhase(date,time,phase,expectedConversion)
  row=9
  next
}
/Triples;converted:;/ {
  triples=$7
  next
}
/Building;/ {
  printrow(date,time,triples,row,phase)
  phase="Prefix tree"
  printStartPhase(date,time,phase,expectedWords)
  row=11
  next
}
/Computing;maximally/ {
  printrow(date,time,triples,row,phase)
  phase="Compressing prefixes"
  printStartPhase(date,time,phase,expectedTriples)
  row=13
  triples=expectedTriples*1000000
  next
}
/Writing;compressed;vocabulary/ {
  printrow(date,time,triples,row,phase)
  phase="PSO/POS index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=15
  triples=expectedTriples*1000000
  next
}
/Writing;meta;data;for;PSO/ {
  printrow(date,time,triples,row,phase)
  phase="SPO/SOP index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=17
  triples=expectedTriples*1000000
  next
}
/Writing;meta;data;for;SPO/ {
  printrow(date,time,triples,row,phase)
  phase="OSP/OPS index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=19
  triples=expectedTriples*1000000
  next
}
/Index;build;completed/ {
  result="Finished Index building successfully✅"
  printrow(date,time,triples,row,phase)
  printf(";;total;;=SUMME(E$2:E%d)\n",row)
}
# print the start row of the given phase with its expected duration
function printStartPhase(date,time,phase,expected) {
  printf("%s;%s;%s;%d;;;\n",date,time,phase,expected)
}
# print a summary row for the given phase with spreadsheet formulas
function printrow(date,time,triples,row,phase) {
  printf("%s;%s;%s;%s;=(A%d+B%d)-(A%d+B%d);%s=Runden(D%d/E%d;0)%s;=D%d-D%d;%s=Runden(G%d/F%d;1)%s\n",date,time,phase,triples/1000000,row,row,row-1,row-1,quote,row,row,quote,row-1,row,quote,row,row,quote)
}
END {
  printf("%s\n",result)
}
' >> /tmp/$stats
# convert the decimal point to the locale decimal separator
sed "s/\./$decimalsep/g" /tmp/$stats > $stats
cat stats.csv
# open in spreadsheet
open stats.csv
```
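The resulting stats.csv then contains one start row (with the expected duration in the "mill triples" column) and one formula-cell summary row per phase, roughly like this (using the timestamps from the log excerpts quoted in the script's comments):

```
day;time;phase;mill triples;duration;mill triples/h;todo;ETA h
2022-05-22;17:48:22;Processing;17500;;;
2022-05-23;00:10:52;Processing;17400;=(A3+B3)-(A2+B2);"=Runden(D3/E3;0)";=D2-D3;"=Runden(G3/F3;1)"
2022-05-23;00:10:52;Byte order merging;800;;;
```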
Revisiting this after some time. The index building log has been much improved in the meantime, with good progress info. Also, the qlever script has been available for quite some time now.
From my experience with my https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData attempts I have seen that the indexing performance over time might not be linear. In the case of Apache Jena this was a major issue for estimating how long indexing takes. In 2020 I had therefore created a script to gather the indexing performance data and analyze the results with Excel.
https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits shows a graphical representation.
Based on the indexing performance function, the Estimated Time of Arrival (i.e. completion) could be calculated to improve the progress display of the indexing, which currently only reports the raw count:

```
2022-01-30 08:20:50.890 - INFO: Input triples processed: 600,000,000
```
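With a known (or estimated) total count, the linear extrapolation behind the ETA columns above is only a few lines; a minimal sketch, fed with the values from the 600-million row of the table (13602 million triples total, 600 million processed after 1784 s):

```bash
# linear ETA estimate; counts in million triples, elapsed time in seconds
awk -v total=13602 -v processed=600 -v elapsed=1784 'BEGIN {
  rate = processed / elapsed          # million triples per second
  togo = (total - processed) / rate   # seconds still to go
  printf("rate=%.3f M/s, togo=%.0f s\n", rate, togo)  # ~0.336 M/s, ~38660 s
}'
```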
Since even counting all triples might take too long, a count estimate could be derived by checking the file size against the progress, or an accurate triple count could be given as an input.
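A minimal sketch of the file-size approach; the file name and the bytes-per-triple ratio are assumptions and would need to be calibrated on a sample of the actual dump:

```bash
#!/bin/bash
# rough triple-count estimate from the size of an uncompressed Turtle dump
dump=latest-all.ttl      # assumed dump file name
bytes_per_triple=170     # assumed average, calibrate on a sample!
size=$(stat -c%s "$dump")   # GNU stat; on macOS use: stat -f%z
echo "$((size / bytes_per_triple)) triples (rough estimate)"
```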
For Wikidata the current triple count can be obtained via a SPARQL query.
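Presumably that is the standard count over all triples, something like:

```sparql
# count all triples in the store (assumption: this is what the original linked query does)
SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }
```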
This query currently fails on QLever, however.