ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

ETA Calculation for indexing #565

Closed: WolfgangFahl closed this issue 2 years ago

WolfgangFahl commented 2 years ago

From my experience with my https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData attempts, I have seen that indexing performance over time might not be linear. For Apache Jena this was a major issue when estimating how long indexing would take. In 2020 I therefore created a script to collect the index performance data and analyze the results with Excel.

https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits shows a graphical representation.

Based on the indexing performance function, the Estimated Time of Arrival (a.k.a. completion time) could be calculated to improve the progress display of the indexing.

```
2022-01-30 08:20:50.890 - INFO: Input triples processed: 600,000,000
```

Since even counting all triples might take too long, a count estimate could be made by checking the file size against the progress, or an accurate triple count could be supplied as an input.
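The ETA arithmetic itself is simple. A minimal awk sketch (the layout is illustrative; the numbers are the 600M-triple log line above, the expected total of 13,602,102,648 triples, and the 1,784 s elapsed at that point, all taken from the progress table below):

```bash
awk 'BEGIN {
  processed = 600000000        # triples parsed so far (log line above)
  total     = 13602102648      # expected total triple count
  elapsed   = 1784             # seconds since indexing started
  rate = processed / elapsed   # average triples per second so far
  togo = (total - processed) / rate
  printf "rate: %.3f M/s, time to go: %.0f s (%.1f h)\n",
         rate / 1e6, togo, togo / 3600
}'
```

This prints a rate of 0.336 M/s and about 38,660 s (10.7 h) to go, matching the 19:05 ETA in the table below.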

For Wikidata, the current triple count is available via a SPARQL query.

The query fails on QLever.
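The query in question is presumably a plain triple count, along these lines (a generic sketch against the public Wikidata endpoint, not necessarily the exact query behind the link; such a full scan may well hit the endpoint's timeout):

```bash
# count all triples on the public Wikidata SPARQL endpoint
curl -G 'https://query.wikidata.org/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }'
```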

WolfgangFahl commented 2 years ago

The import seems to show linear triples-over-time behavior; see Progress2022-01-30.xlsx:

Expected total: 13,602,102,648 triples; indexing started 30.01.22 07:51.

| Time | Δs | ∑s | Triples | Δ million/s | ØΔ million/s | Todo (mill) | Togo (s) | ETA |
|---|---|---|---|---|---|---|---|---|
| 30.01.22 07:56 | 332 | 332 | 100000000 | 0,301 | 0,301 | 13502 | 44827 | 30.01.22 20:23 |
| 30.01.22 08:01 | 300 | 632 | 200000000 | 0,333 | 0,316 | 13402 | 42351 | 30.01.22 19:47 |
| 30.01.22 08:06 | 277 | 909 | 300000000 | 0,361 | 0,330 | 13302 | 40305 | 30.01.22 19:18 |
| 30.01.22 08:11 | 285 | 1194 | 400000000 | 0,351 | 0,335 | 13202 | 39408 | 30.01.22 19:07 |
| 30.01.22 08:15 | 294 | 1488 | 500000000 | 0,340 | 0,336 | 13102 | 38992 | 30.01.22 19:05 |
| 30.01.22 08:20 | 296 | 1784 | 600000000 | 0,338 | 0,336 | 13002 | 38660 | 30.01.22 19:05 |
| 30.01.22 08:25 | 302 | 2086 | 700000000 | 0,331 | 0,336 | 12902 | 38448 | 30.01.22 19:06 |
| 30.01.22 08:30 | 298 | 2384 | 800000000 | 0,336 | 0,336 | 12802 | 38150 | 30.01.22 19:06 |
| 30.01.22 08:35 | 295 | 2679 | 900000000 | 0,339 | 0,336 | 12702 | 37810 | 30.01.22 19:05 |
| 30.01.22 08:40 | 296 | 2975 | 1000000000 | 0,338 | 0,336 | 12602 | 37491 | 30.01.22 19:05 |
| 30.01.22 08:45 | 292 | 3267 | 1100000000 | 0,342 | 0,337 | 12502 | 37131 | 30.01.22 19:04 |
| 30.01.22 08:50 | 296 | 3563 | 1200000000 | 0,338 | 0,337 | 12402 | 36824 | 30.01.22 19:04 |
| 30.01.22 08:55 | 310 | 3873 | 1300000000 | 0,323 | 0,336 | 12302 | 36651 | 30.01.22 19:06 |
| 30.01.22 09:00 | 305 | 4178 | 1400000000 | 0,328 | 0,335 | 12202 | 36415 | 30.01.22 19:07 |
| 30.01.22 09:05 | 310 | 4488 | 1500000000 | 0,323 | 0,334 | 12102 | 36209 | 30.01.22 19:09 |
| 30.01.22 09:11 | 325 | 4813 | 1600000000 | 0,308 | 0,332 | 12002 | 36104 | 30.01.22 19:13 |
| 30.01.22 09:16 | 329 | 5142 | 1700000000 | 0,304 | 0,331 | 11902 | 36000 | 30.01.22 19:16 |
| 30.01.22 09:21 | 308 | 5450 | 1800000000 | 0,325 | 0,330 | 11802 | 35734 | 30.01.22 19:17 |
| 30.01.22 09:27 | 330 | 5780 | 1900000000 | 0,303 | 0,329 | 11702 | 35599 | 30.01.22 19:20 |
| 30.01.22 09:32 | 309 | 6089 | 2000000000 | 0,324 | 0,328 | 11602 | 35323 | 30.01.22 19:21 |
| 30.01.22 09:37 | 313 | 6402 | 2100000000 | 0,319 | 0,328 | 11502 | 35065 | 30.01.22 19:22 |
| 30.01.22 09:42 | 292 | 6694 | 2200000000 | 0,342 | 0,329 | 11402 | 34693 | 30.01.22 19:20 |
| 30.01.22 09:47 | 299 | 6993 | 2300000000 | 0,334 | 0,329 | 11302 | 34363 | 30.01.22 19:20 |
| 30.01.22 09:52 | 279 | 7272 | 2400000000 | 0,358 | 0,330 | 11202 | 33942 | 30.01.22 19:18 |
| 30.01.22 09:57 | 295 | 7567 | 2500000000 | 0,339 | 0,330 | 11102 | 33604 | 30.01.22 19:17 |
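(The steady average of 0,336 million triples/s corresponds to roughly 1.2 billion input triples per hour.)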
hannahbast commented 2 years ago

Yes, indexing time is more or less linear in the number of input triples. A rule of thumb is 1 billion triples per hour on a (modern) standard PC.

Note that what you are measuring above is just the first phase of the indexing (parsing the input triples). That phase should run at around 5 billion triples per hour. There are more phases after that; the whole indexing takes about five times as long as the first phase alone.
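For the roughly 13.6 billion input triples above, that rule of thumb works out to about 13.6/5 ≈ 2.7 hours for the parsing phase at full speed and around 13.6 hours for the complete build.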

WolfgangFahl commented 2 years ago

Sounds reasonable, given that I reduced the settings by a factor of 5 to avoid hitting a memory limit. So it seems there is a tradeoff between speed and memory. Memory usage is currently at 12-15 GB, so I could increase the settings again to make better use of the available 64 GB. I think I'll start another try on the macOS machine with the updated script, where 12 cores and 64 GB of RAM are available.

hannahbast commented 2 years ago

There is no need to restart; the batch size has only a relatively small influence on total indexing time. A batch size of 10M is just fine. However, you then need to check `ulimit -Sn` on your machine (the number of files that can be opened simultaneously). The default setting on some machines is just 1024, which is not enough for Wikidata (QLever writes and reads two temporary files per batch). Set it to something like 1048576.

WolfgangFahl commented 2 years ago

I tried to find out how to modify the `ulimit -Sn` setting; it looked like a kernel parameter that would need a reboot. How do I modify that setting? https://serverfault.com/questions/216656/how-to-set-systemwide-ulimit-on-ubuntu was the article I found, and I assume the `nofile` setting is the necessary change. https://unix.stackexchange.com/questions/8945/how-can-i-increase-open-files-limit-for-all-processes mentions that two files need to be changed.
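For the record, the usual persistent, system-wide change on Ubuntu (per the linked answers; the file and syntax may vary by distribution) is a `nofile` entry in /etc/security/limits.conf, which takes effect at the next login rather than requiring a reboot:

```
# /etc/security/limits.conf: raise the open-file limit for all users
*  soft  nofile  1048576
*  hard  nofile  1048576
```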

In the script itself I am going to add:

```bash
ulimit -Sn 1048576
```
hannahbast commented 2 years ago

@WolfgangFahl It's good enough if you change it for the shell in which you are running the index builder, is it not? So adding it to your script is just fine, as long as you change it in the right spot.

WolfgangFahl commented 2 years ago

@hannahbast the script with the modification seems to work: yesterday's attempt already created more than 1024 files, but it was stopped by an unwanted server shutdown that had nothing to do with the QLever environment. This morning I started another try.

WolfgangFahl commented 2 years ago

For the https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-11 import I am using a script to trace the progress with a spreadsheet. On my Mac, Apple Numbers will happily display the formulas:

```bash
#!/bin/bash
# WF 2022-03-12
# get the relevant log information for the indexer

#
# get the relevant lines from the log
#
log() {
  local l_logfile="$1"
  egrep "Processing input triples|Input triples processed" "$l_logfile"
}
expected=16979   # expected total in million triples
echo 'day;time;mill triples;duration;mill triples/sec;todo;ETA' > stats.csv
echo ";;$expected;;" >> stats.csv
log qlever-indices/wikidata/wikidata.index-log.txt \
    | cut -f1,2,8 -d' ' \
    | sed 's/ /;/g' \
    | sed 's/   -//g' \
    | sed 's/,//g' \
    | sed 's/\.[[:digit:]]\+//g' \
| awk '
BEGIN { FS=";" }
{
  date=$1
  time=$2
  triples=$3
  # the "Processing input triples from ..." start line only yields a timestamp
  if (triples=="from")
    printf("%s;%s;;;;\n",date,time)
  else {
    # two header rows precede the data, so spreadsheet row = NR+2
    row=NR+2
    printf("%s;%s;%s;=B%d-B$3;=C%d/D%d/3600;=C$2-C%d;=F%d/E%d\n",date,time,triples/1000000,row,row,row,row,row,row)
  }
}
' >> stats.csv
cat stats.csv
# open in a spreadsheet application
open stats.csv
```
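To illustrate the emitted formulas: the two header `echo`s occupy spreadsheet rows 1-2, the `Processing input triples from` log line becomes row 3 (start timestamp only), and the first `Input triples processed` line (NR=2, hence row 4) comes out as a row of spreadsheet formulas. With the 600M log line from January as sample input, that row would be:

```
2022-01-30;08:20:50;600;=B4-B$3;=C4/D4/3600;=C$2-C4;=F4/E4
```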
WolfgangFahl commented 2 years ago

When the other phases start I intend to adapt the script.

WolfgangFahl commented 2 years ago

```bash
#!/bin/bash
# WF 2022-03-12
# get the relevant log information for the indexer
# $Header: /hd/seel/qlever/RCS/logstats,v 1.5 2022/06/28 05:43:00 wf Exp wf $

logfile=wikidata/wikidata.index-log.txt
stats=stats.csv
echo 'day;time;phase;mill triples;duration;mill triples/h;todo;ETA h' > /tmp/$stats
decimalsep=","
cat $logfile \
    | sed 's/ /;/g' \
    | sed 's/   -//g' \
    | sed 's/,//g' \
    | sed 's/\.[[:digit:]]\+//g' \
| awk -v expectedTriples=17500 -v expectedBoM=800 -v expectedUoM=3200 -v expectedConversion=28300 -v expectedWords=800 '
BEGIN {
  # field separator
  FS=";"
  # double quote
  quote="\x22"
  result="Index building in progress ..."
}
# default extraction from each line
# 2022-05-22 17:48:22.564 ...
{
  date=$1
  time=$2
}
# start of the processing phase
# 2022-05-22 17:48:22.564   - INFO:  Processing input triples from /dev/stdin ...
/Processing;input;triples;from/ {
  phase="Processing"
  printStartPhase(date,time,phase,expectedTriples)
  row=3
  next
}
# while processing
# 2022-05-23 00:09:50.846   - INFO:  Input triples processed: 17,400,000,000
/Input;triples;processed:;/ {
  triples=$8
  next
}
# start of byte order merging
# 2022-05-23 00:10:52.614   - INFO:  Merging partial vocabularies in byte order (internal only) ...
/Merging;partial;vocabularies;in;byte;order/ {
  printrow(date,time,triples,row,phase)
  phase="Byte order merging"
  printStartPhase(date,time,phase,expectedBoM)
  row=5
  next
}
/Words;merged:;/ {
  triples=$7
  next
}
/Words;processed:;/ {
  triples=$7
  next
}
/Merging;partial;vocabularies;in;Unicode;order/ {
  printrow(date,time,triples,row,phase)
  phase="Unicode order merging"
  printStartPhase(date,time,phase,expectedUoM)
  row=7
  next
}
/Converting;triples;from;local;/ {
  printrow(date,time,triples,row,phase)
  phase="Triple conversion"
  printStartPhase(date,time,phase,expectedConversion)
  row=9
}
/Triples;converted:;/ {
  triples=$7
  next
}
/Building;/ {
  printrow(date,time,triples,row,phase)
  phase="Prefix tree"
  printStartPhase(date,time,phase,expectedWords)
  row=11
  next
}
/Computing;maximally/ {
  printrow(date,time,triples,row,phase)
  phase="Compressing prefixes"
  printStartPhase(date,time,phase,expectedTriples)
  row=13
  triples=expectedTriples*1000000
  next
}
/Writing;compressed;vocabulary/ {
  printrow(date,time,triples,row,phase)
  phase="PSO/POS index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=15
  triples=expectedTriples*1000000
  next
}
/Writing;meta;data;for;PSO/ {
  printrow(date,time,triples,row,phase)
  phase="SPO/SOP index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=17
  triples=expectedTriples*1000000
  next
}
/Writing;meta;data;for;SPO/ {
  printrow(date,time,triples,row,phase)
  phase="OSP/OPS index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=19
  triples=expectedTriples*1000000
  next
}
/Index;build;completed/ {
  result="Finished Index building successfully✅"
  printrow(date,time,triples,row,phase)
  # SUMME is the German-locale spreadsheet name for SUM
  printf(";;total;;=SUMME(E$2:E%d)\n",row)
}
# print the start row of the given phase
function printStartPhase(date,time,phase,expected) {
  printf("%s;%s;%s;%d;;;\n",date,time,phase,expected)
}
# print a data row for the given phase
# (Runden is the German-locale spreadsheet name for ROUND)
function printrow(date,time,triples,row,phase) {
  printf("%s;%s;%s;%s;=(A%d+B%d)-(A%d+B%d);%s=Runden(D%d/E%d;0)%s;=D%d-D%d;%s=Runden(G%d/F%d;1)%s\n",date,time,phase,triples/1000000,row,row,row-1,row-1,quote,row,row,quote,row-1,row,quote,row,row,quote)
}
END {
  printf("%s\n", result)
}
' >> /tmp/$stats
cat /tmp/$stats | sed "s/\./$decimalsep/g" > $stats
cat stats.csv
# open in a spreadsheet application
open stats.csv
```
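Assuming the script is saved as `logstats` (the name in its RCS header) and made executable, running `./logstats` in the directory containing `wikidata/wikidata.index-log.txt` writes a per-phase `stats.csv` and opens it; note that the `open` call is macOS-specific.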
hannahbast commented 2 years ago

Revisiting this after some time. The index-building log has been much improved in the meantime, with good progress information. Also, the qlever script has been available for quite some time now.