ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

ETA Calculation for indexing #565

Closed: WolfgangFahl closed this issue 2 years ago

WolfgangFahl commented 2 years ago

From my experience with my https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData attempts, I have seen that indexing performance over time might not be linear. For Apache Jena this was a major issue when estimating how long indexing would take. In 2020 I therefore created a script to collect the index performance data and analyze the results with Excel.

https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits shows a graphical representation.

Based on the indexing performance function, the Estimated Time of Arrival (a.k.a. completion time) could be calculated to improve the progress display of the indexing.

```
2022-01-30 08:20:50.890 - INFO: Input triples processed: 600,000,000
```

Since even counting all triples might take too long, a count estimate could be made by checking the file size against the progress, or an accurate triple count could be supplied as an input.
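The ETA arithmetic itself is simple. A minimal awk sketch (the layout is illustrative; the numbers are the 600M-triple log line above, the expected total of 13,602,102,648 triples, and the 1,784 s elapsed at that point, all taken from the progress table below):

```bash
awk 'BEGIN {
  processed = 600000000        # triples parsed so far (log line above)
  total     = 13602102648      # expected total triple count
  elapsed   = 1784             # seconds since indexing started
  rate = processed / elapsed   # average triples per second so far
  togo = (total - processed) / rate
  printf "rate: %.3f M/s, time to go: %.0f s (%.1f h)\n",
         rate / 1e6, togo, togo / 3600
}'
```

This prints a rate of 0.336 M/s and about 38,660 s (10.7 h) to go, matching the 19:05 ETA in the table below.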

For Wikidata, the current triple count is available via a SPARQL query.

The query fails on QLever.
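The query in question is presumably a plain triple count, along these lines (a generic sketch against the public Wikidata endpoint, not necessarily the exact query behind the link; such a full scan may well hit the endpoint's timeout):

```bash
# count all triples on the public Wikidata SPARQL endpoint
curl -G 'https://query.wikidata.org/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }'
```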

WolfgangFahl commented 2 years ago

The import seems to show linear triples-over-time behavior; see Progress2022-01-30.xlsx:

Expected total: 13,602,102,648 triples; indexing started 30.01.22 07:51.

| Time | Δs | ∑s | Triples | Δ million/s | ØΔ million/s | Todo (mill) | Togo (s) | ETA |
|---|---|---|---|---|---|---|---|---|
| 30.01.22 07:56 | 332 | 332 | 100000000 | 0,301 | 0,301 | 13502 | 44827 | 30.01.22 20:23 |
| 30.01.22 08:01 | 300 | 632 | 200000000 | 0,333 | 0,316 | 13402 | 42351 | 30.01.22 19:47 |
| 30.01.22 08:06 | 277 | 909 | 300000000 | 0,361 | 0,330 | 13302 | 40305 | 30.01.22 19:18 |
| 30.01.22 08:11 | 285 | 1194 | 400000000 | 0,351 | 0,335 | 13202 | 39408 | 30.01.22 19:07 |
| 30.01.22 08:15 | 294 | 1488 | 500000000 | 0,340 | 0,336 | 13102 | 38992 | 30.01.22 19:05 |
| 30.01.22 08:20 | 296 | 1784 | 600000000 | 0,338 | 0,336 | 13002 | 38660 | 30.01.22 19:05 |
| 30.01.22 08:25 | 302 | 2086 | 700000000 | 0,331 | 0,336 | 12902 | 38448 | 30.01.22 19:06 |
| 30.01.22 08:30 | 298 | 2384 | 800000000 | 0,336 | 0,336 | 12802 | 38150 | 30.01.22 19:06 |
| 30.01.22 08:35 | 295 | 2679 | 900000000 | 0,339 | 0,336 | 12702 | 37810 | 30.01.22 19:05 |
| 30.01.22 08:40 | 296 | 2975 | 1000000000 | 0,338 | 0,336 | 12602 | 37491 | 30.01.22 19:05 |
| 30.01.22 08:45 | 292 | 3267 | 1100000000 | 0,342 | 0,337 | 12502 | 37131 | 30.01.22 19:04 |
| 30.01.22 08:50 | 296 | 3563 | 1200000000 | 0,338 | 0,337 | 12402 | 36824 | 30.01.22 19:04 |
| 30.01.22 08:55 | 310 | 3873 | 1300000000 | 0,323 | 0,336 | 12302 | 36651 | 30.01.22 19:06 |
| 30.01.22 09:00 | 305 | 4178 | 1400000000 | 0,328 | 0,335 | 12202 | 36415 | 30.01.22 19:07 |
| 30.01.22 09:05 | 310 | 4488 | 1500000000 | 0,323 | 0,334 | 12102 | 36209 | 30.01.22 19:09 |
| 30.01.22 09:11 | 325 | 4813 | 1600000000 | 0,308 | 0,332 | 12002 | 36104 | 30.01.22 19:13 |
| 30.01.22 09:16 | 329 | 5142 | 1700000000 | 0,304 | 0,331 | 11902 | 36000 | 30.01.22 19:16 |
| 30.01.22 09:21 | 308 | 5450 | 1800000000 | 0,325 | 0,330 | 11802 | 35734 | 30.01.22 19:17 |
| 30.01.22 09:27 | 330 | 5780 | 1900000000 | 0,303 | 0,329 | 11702 | 35599 | 30.01.22 19:20 |
| 30.01.22 09:32 | 309 | 6089 | 2000000000 | 0,324 | 0,328 | 11602 | 35323 | 30.01.22 19:21 |
| 30.01.22 09:37 | 313 | 6402 | 2100000000 | 0,319 | 0,328 | 11502 | 35065 | 30.01.22 19:22 |
| 30.01.22 09:42 | 292 | 6694 | 2200000000 | 0,342 | 0,329 | 11402 | 34693 | 30.01.22 19:20 |
| 30.01.22 09:47 | 299 | 6993 | 2300000000 | 0,334 | 0,329 | 11302 | 34363 | 30.01.22 19:20 |
| 30.01.22 09:52 | 279 | 7272 | 2400000000 | 0,358 | 0,330 | 11202 | 33942 | 30.01.22 19:18 |
| 30.01.22 09:57 | 295 | 7567 | 2500000000 | 0,339 | 0,330 | 11102 | 33604 | 30.01.22 19:17 |
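(The steady average of 0,336 million triples/s corresponds to roughly 1.2 billion input triples per hour.)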
hannahbast commented 2 years ago

Yes, indexing time is more or less linear in the number of input triples. A rule of thumb is 1 billion triples per hour on a (modern) standard PC.

Note that what you are measuring above is just the first phase of the indexing (parsing the input triples). That phase should run at around 5 billion triples per hour. There are more phases after that; the whole indexing takes about five times as long as the first phase alone.
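For the roughly 13.6 billion input triples above, that rule of thumb works out to about 13.6/5 ≈ 2.7 hours for the parsing phase at full speed and around 13.6 hours for the complete build.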

WolfgangFahl commented 2 years ago

Sounds reasonable, given that I reduced the settings by a factor of 5 to avoid hitting a memory limit. So it seems there is a tradeoff between speed and memory. Memory usage is currently at 12-15 GB, so I could increase the settings again to make better use of the available 64 GB. I think I'll start another try on the macOS machine with the updated script, where 12 cores and 64 GB of RAM are available.

hannahbast commented 2 years ago

There is no need to restart; the batch size has only a relatively small influence on total indexing time. A batch size of 10M is just fine. However, you then need to check `ulimit -Sn` on your machine (the number of files that can be opened simultaneously). The default setting on some machines is just 1024, which is not enough for Wikidata (QLever writes and reads two temporary files per batch). Set it to something like 1048576.

WolfgangFahl commented 2 years ago

I tried to find out how to modify the `ulimit -Sn` setting; it looked like a kernel parameter that would need a reboot. How do I modify that setting? https://serverfault.com/questions/216656/how-to-set-systemwide-ulimit-on-ubuntu was the article I found, and I assume the `nofile` setting is the necessary change. https://unix.stackexchange.com/questions/8945/how-can-i-increase-open-files-limit-for-all-processes mentions that two files need to be changed.
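For the record, the usual persistent, system-wide change on Ubuntu (per the linked answers; the file and syntax may vary by distribution) is a `nofile` entry in /etc/security/limits.conf, which takes effect at the next login rather than requiring a reboot:

```
# /etc/security/limits.conf: raise the open-file limit for all users
*  soft  nofile  1048576
*  hard  nofile  1048576
```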

In the script itself I am going to add:

```bash
ulimit -Sn 1048576
```
hannahbast commented 2 years ago

@WolfgangFahl It's good enough if you change it for the shell in which you are running the index builder, is it not? So adding it to your script is just fine, as long as you change it in the right spot.

WolfgangFahl commented 2 years ago

@hannahbast the script with the modification seems to work: yesterday's attempt already created more than 1024 files, but it was stopped by an unwanted server shutdown that had nothing to do with the QLever environment. This morning I started another try.

WolfgangFahl commented 2 years ago

For the https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-11 import I am using a script to trace the progress with a spreadsheet. On my Mac, Apple Numbers will happily display the formulas:

```bash
#!/bin/bash
# WF 2022-03-12
# get the relevant log information for the indexer

#
# get the relevant lines from the log
#
log() {
  local l_logfile="$1"
  egrep "Processing input triples|Input triples processed" "$l_logfile"
}
expected=16979   # expected total in million triples
echo 'day;time;mill triples;duration;mill triples/sec;todo;ETA' > stats.csv
echo ";;$expected;;" >> stats.csv
log qlever-indices/wikidata/wikidata.index-log.txt \
    | cut -f1,2,8 -d' ' \
    | sed 's/ /;/g' \
    | sed 's/   -//g' \
    | sed 's/,//g' \
    | sed 's/\.[[:digit:]]\+//g' \
| awk '
BEGIN { FS=";" }
{
  date=$1
  time=$2
  triples=$3
  # the "Processing input triples from ..." start line only yields a timestamp
  if (triples=="from")
    printf("%s;%s;;;;\n",date,time)
  else {
    # two header rows precede the data, so spreadsheet row = NR+2
    row=NR+2
    printf("%s;%s;%s;=B%d-B$3;=C%d/D%d/3600;=C$2-C%d;=F%d/E%d\n",date,time,triples/1000000,row,row,row,row,row,row)
  }
}
' >> stats.csv
cat stats.csv
# open in a spreadsheet application
open stats.csv
```
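To illustrate the emitted formulas: the two header `echo`s occupy spreadsheet rows 1-2, the `Processing input triples from` log line becomes row 3 (start timestamp only), and the first `Input triples processed` line (NR=2, hence row 4) comes out as a row of spreadsheet formulas. With the 600M log line from January as sample input, that row would be:

```
2022-01-30;08:20:50;600;=B4-B$3;=C4/D4/3600;=C$2-C4;=F4/E4
```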
WolfgangFahl commented 2 years ago

When the other phases start I intend to adapt the script.

WolfgangFahl commented 2 years ago

```bash
#!/bin/bash
# WF 2022-03-12
# get the relevant log information for the indexer
# $Header: /hd/seel/qlever/RCS/logstats,v 1.5 2022/06/28 05:43:00 wf Exp wf $

logfile=wikidata/wikidata.index-log.txt
stats=stats.csv
echo 'day;time;phase;mill triples;duration;mill triples/h;todo;ETA h' > /tmp/$stats
decimalsep=","
cat $logfile \
    | sed 's/ /;/g' \
    | sed 's/   -//g' \
    | sed 's/,//g' \
    | sed 's/\.[[:digit:]]\+//g' \
| awk -v expectedTriples=17500 -v expectedBoM=800 -v expectedUoM=3200 -v expectedConversion=28300 -v expectedWords=800 '
BEGIN {
  # field separator
  FS=";"
  # double quote
  quote="\x22"
  result="Index building in progress ..."
}
# default extraction from each line
# 2022-05-22 17:48:22.564 ...
{
  date=$1
  time=$2
}
# start of the processing phase
# 2022-05-22 17:48:22.564   - INFO:  Processing input triples from /dev/stdin ...
/Processing;input;triples;from/ {
  phase="Processing"
  printStartPhase(date,time,phase,expectedTriples)
  row=3
  next
}
# while processing
# 2022-05-23 00:09:50.846   - INFO:  Input triples processed: 17,400,000,000
/Input;triples;processed:;/ {
  triples=$8
  next
}
# start of byte order merging
# 2022-05-23 00:10:52.614   - INFO:  Merging partial vocabularies in byte order (internal only) ...
/Merging;partial;vocabularies;in;byte;order/ {
  printrow(date,time,triples,row,phase)
  phase="Byte order merging"
  printStartPhase(date,time,phase,expectedBoM)
  row=5
  next
}
/Words;merged:;/ {
  triples=$7
  next
}
/Words;processed:;/ {
  triples=$7
  next
}
/Merging;partial;vocabularies;in;Unicode;order/ {
  printrow(date,time,triples,row,phase)
  phase="Unicode order merging"
  printStartPhase(date,time,phase,expectedUoM)
  row=7
  next
}
/Converting;triples;from;local;/ {
  printrow(date,time,triples,row,phase)
  phase="Triple conversion"
  printStartPhase(date,time,phase,expectedConversion)
  row=9
}
/Triples;converted:;/ {
  triples=$7
  next
}
/Building;/ {
  printrow(date,time,triples,row,phase)
  phase="Prefix tree"
  printStartPhase(date,time,phase,expectedWords)
  row=11
  next
}
/Computing;maximally/ {
  printrow(date,time,triples,row,phase)
  phase="Compressing prefixes"
  printStartPhase(date,time,phase,expectedTriples)
  row=13
  triples=expectedTriples*1000000
  next
}
/Writing;compressed;vocabulary/ {
  printrow(date,time,triples,row,phase)
  phase="PSO/POS index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=15
  triples=expectedTriples*1000000
  next
}
/Writing;meta;data;for;PSO/ {
  printrow(date,time,triples,row,phase)
  phase="SPO/SOP index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=17
  triples=expectedTriples*1000000
  next
}
/Writing;meta;data;for;SPO/ {
  printrow(date,time,triples,row,phase)
  phase="OSP/OPS index pair"
  printStartPhase(date,time,phase,expectedTriples)
  row=19
  triples=expectedTriples*1000000
  next
}
/Index;build;completed/ {
  result="Finished Index building successfully✅"
  printrow(date,time,triples,row,phase)
  # SUMME is the German-locale spreadsheet name for SUM
  printf(";;total;;=SUMME(E$2:E%d)\n",row)
}
# print the start row of the given phase
function printStartPhase(date,time,phase,expected) {
  printf("%s;%s;%s;%d;;;\n",date,time,phase,expected)
}
# print a data row for the given phase
# (Runden is the German-locale spreadsheet name for ROUND)
function printrow(date,time,triples,row,phase) {
  printf("%s;%s;%s;%s;=(A%d+B%d)-(A%d+B%d);%s=Runden(D%d/E%d;0)%s;=D%d-D%d;%s=Runden(G%d/F%d;1)%s\n",date,time,phase,triples/1000000,row,row,row-1,row-1,quote,row,row,quote,row-1,row,quote,row,row,quote)
}
END {
  printf("%s\n", result)
}
' >> /tmp/$stats
cat /tmp/$stats | sed "s/\./$decimalsep/g" > $stats
cat stats.csv
# open in a spreadsheet application
open stats.csv
```
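Assuming the script is saved as `logstats` (the name in its RCS header) and made executable, running `./logstats` in the directory containing `wikidata/wikidata.index-log.txt` writes a per-phase `stats.csv` and opens it; note that the `open` call is macOS-specific.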
hannahbast commented 2 years ago

Revisiting this after some time. The index-building log has been much improved in the meantime, with good progress information. Also, the qlever script has been available for quite some time now.