ad-freiburg / qlever-control

Apache License 2.0

Can I make index with my ttl data? #41

Closed givemetarte closed 3 weeks ago

givemetarte commented 3 months ago

Hi, I'm new to QLever, and I have a few questions about indexing my ttl data.

In my lab, we have built about 2 billion triples related to Korean addresses, and we want to test QLever's query performance.

Questions are below:

  1. I have 196 turtle files (about 90G, not compressed); how can I index them? As far as I understand, qlever get-data downloads data from an external source, but I don't need that function since all my files are local. Do I need to compress them into one ttl.gz file?
  2. Below is my Qleverfile, but it doesn't work. I want to index all .ttl files in the INPUT_FILES path; how should I write the other settings?
# Qleverfile for hike, use with https://github.com/ad-freiburg/qlever-control
#
# qlever index
# qlever start

[data]
NAME         = hike
DESCRIPTION  = hike address data

[index]
INPUT_FILES     = /home/hike/qlever/qlever-indices/address/*.ttl
SETTINGS_JSON   = { "ascii-prefixes-only": true, "num-triples-per-batch": 10000000, "parallel-parsing" : true", "locale": { "language": "ko", "country": "KR", "ignore-punctuation": true } }
STXXL_MEMORY    = 10G

[server]
PORT               = 7001
ACCESS_TOKEN       = ${data:NAME}
MEMORY_FOR_QUERIES = 20G
CACHE_MAX_SIZE     = 10G

[runtime]
SYSTEM = docker
IMAGE  = docker.io/adfreiburg/qlever:latest

[ui]
UI_CONFIG = hike

It would be a great pleasure if you could answer!

Qup42 commented 3 months ago

You can use qlever get-data, but it's not required; you can also download and prepare the turtle files manually. To use the feature, specify a shell command in GET_DATA_CMD in the [data] section. This command should download and unpack the files into the current working directory. That way, other people who use the same Qleverfile can download the data easily.
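For illustration, a hypothetical [data] section (the URL is a placeholder, not a real endpoint):

[data]
NAME         = hike
DESCRIPTION  = hike address data
GET_DATA_CMD = curl -sL https://example.org/hike.ttl.gz | gunzip > hike.ttl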

  1. You don't need to generate one complete turtle file or even a single archive. You just have to ensure that the files you want to index are present in the directory of the Qleverfile (either via qlever get-data or manually) and then specify them in INPUT_FILES. Because you are running QLever in Docker, the working directory is mounted as /index in the container, so you have to adjust the paths accordingly, e.g. to *.ttl (see the sketch after this list).
  2. Have a look at the example Qleverfiles, and at the options of the individual commands via --help. Almost all of the options correspond to an option in the Qleverfile, though I'm afraid the names in the config files are not really documented. Finally, all the options are defined in code here.
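Concretely, for your Qleverfile the [index] section could start like this (a sketch, assuming the .ttl files lie next to the Qleverfile):

[index]
INPUT_FILES     = *.ttl
CAT_INPUT_FILES = cat ${INPUT_FILES}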

If there are still errors, please also provide the concrete error message.

givemetarte commented 3 months ago

Thanks for your response. I successfully indexed my ttl data, but I got a new error when I ran qlever start.

After qlever index, I got {host}.index-log.txt, {host}.settings.json, {host}.tmp.partial-vocabulary.*, and {host}.unsorted-triples.dat files. After running qlever start, I got the error below:

Command: start

docker run -d --restart=unless-stopped -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -p 7001:7001 -w /index --init --entrypoint bash --name qlever.server.hike docker.io/adfreiburg/qlever:latest -c 'ServerMain -i hike -j 8 -p 7001 -m 20G -c 10G -e 1G -k 200 -s 30s -a hike_v78bB7fQfEvl > hike.server-log.txt 2>&1'

Follow hike.server-log.txt until the server is ready (Ctrl-C stops following the log, but not the server)

2024-05-30 06:13:44.922 - INFO: QLever Server, compiled on Thu May 23 14:26:32 UTC 2024 using git hash f9c313
2024-05-30 06:13:44.923 - INFO: Initializing server ...
2024-05-30 06:13:44.923 - ERROR: Could not open file "hike.meta-data.json" for reading. Possible causes: The file does not exist or the permissions are insufficient. The absolute path is "/index/hike.meta-data.json".

The error message says I have no hike.meta-data.json; that file was not generated automatically after qlever index.

Qup42 commented 3 months ago

The index step didn't complete successfully. {host}.tmp.partial-vocabulary.* and {host}.unsorted-triples.dat are intermediate files that are created during indexing and deleted afterwards. In the end you should have {host}.index.*, {host}.meta-data.json, and {host}.vocabulary.* files. Try re-running the index step (qlever index --overwrite-existing) and post the index log if that does not help.
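A quick way to verify the result, assuming NAME = hike as in your Qleverfile (a hypothetical shell sketch):

qlever index --overwrite-existing
ls hike.index.* hike.vocabulary.* hike.meta-data.json    # all three patterns should match after a successful build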

givemetarte commented 3 months ago

I reindexed my test file with qlever index, and it seems that the whole index process doesn't work.

My index log is below (after qlever index; I never forcefully stopped the process):

2024-05-31 06:05:32.274 - INFO: QLever IndexBuilder, compiled on Thu May 23 14:26:32 UTC 2024 using git hash f9c313
2024-05-31 06:05:32.274 - INFO: You specified the input format: TTL
2024-05-31 06:05:32.274 - INFO: Processing input triples from /dev/stdin ...
2024-05-31 06:05:32.274 - INFO: Locale was not specified in settings file, default is en_US
2024-05-31 06:05:32.274 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-05-31 06:05:32.274 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2024-05-31 06:05:32.274 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files with a well-behaved use of newlines
2024-05-31 06:05:32.274 - INFO: You specified "num-triples-per-batch = 10,000,000", choose a lower value if the index builder runs out of memory
2024-05-31 06:05:32.274 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-05-31 06:05:32.327 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-05-31 06:06:19.858 - INFO: Triples parsed: 7,905,220 [average speed 0.2 M/s]
2024-05-31 06:06:38.983 - INFO: Number of triples created (including QLever-internal ones): 7,905,220 [may contain duplicates]
2024-05-31 06:06:38.983 - INFO: Merging partial vocabularies ...

It has not progressed since then. As you can see below, the tmp files are not removed, and the {host}.meta-data.json and {host}.index.* files are not generated. I didn't end the process abruptly, but qlever status says there are no processes. My issue seems to be similar to this one. (The test file is datahub-full-rna-*.ttl.)

(screenshot: directory listing showing the leftover tmp files)

Thanks for your kind answers 😺

Qup42 commented 3 months ago

The progress messages (Triples parsed ...) are printed every 10 million triples. Your dataset is smaller, so no such message will be printed. At this point I am out of obvious ideas for the problem's cause.

tomersagi commented 3 months ago

Jumping in here because I'm trying to do something similar. How do you make qlever index use your own Qleverfile? OK, never mind, I figured it out: qlever --qleverfile Qleverfile.myproject

hannahbast commented 3 months ago

@tomersagi Why not just call your Qleverfile Qleverfile? Then you don't need that option.

hannahbast commented 3 months ago

@givemetarte Can you provide a link to your data, so that we can check if we can reproduce the error?

You can also try the following yourself, if you are sufficiently computer-savvy: when calling cmake, change -DCMAKE_BUILD_TYPE=Release to -DCMAKE_BUILD_TYPE=RelWithDebInfo (probably sufficient) or -DCMAKE_BUILD_TYPE=Debug (will be significantly slower). Either make that change in the Dockerfile and rebuild the image, or compile the binaries natively. Then run the index build under gdb. When it reaches the point where the execution hangs, check which part of the code you are in and report back to us.
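A rough sketch of those steps (IndexBuilderMain and its flags are taken from the index command shown later in this thread; merged.ttl is only an illustrative name for the concatenated input):

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..   # instead of Release
make IndexBuilderMain

gdb --args ./IndexBuilderMain -F ttl -f - -i hike -s hike.settings.json
(gdb) run < merged.ttl
... when it hangs, press Ctrl-C, then:
(gdb) backtrace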

hannahbast commented 3 months ago

@givemetarte Here is a simpler thing you can try first: in the Qleverfile, in the value for SETTINGS_JSON, replace "parallel-parsing" : true by "parallel-parsing" : false. Also note the stray " you had after the true, but maybe that was just a copy&paste error on your part.
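Applied to the Qleverfile from the first post, the corrected line would read (stray quote removed, parallel parsing disabled):

SETTINGS_JSON   = { "ascii-prefixes-only": true, "num-triples-per-batch": 10000000, "parallel-parsing": false, "locale": { "language": "ko", "country": "KR", "ignore-punctuation": true } }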

givemetarte commented 3 months ago

@Qup42 I tried the olympics dataset, but I encountered the same issue. qlever setup-config and qlever get-data run well, but the index process ends with no {host}.index.* and no {host}.meta-data.json.

My server runs Ubuntu 20.04 with 64G memory and a 2TB hard disk, which seems to be enough for running QLever.

I tested the same indexing on my laptop (macOS) and got the same issue when indexing the olympics dataset.

givemetarte commented 3 months ago

@hannahbast I changed the SETTINGS_JSON config ("parallel-parsing" : false), but there is no change. I tested the olympics dataset on both Ubuntu and macOS, and the indexing did not complete: there are no {host}.index.* files and no {host}.meta-data.json.

I have no idea how to debug with cmake. I tried, but I don't know where to change the options, and when I git clone qlever-control, there is no CMakeLists.txt...

You can download my data here. The link is only temporary and will be gone in a week. The final Qleverfile is below:

# Qleverfile for hike, use with https://github.com/ad-freiburg/qlever-control
#
# qlever index
# qlever start

[data]
NAME         = hike
DESCRIPTION  = hike address data

[index]
INPUT_FILES     = *.ttl
CAT_INPUT_FILES = cat ${INPUT_FILES}
SETTINGS_JSON   = { "ascii-prefixes-only": true, "num-triples-per-batch": 10000000, "parallel-parsing" : false}

[server]
PORT               = 7001
ACCESS_TOKEN       = ${data:NAME}_VEjVfajs2n1C
MEMORY_FOR_QUERIES = 20G
CACHE_MAX_SIZE     = 10G

[runtime]
SYSTEM = docker
IMAGE  = docker.io/adfreiburg/qlever:latest

[ui]
UI_CONFIG = hike

hannahbast commented 3 months ago

@givemetarte I just tried it with your dataset and your Qleverfile, and it works without problems. It is also very unusual that the olympics dataset does not work. So there must be something unusual about your machine, and we have to find out what it is.

Is there anything that comes to your mind?

tomersagi commented 3 months ago

@tomersagi Why not just call your Qleverfile Qleverfile? Then you don't need that option.

That's great. It would be nice if that information were in the README or somewhere.

givemetarte commented 3 months ago

@hannahbast Thanks for testing! I've tried running the qlever indexing on three different servers. Two servers (the Ubuntu and macOS machines mentioned before) did not work, but another Ubuntu server worked well (I tested it right after your answer). I have no idea what the difference is between a server where QLever works and one where it doesn't; the Python version is the same (3.8.10)... Anyway, thanks to this, I have at least one server where QLever is working.

hannahbast commented 3 months ago

@givemetarte Thanks for the update. That is really strange and it would be great to find out what the problem is. Can you provide information about each of the two machines with Ubuntu: which Ubuntu version is it, how much RAM does the respective machine have, and which version of Docker is installed on the respective machine?

givemetarte commented 2 months ago

@hannahbast The specs of the two Ubuntu servers are below:

  1. Ubuntu 20.04, RAM 16G+32G, 2TB disk storage, Docker version 20.10.22 (where QLever works)
  2. Ubuntu 20.04, RAM 64G, 2TB disk storage, Docker version 24.0.2 (where QLever does not work)

I downgraded the Docker version, but that was not the reason; QLever is still not working...

dssib commented 1 month ago

@hannahbast I changed the SETTINGS_JSON config ("parallel-parsing" : false), but there is no change. I tested the olympics dataset on both Ubuntu and macOS, and the indexing did not complete: there are no {host}.index.* files and no {host}.meta-data.json.

Hi, for me what seems to have worked is decreasing num-triples-per-batch. After reducing it to 100K, the indexer worked. I guess there might be an unreported out-of-memory issue? The output suggests "choose a lower value if the index builder runs out of memory" and this fixed the problem, but no out-of-memory error was reported before changing it; the indexer just seemed to stop abruptly.
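In Qleverfile terms, the change amounts to something like this (a sketch of my settings; your other options may differ):

SETTINGS_JSON   = { "ascii-prefixes-only": true, "num-triples-per-batch": 100000, "parallel-parsing": false }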

givemetarte commented 3 weeks ago

Thanks for your answers. @dssib I set num-triples-per-batch to 100000 and reindexed my data (about 94.5 million triples, per the log below). Something changed, but I got a new error.

Command: index

echo '{ "ascii-prefixes-only": true, "num-triples-per-batch": 100000, "parallel-parsing" : false}' > hike.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.hike docker.io/adfreiburg/qlever:latest -c 'cat *.ttl | IndexBuilderMain -F ttl -f - -i hike -s hike.settings.json --stxxl-memory 5G | tee hike.index-log.txt'

2024-08-12 18:45:06.811 - INFO: QLever IndexBuilder, compiled on Thu May 23 14:56:04 UTC 2024 using git hash f9c313
2024-08-12 18:45:06.826 - INFO: You specified the input format: TTL
2024-08-12 18:45:06.826 - INFO: Processing input triples from /dev/stdin ...
2024-08-12 18:45:06.838 - INFO: Locale was not specified in settings file, default is en_US
2024-08-12 18:45:06.838 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-08-12 18:45:06.838 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2024-08-12 18:45:06.838 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-08-12 18:45:06.838 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-08-12 18:45:07.158 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-08-12 18:53:59.097 - INFO: Triples parsed: 94,589,734 [average speed 0.2 M/s, last batch 0.2 M/s, fastest 0.2 M/s, slowest 0.1 M/s] 
2024-08-12 18:53:59.270 - INFO: Number of triples created (including QLever-internal ones): 94,589,734 [may contain duplicates]
2024-08-12 18:53:59.271 - INFO: Merging partial vocabularies ...
2024-08-12 18:53:59.282 - INFO: Finished writing compressed internal vocabulary, size = 0 B [uncompressed = 0 B, ratio = 100%]
2024-08-12 18:53:59.762 - ERROR: ! ERROR opening file "hike.vocabulary.external.words" with mode "w" (No such file or directory)

The tmp files are not deleted, and the hike.vocabulary.external.words file is not created. Also, there is no Docker error (in docker logs). However, when I reindexed with fewer triples (about 12,300,000), it worked well. I think the merging of the partial vocabularies is not working properly, with a huge increase in memory usage.

dssib commented 3 weeks ago

I think it's indeed a memory error. On my side, to avoid it I ended up merging the source .ttl files "by hand" first (I had a few hundred of them); this avoids having to merge vocabularies after the indexing. You could try this if it's also an option on your side (see the sketch below). But perhaps the developers also have some suggestions; this one is based only on my trial and error (in the end I did succeed in getting everything indexed and running).
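A minimal sketch of the manual merge (Turtle allows repeated @prefix declarations, so plain concatenation of self-contained files is valid; merged.ttl is just an illustrative name):

cat *.ttl > merged.ttl

Then point INPUT_FILES in the Qleverfile at merged.ttl.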

givemetarte commented 3 weeks ago

@dssib Thanks for sharing your trial and error. Unfortunately, merging all the ttl files did not work for me. I still got the same error, and I'll try running QLever on another machine (with much more memory).

@hannahbast Could you recommend a proper amount of memory relative to the number of triples? What is the minimum required to create an index for about 90 million triples?

givemetarte commented 3 weeks ago

I changed the Qleverfile options to be like those for Wikidata. I ran qlever index on a Mac (M1, 16G memory, 500GB), and the indexing completed in about 25 minutes. @dssib You could try setting STXXL_MEMORY for better indexing. Thanks for all your comments!

# Qleverfile for hike, use with https://github.com/ad-freiburg/qlever-control
#
# qlever index
# qlever start

[data]
NAME         = hike
DESCRIPTION  = hike address data

[index]
INPUT_FILES     = *.ttl
CAT_INPUT_FILES = cat ${INPUT_FILES}
SETTINGS_JSON   = { "languages-internal": [], "prefixes-external": [""], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 500000, "parallel-parsing" : false}
STXXL_MEMORY    = 10G

[server]
PORT               = 7001
ACCESS_TOKEN       = ${data:NAME}_VEjVfajs2n1C
MEMORY_FOR_QUERIES = 10G
CACHE_MAX_SIZE     = 6G

[runtime]
SYSTEM = docker
IMAGE  = docker.io/adfreiburg/qlever:latest

[ui]
UI_CONFIG = hike

Command: index-stats

Breakdown of the time used for building the index, based on the timestamps for key lines in "hike.index-log.txt"

Parse input           :   10.5 min
Build vocabularies    :   11.0 min
Convert to global IDs :    0.8 min
Permutation SPO & SOP :    0.5 min
Permutation OSP & OPS :    1.0 min
Permutation PSO & POS :    0.8 min

TOTAL time            :   24.6 min

Breakdown of the space used for building the index

Files index.*         :    1.0 GB
Files vocabulary.*    :    1.2 GB

TOTAL size            :    2.2 GB