ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

docker index command stop #1357

Open fbelleau opened 3 months ago

fbelleau commented 3 months ago

I am able to index two .nt files individually, but when I concatenate them, indexing stops without any message.

Command: index

echo '{ "ascii-prefixes-only": false, "num-triples-per-batch": 1000 }' > olympics.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.olympics docker.io/adfreiburg/qlever:latest -c 'cat flymine-*.nt | IndexBuilderMain -F ttl -f - -i olympics -s olympics.settings.json --stxxl-memory 5G | tee olympics.index-log.txt'

2024-06-01 06:26:41.550 - INFO: QLever IndexBuilder, compiled on Tue Apr  2 19:02:03 UTC 2024 using git hash 25449d
2024-06-01 06:26:41.552 - INFO: You specified the input format: TTL
2024-06-01 06:26:41.552 - INFO: Processing input triples from /dev/stdin ...
2024-06-01 06:26:41.553 - INFO: Locale was not specified in settings file, default is en_US
2024-06-01 06:26:41.553 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-06-01 06:26:41.554 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files that don't include multiline literals with unescaped newline characters and that have newline characters after the end of triples.
2024-06-01 06:26:41.554 - INFO: You specified "num-triples-per-batch = 1,000", choose a lower value if the index builder runs out of memory
2024-06-01 06:26:41.554 - INFO: Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2024-06-01 06:28:17.973 - INFO: Done, total number of triples read: 11,359,551 [may contain duplicates]
2024-06-01 06:28:17.974 - INFO: Number of QLever-internal triples created: 11,359,551 [may contain duplicates]
2024-06-01 06:28:17.974 - INFO: Merging partial vocabularies ...

hannahbast commented 3 months ago

@fbelleau That is strange, can you send a link to these two .nt files? (Here or by mail if you don't want the link to appear on a public website)

fbelleau commented 3 months ago

I freed memory on the server, and the job then completed successfully. It had previously crashed because there was not enough memory available.

How much memory do you think is needed to index 50 GB of N-Triples files?

fbelleau commented 3 months ago

Adding memory solved the problem.
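
For anyone hitting the same silent stop: in `IndexBuilderMain ... | tee log`, the exit status of the pipeline is that of `tee`, so a killed index builder goes unnoticed. Below is a minimal bash sketch that keeps the status of the first command. The helper name `run_logged` is made up for illustration and is not part of QLever. A status of 137 (128 + 9) means the process received SIGKILL, which is what the Linux out-of-memory killer sends, although other causes are possible.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of QLever): run a command, copy its output
# to a log file via tee, and return the command's own exit status instead
# of tee's. Requires bash because it relies on the PIPESTATUS array.
run_logged() {
  local log_file status
  log_file="$1"
  shift
  "$@" 2>&1 | tee "$log_file"
  # PIPESTATUS must be read immediately after the pipeline.
  status="${PIPESTATUS[0]}"
  if [ "$status" -gt 128 ]; then
    # By shell convention, 128 + N means "terminated by signal N".
    echo "command terminated by signal $((status - 128))" >&2
  fi
  return "$status"
}
```

Inside the container this could be used as `cat flymine-*.nt | run_logged olympics.index-log.txt IndexBuilderMain ...` with the same arguments as above; since `run_logged` is the last pipeline stage, its return value becomes the status of the whole `bash -c` command. Setting `set -o pipefail` before the original pipeline is a simpler alternative if only a non-zero status is needed.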

fbelleau commented 3 months ago

@hannahbast

Now I have the same problem with a larger file, and memory does not seem to be the problem.

echo '{ "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > flymine-object.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.flymine-object docker.io/adfreiburg/qlever:latest -c 'ulimit -Sn 1048576; cat ./data/xa* | IndexBuilderMain -F ttl -f - -i flymine-object -s flymine-object.settings.json --stxxl-memory 5G | tee flymine-object.index-log.txt'

2024-06-06 15:52:46.688 - INFO: QLever IndexBuilder, compiled on Tue Apr  2 19:02:03 UTC 2024 using git hash 25449d
2024-06-06 15:52:46.688 - INFO: You specified the input format: TTL
2024-06-06 15:52:46.688 - INFO: Processing input triples from /dev/stdin ...
2024-06-06 15:52:46.690 - INFO: Locale was not specified in settings file, default is en_US
2024-06-06 15:52:46.690 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-06-06 15:52:46.691 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files that don't include multiline literals with unescaped newline characters and that have newline characters after the end of triples.
2024-06-06 15:52:46.691 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-06-06 15:52:46.691 - INFO: Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2024-06-06 15:58:09.398 - INFO: Input triples processed: 100,000,000
2024-06-06 16:03:30.491 - INFO: Input triples processed: 200,000,000
2024-06-06 16:09:28.228 - INFO: Done, total number of triples read: 291,005,465 [may contain duplicates]
2024-06-06 16:09:28.230 - INFO: Number of QLever-internal triples created: 291,005,465 [may contain duplicates]
2024-06-06 16:09:28.230 - INFO: Merging partial vocabularies ...
2024-06-06 16:14:05.367 - INFO: Finished writing compressed external vocabulary, size = 0 B [uncompressed = 0 B, ratio = 100%]
2024-06-06 16:14:06.669 - INFO: Finished writing compressed internal vocabulary, size = 777.9 MB [uncompressed = 3.6 GB, ratio = 21%]
2024-06-06 16:14:06.733 - INFO: Number of words in external vocabulary: 80,929,190
2024-06-06 16:14:06.734 - INFO: Removing temporary files ...
2024-06-06 16:14:07.231 - INFO: Converting triples from local IDs to global IDs ...
2024-06-06 16:14:25.092 - INFO: Triples converted: 100,000,000
2024-06-06 16:14:40.650 - INFO: Triples converted: 200,000,000
2024-06-06 16:14:54.726 - INFO: Done, total number of triples converted: 291,005,465
2024-06-06 16:14:54.774 - INFO: Creating a pair of index permutations ...
2024-06-06 16:15:30.379 - INFO: Triples processed: 100,000,000
2024-06-06 16:15:59.980 - INFO: Triples processed: 200,000,000
2024-06-06 16:16:25.258 - INFO: Number of unique elements: 291,005,465
2024-06-06 16:16:27.963 - INFO: Statistics for SPO: #relations = 34,758,461, #blocks = 6,209, #triples = 291,005,465
2024-06-06 16:16:27.967 - INFO: Statistics for SOP: #relations = 34,758,461, #blocks = 6,209, #triples = 291,005,465
2024-06-06 16:16:27.968 - INFO: Writing meta data for SPO and SOP ...
2024-06-06 16:16:27.982 - INFO: Number of distinct patterns: 170
2024-06-06 16:16:27.982 - INFO: Number of subjects with pattern: 34,758,461 [all]
2024-06-06 16:16:27.982 - INFO: Total number of distinct subject-predicate pairs: 291,005,465
2024-06-06 16:16:27.982 - INFO: Average number of predicates per subject: 8.4
2024-06-06 16:16:27.984 - INFO: Average number of subjects per predicate: 2,852,995
2024-06-06 16:16:28.076 - INFO: Creating a pair of index permutations ...
2024-06-06 16:17:24.113 - INFO: Triples processed: 100,000,000

I am working on a server with 8 GB of RAM and 4 cores.

The file I am indexing is here:

https://huggingface.co/datasets/bio2rdf/flymine_nt/tree/main

There is also a copy of my Qleverfile there. I use the qlever Python command.