ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

NQ parsing: IndexBuilderMain "merging partial vocabularies" takes very long time #1468

Open Stiksels opened 3 weeks ago

Stiksels commented 3 weeks ago

Issue description: I am trying to build an index for a zipped N-Quads file (~2 million named graphs, ~140 million triples). The process has been stuck on "Merging partial vocabularies" for over 2 hours now...

Logs

2024-08-27 16:11:38.977 - INFO: QLever IndexBuilder, compiled on Tue Aug 27 06:08:21 UTC 2024 using git hash d900cd
2024-08-27 16:11:38.979 - INFO: You specified the input format: NQ
2024-08-27 16:11:38.979 - INFO: Processing input triples from /dev/stdin ...
2024-08-27 16:11:38.981 - INFO: You specified "locale = nl_BE" and "ignore-punctuation = 1"
2024-08-27 16:11:38.981 - WARN: You are using Locale settings that differ from the default language or country.
        This should work but is untested by the QLever team. If you are running into unexpected problems,
        Please make sure to also report your used locale when filing a bug report. Also note that changing the
        locale requires to completely rebuild the index
2024-08-27 16:11:38.981 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files with a well-behaved use of newlines
2024-08-27 16:11:38.981 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-08-27 16:11:38.981 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-08-27 16:11:39.133 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-08-27 16:12:06.892 - INFO: Triples parsed: 10,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:12:30.420 - INFO: Triples parsed: 20,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:12:56.159 - INFO: Triples parsed: 30,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:13:24.150 - INFO: Triples parsed: 40,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:13:51.562 - INFO: Triples parsed: 50,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:14:19.591 - INFO: Triples parsed: 60,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:14:45.409 - INFO: Triples parsed: 70,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:15:15.178 - INFO: Triples parsed: 80,000,000 [average speed 0.4 M/s, last batch 0.3 M/s, fas
2024-08-27 16:15:42.560 - INFO: Triples parsed: 90,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:16:16.543 - INFO: Triples parsed: 100,000,000 [average speed 0.4 M/s, last batch 0.3 M/s, fa
2024-08-27 16:16:44.003 - INFO: Triples parsed: 110,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fa
2024-08-27 16:17:12.634 - INFO: Triples parsed: 120,000,000 [average speed 0.4 M/s, last batch 0.3 M/s, fa
2024-08-27 16:17:40.213 - INFO: Triples parsed: 130,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fa
2024-08-27 16:18:07.096 - INFO: Triples parsed: 140,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fa
2024-08-27 16:18:12.618 - INFO: Triples parsed: 142,064,904 [average speed 0.4 M/s, last batch 0.4 M/s, fastest 0.4 M/s, slowest 0.3 M/s]
2024-08-27 16:18:12.773 - INFO: Number of triples created (including QLever-internal ones): 169,946,282 [may contain duplicates]
2024-08-27 16:18:12.774 - INFO: Merging partial vocabularies ...
Screenshot 2024-08-27 at 17 16 54
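
As a rough cross-check of the log above, the parsing phase itself was quick: the sketch below takes the timestamp of the "Parsing input triples" line and the timestamp of the final "Triples parsed" line and derives the elapsed time and throughput. The helper name `elapsed_seconds` is made up for this illustration; the numbers are copied from the log.

```python
from datetime import datetime

# Timestamp layout used by the IndexBuilder log lines above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps (hypothetical helper)."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


parse_start = "2024-08-27 16:11:39.133"  # "Parsing input triples ..." line
parse_end = "2024-08-27 16:18:12.618"    # final "Triples parsed" line
triples = 142_064_904                     # from the final "Triples parsed" line

seconds = elapsed_seconds(parse_start, parse_end)
rate = triples / seconds / 1e6            # million triples per second

print(f"parsing took {seconds:.1f} s, about {rate:.2f} M triples/s")
```

Parsing therefore accounts for only a few minutes, consistent with the "average speed 0.4 M/s" reported in the log; the hours are spent after the "Merging partial vocabularies" line.
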
joka921 commented 3 weeks ago

@Stiksels Thanks for reporting this. Can you give us access to the NQ file and the settings you used for the IndexBuilder (QLeverfile or command-line options/settings.json file), so that we can reproduce this locally?

Stiksels commented 3 weeks ago

@joka921 it eventually did work; merging the partial vocabularies took 3+ hours.

echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > uit-activiteiten-full-nq.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.uit-activiteiten-full-nq docker.io/adfreiburg/qlever:latest -c 'zcat testdata/publiq-uit-activiteiten_2024-08-23_12-30-13.nq.gz | IndexBuilderMain -F nq -f - -i uit-activiteiten-full-nq -s uit-activiteiten-full-nq.settings.json --stxxl-memory 5G | tee uit-activiteiten-full-nq.index-log.txt'

Here is the log: uit-activiteiten-full-nq.index-log.txt
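
For anyone reproducing this, the settings file from the `echo` command above can also be generated from a script, which avoids shell quoting mistakes. This is only a sketch: the keys and values are copied from the command above, while the helper name `write_settings` is an assumption made for this illustration.

```python
import json

# Same settings as in the echo command above.
settings = {
    "locale": {"language": "nl", "country": "BE", "ignore-punctuation": True},
    "ascii-prefixes-only": False,
    "num-triples-per-batch": 100000,
}


def write_settings(path: str) -> None:
    """Write the settings as JSON to `path` (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(settings, f, indent=2)


# Serialize once to show what would be written to the file.
serialized = json.dumps(settings, indent=2)
print(serialized)
```

Calling `write_settings("uit-activiteiten-full-nq.settings.json")` would produce a file equivalent to the one passed via `-s` in the docker command.
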

Stiksels commented 3 weeks ago

Some additional info, I'm running on an older MacBook Pro model:

In our cloud/K8s setup (amd64), the index build for the compressed N-Quads file took ~2h in total (faster than for the compressed N-Triples file).

I'll add a download link for the file shortly

hannahbast commented 3 weeks ago

@Stiksels Can you provide a link to the NQ file?

Stiksels commented 3 weeks ago

Hi @hannahbast, here is the download link (expires in 12h):

https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEBAaCmV1LW5vcnRoLTEiRjBEAiBoz0XRms9EuYERPrfkcoOoMni%2BmihELxOmzaT7Wh%2Bz7wIgcEA7pj%2BaDVHAFZT1DVkpwz4rCbwJu8XWTwR7yLro%2FTsq6AIIKRAAGgw2NTc3MzczOTIzMzIiDIHtMnfc7yEP%2BrJLTirFAg6ygDKWHKDepEyeGLBH1cNZZdF2uCqYS9LnixT3%2FywT1GhRuSl3CU%2FecRuYYok2Ps0Mzi0yb%2FLIjPCf19RXzRDGtXVphyFE9nmlR3dnuLb%2Byup3Ed6GwUa3B8C18U5O%2FoVSZq7c4iyftb48S92694iQew2LB7rGb2rRO3siBcClGWqVtHctUE%2FdxpVzMCt5omkdEF16xNnULYYXjuNu0tll1zdsLAzhZdw0lXjg9RBZSFPHpquGQOn6HNMjXlmz8FO6EDwwxjdDRCcf5Qwmj3IzOtbIBxrjCc8CAsiZrjv4qMfTCARiARlfDSbbWBoydeSFXQJFBtFRHnWCiejS79kTJITEtA%2Bmi9T%2BcFxC%2BbM8Icod5XlXENlIY4U3h8ednz27itIve7ruYFZt6YNR9eH%2BuotW6c%2BuEfea9wrBF%2FqPhN09tIkwsa%2B7tgY6iAIfatKk6rcVVCO4BPxFybPawxcHXeMIYqhaqszbwzSfCvf491esLVb7CdAf14DHzrCwh%2BM9G6eKSvqkjhFMv3lWuqM7KlItZBSY8u58pa9TAkDs1wnp8mi0B%2FgjfP%2FcYx30Fef5%2BOOR%2Fo0gJZaiuLQB6DAtYHfg%2BV%2BvVVi%2FbFKGHs9CZ8NF4QcwHKXmfbdLnF1yvxnd2lq7wDhVKcwW7qMkhjf6UuNeOO7BPawaPHeE8X4Xx80m5X3FpvU78aEBsP5f0TZwivS13ay7IfkBTGPTIgoH1%2FZvsvfDeoz5330KQKXEvK1pvSWRqz%2FYmiXZBoLnq2Wk136Hn3xShUUk3THpyy0TV1%2Fb5XA%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240828T092620Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAZSJBUEDGOJDPNQ5Y%2F20240828%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=8419105cd8b15a933bc3ad9ac2ccdcadad53478c3284d8b9a6528932e8576cef
Stiksels commented 3 weeks ago

Following up on this: I ran qlever index for this file on a new MacBook Pro M3 / 18GB and the entire process took ~5 min.

uit-activiteiten-full-nq.index-log (1).txt

Not sure if it's necessary or high prio to support older devices? I put in a request for a new laptop 😂

hannahbast commented 2 weeks ago

@Stiksels It's not necessarily about the age of the computer, but about the version of the compiler and maybe the operating system. The merging of the vocabularies handles many files using many threads. It seems that with older compilers and/or older operating systems, the machine code produced does something crazily non-optimal. We haven't figured out exactly what yet.
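
To illustrate what a step like this involves, here is a minimal single-threaded sketch of a k-way merge of sorted partial vocabularies into one global vocabulary. It is not QLever's implementation (which is written in C++, works on files, and uses multiple threads); the function name `merge_partial_vocabularies`, the in-memory lists, and the plain string ordering (rather than the locale-aware ordering from the settings) are all assumptions made for this illustration.

```python
import heapq
from itertools import groupby


def merge_partial_vocabularies(partials):
    """Merge sorted word lists into one vocabulary with global IDs.

    `partials` is a list of individually sorted lists, one per batch.
    Returns a dict mapping each distinct word to its rank in the merged order.
    """
    # heapq.merge lazily combines the already-sorted inputs into one sorted stream.
    merged = heapq.merge(*partials)
    # groupby collapses runs of equal words, so duplicates get a single ID.
    return {word: idx for idx, (word, _) in enumerate(groupby(merged))}


partials = [
    ["<a>", "<c>"],
    ["<b>", "<c>"],
    ["<a>", "<d>"],
]
vocab = merge_partial_vocabularies(partials)
print(vocab)
```

With roughly 170 million triples and 100,000 triples per batch, there are on the order of 1,700 partial vocabularies to merge, which is why this phase touches many files at once.
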

hannahbast commented 2 weeks ago

> Hi @hannahbast, here is the download link (expires in 12h): [same link as posted above]

I tried the link and got

Connecting to qlever-backups.s3.eu-west-1.amazonaws.com (qlever-backups.s3.eu-west-1.amazonaws.com)|52.218.122.50|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-09-02 06:59:31 ERROR 403: Forbidden.
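
The 403 is what one would expect from an expired presigned S3 URL: the link carries its signing time in `X-Amz-Date` and its lifetime in seconds in `X-Amz-Expires`. The sketch below reads both from a shortened copy of the link (only the two relevant query parameters are kept) and computes the expiry; the helper name `presigned_url_expiry` is made up for this illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url: str) -> datetime:
    """Expiry time (UTC) of a SigV4 presigned URL (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))


# Shortened copy of the link from this thread: only the two parameters
# needed here are kept, the token and signature are omitted.
url = (
    "https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz"
    "?X-Amz-Date=20240828T092620Z&X-Amz-Expires=43200"
)
expiry = presigned_url_expiry(url)
print(expiry.isoformat())
```

The link was signed on 2024-08-28 with a 12-hour lifetime, so it had long expired by the time of the download attempt on 2024-09-02. Links signed with temporary credentials may stop working even earlier, once the session token itself expires.
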
Stiksels commented 2 weeks ago

@hannahbast can you try again with this new link (expires at 22h35 Brussels time):

https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEIn%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCWV1LXdlc3QtMSJHMEUCIG4nac1WtrbNQv8Unm0OFIjEZhn5SbBDSn9osFEQBx%2B3AiEA3TqffWEGMi9Rfu6GcLSjVfitBsFkF%2FPMwt4LDVzqOkMq8QIIov%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgw2NTc3MzczOTIzMzIiDBZO%2FFOwnOb8wCTljCrFAnKKM5sdIEuQQCJt5IM6xD5PT8YpZlRjuFXpxvxK0w8%2B65kVbkOYcUcA7JANc2tfCKs4EifgWyGk2NVGRuNVRPCWRf1k4QCvPkrHdLISA%2BlFfblWM4islmcy0MvicMsLSzhFxYW5jQFZxkJ%2Figu8HqP9ITQR3RlsAfOpz1zMXbO3bzd%2F0%2FqDQWXAftSngLXHN8MV%2F9npzSSXprTPel8W3ecFBdHf57okh2ecs1JlatWcHRi33IhdCemGqTumc86bl1%2Ft5hhChSSNapg49L%2B6iE5AzfqUkApOlQIll%2FbI2n0Rhytz6Ko8V2tFGSz8p4ipv8MREy2SbCMXTGMNlb2rzrH5cF0isI3CrZYNFYqS1%2BXvL%2BEtAywTmMJOK9zqBKFAtwZ8%2BtVwT9wyFZ6VHgFLcNqMn5Fg9TM0A11bcNc0XdPn4Y6v%2Btgw4fHVtgY6hwL3arbpzKgU%2BgcBKkQlFLotVuBWfLdnxehpW9W98E3EdSblcopiM%2FJBygRmIQNprTdwk7%2FrCzQlFtjt%2Bd2OlaAKglwzZu3CMtZNa24D%2FcKzOi%2F4S0ybsn9EJyXzrap6YmpZO1HKKPG%2Fm1P3rNldrwzYTP3Oynk3EgRkVtFazAjjDS5V6dd%2BteWjDBbhcqLVZVjOLFWs%2BOyVdcg9itxdEICB21GMOwGi3EoSEn8mYQ%2BcozyspDF5HvWEydNBMiUSasTUmtIrP4WQ79RQ1UABWyrzgiTmyBaANoHIbFsSRFBRVIDVmsc3t3d6123Mi%2FQ2Vkm%2BI5iproOiyCpLlFKNg9LW2tg2OxoYbA%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240902T083606Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAZSJBUEDGB5T5MNG2%2F20240902%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=e1c89a13866d371980cd3e742e91837b91519f4da2fc9751c1bcac9d8ced5b9e