Open Stiksels opened 3 weeks ago
@Stiksels Thanks for reporting this. Can you give us access to the NQ file and the settings you used for the IndexBuilder (Qleverfile, or command-line options and settings.json file), so that we can reproduce this locally?
@joka921 It eventually did work; the merging of the partial vocabularies took 3+ hours.
echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > uit-activiteiten-full-nq.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.uit-activiteiten-full-nq docker.io/adfreiburg/qlever:latest -c 'zcat testdata/publiq-uit-activiteiten_2024-08-23_12-30-13.nq.gz | IndexBuilderMain -F nq -f - -i uit-activiteiten-full-nq -s uit-activiteiten-full-nq.settings.json --stxxl-memory 5G | tee uit-activiteiten-full-nq.index-log.txt'
Here is the log: uit-activiteiten-full-nq.index-log.txt
Some additional info, I'm running on an older MacBook Pro model:
In our cloud/K8s setup (amd64), the index build for the compressed N-Quads file took ~2h in total (faster than for the compressed N-Triples file).
I'll add a download link for the file shortly.
@Stiksels Can you provide a link to the NQ file?
Hi @hannahbast, here is the download link (expires after 12h):
https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEBAaCmV1LW5vcnRoLTEiRjBEAiBoz0XRms9EuYERPrfkcoOoMni%2BmihELxOmzaT7Wh%2Bz7wIgcEA7pj%2BaDVHAFZT1DVkpwz4rCbwJu8XWTwR7yLro%2FTsq6AIIKRAAGgw2NTc3MzczOTIzMzIiDIHtMnfc7yEP%2BrJLTirFAg6ygDKWHKDepEyeGLBH1cNZZdF2uCqYS9LnixT3%2FywT1GhRuSl3CU%2FecRuYYok2Ps0Mzi0yb%2FLIjPCf19RXzRDGtXVphyFE9nmlR3dnuLb%2Byup3Ed6GwUa3B8C18U5O%2FoVSZq7c4iyftb48S92694iQew2LB7rGb2rRO3siBcClGWqVtHctUE%2FdxpVzMCt5omkdEF16xNnULYYXjuNu0tll1zdsLAzhZdw0lXjg9RBZSFPHpquGQOn6HNMjXlmz8FO6EDwwxjdDRCcf5Qwmj3IzOtbIBxrjCc8CAsiZrjv4qMfTCARiARlfDSbbWBoydeSFXQJFBtFRHnWCiejS79kTJITEtA%2Bmi9T%2BcFxC%2BbM8Icod5XlXENlIY4U3h8ednz27itIve7ruYFZt6YNR9eH%2BuotW6c%2BuEfea9wrBF%2FqPhN09tIkwsa%2B7tgY6iAIfatKk6rcVVCO4BPxFybPawxcHXeMIYqhaqszbwzSfCvf491esLVb7CdAf14DHzrCwh%2BM9G6eKSvqkjhFMv3lWuqM7KlItZBSY8u58pa9TAkDs1wnp8mi0B%2FgjfP%2FcYx30Fef5%2BOOR%2Fo0gJZaiuLQB6DAtYHfg%2BV%2BvVVi%2FbFKGHs9CZ8NF4QcwHKXmfbdLnF1yvxnd2lq7wDhVKcwW7qMkhjf6UuNeOO7BPawaPHeE8X4Xx80m5X3FpvU78aEBsP5f0TZwivS13ay7IfkBTGPTIgoH1%2FZvsvfDeoz5330KQKXEvK1pvSWRqz%2FYmiXZBoLnq2Wk136Hn3xShUUk3THpyy0TV1%2Fb5XA%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240828T092620Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAZSJBUEDGOJDPNQ5Y%2F20240828%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=8419105cd8b15a933bc3ad9ac2ccdcadad53478c3284d8b9a6528932e8576cef
Following up on this: I ran qlever index for this file on a new MacBook Pro M3 / 18GB and the entire process took ~5min.
uit-activiteiten-full-nq.index-log (1).txt
Not sure if it's necessary or high prio to support older devices? I put in a request for a new laptop 😂
@Stiksels It's not necessarily about the age of the computer, but about the version of the compiler and maybe the operating system. The merging of the vocabularies handles many files using many threads. It seems that with older compilers and/or older operating systems, the machine code produced does something crazily non-optimal. We haven't figured out exactly what yet.
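For readers unfamiliar with this step, here is a minimal single-threaded sketch of what merging sorted partial vocabularies means in general. It is not QLever's implementation (which is written in C++, works on files, and uses multiple threads); the function name `merge_partial_vocabularies` and the in-memory lists are assumptions made purely for illustration.

```python
import heapq
from typing import Iterable, Iterator


def merge_partial_vocabularies(
    partials: Iterable[Iterable[str]],
) -> Iterator[tuple[int, str]]:
    """K-way merge of already sorted word lists into one global vocabulary.

    Hypothetical helper for illustration only. Each input must be sorted;
    duplicates across inputs are collapsed and every distinct word gets a
    consecutive global ID.
    """
    previous = None
    next_id = 0
    # heapq.merge lazily yields the smallest remaining word across all inputs.
    for word in heapq.merge(*partials):
        if word != previous:
            yield next_id, word
            next_id += 1
            previous = word


# Three small "partial vocabularies", each sorted on its own.
partials = [["<a>", "<c>"], ["<b>", "<c>", "<d>"], ["<a>", "<e>"]]
merged = list(merge_partial_vocabularies(partials))
print(merged)
```

With one partial vocabulary per batch, the number of inputs grows as the batch size shrinks, so the merge has to juggle more files at once.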
I tried the link and got
Connecting to qlever-backups.s3.eu-west-1.amazonaws.com (qlever-backups.s3.eu-west-1.amazonaws.com)|52.218.122.50|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-09-02 06:59:31 ERROR 403: Forbidden.
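The 403 is consistent with the presigned link having expired by then: the URL carries its signing time in X-Amz-Date and its lifetime in seconds in X-Amz-Expires. A small sketch for checking this, using a shortened copy of the URL with only those two parameters; the helper name `presigned_url_expiry` is made up for this example, and the wget timestamp is assumed to be UTC (the conclusion does not depend on the time zone, since the gap is several days).

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url: str) -> datetime:
    """Return the UTC time at which a SigV4 presigned URL stops working."""
    query = parse_qs(urlsplit(url).query)
    # X-Amz-Date is the signing time in ISO 8601 basic format, always UTC.
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity period in seconds.
    lifetime = timedelta(seconds=int(query["X-Amz-Expires"][0]))
    return signed_at + lifetime


# Shortened URL: only the two parameters needed for the calculation.
url = (
    "https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz"
    "?X-Amz-Date=20240828T092620Z&X-Amz-Expires=43200"
)
expiry = presigned_url_expiry(url)

# Timestamp of the failed download attempt, assumed to be UTC.
attempt = datetime(2024, 9, 2, 6, 59, 31, tzinfo=timezone.utc)

print(expiry.isoformat())  # 2024-08-28T21:26:20+00:00
print(attempt > expiry)    # True
```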
@hannahbast can you try again with this new link (expires at 22h35 Brussels time):
https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEIn%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCWV1LXdlc3QtMSJHMEUCIG4nac1WtrbNQv8Unm0OFIjEZhn5SbBDSn9osFEQBx%2B3AiEA3TqffWEGMi9Rfu6GcLSjVfitBsFkF%2FPMwt4LDVzqOkMq8QIIov%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgw2NTc3MzczOTIzMzIiDBZO%2FFOwnOb8wCTljCrFAnKKM5sdIEuQQCJt5IM6xD5PT8YpZlRjuFXpxvxK0w8%2B65kVbkOYcUcA7JANc2tfCKs4EifgWyGk2NVGRuNVRPCWRf1k4QCvPkrHdLISA%2BlFfblWM4islmcy0MvicMsLSzhFxYW5jQFZxkJ%2Figu8HqP9ITQR3RlsAfOpz1zMXbO3bzd%2F0%2FqDQWXAftSngLXHN8MV%2F9npzSSXprTPel8W3ecFBdHf57okh2ecs1JlatWcHRi33IhdCemGqTumc86bl1%2Ft5hhChSSNapg49L%2B6iE5AzfqUkApOlQIll%2FbI2n0Rhytz6Ko8V2tFGSz8p4ipv8MREy2SbCMXTGMNlb2rzrH5cF0isI3CrZYNFYqS1%2BXvL%2BEtAywTmMJOK9zqBKFAtwZ8%2BtVwT9wyFZ6VHgFLcNqMn5Fg9TM0A11bcNc0XdPn4Y6v%2Btgw4fHVtgY6hwL3arbpzKgU%2BgcBKkQlFLotVuBWfLdnxehpW9W98E3EdSblcopiM%2FJBygRmIQNprTdwk7%2FrCzQlFtjt%2Bd2OlaAKglwzZu3CMtZNa24D%2FcKzOi%2F4S0ybsn9EJyXzrap6YmpZO1HKKPG%2Fm1P3rNldrwzYTP3Oynk3EgRkVtFazAjjDS5V6dd%2BteWjDBbhcqLVZVjOLFWs%2BOyVdcg9itxdEICB21GMOwGi3EoSEn8mYQ%2BcozyspDF5HvWEydNBMiUSasTUmtIrP4WQ79RQ1UABWyrzgiTmyBaANoHIbFsSRFBRVIDVmsc3t3d6123Mi%2FQ2Vkm%2BI5iproOiyCpLlFKNg9LW2tg2OxoYbA%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240902T083606Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAZSJBUEDGB5T5MNG2%2F20240902%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=e1c89a13866d371980cd3e742e91837b91519f4da2fc9751c1bcac9d8ced5b9e
Issue description
Trying to build an index for a gzipped N-Quads file (~2 million named graphs, ~140 million triples). The process has been stuck on "Merging partial vocabularies" for over 2 hours now...
Logs