ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

Parsing triples (nquads) with docker image ae095dab7896 significantly slower #1563

Closed Stiksels closed 1 month ago

Stiksels commented 1 month ago
REPOSITORY                                                    TAG                 IMAGE ID       CREATED        SIZE
adfreiburg/qlever                                             latest              ae095dab7896   16 hours ago   764MB
adfreiburg/qlever                                             previous            32a21873ce56   7 days ago     761MB

Speed with tag `latest`:
2024-10-17 20:15:07.644 - INFO: Triples parsed: 10,000,000 [average speed 0.3 M/s, last batch 0.3 M/s, fastest 0.3 M/s, slowest 0.3 M/s]

Speed with tag `previous`:
2024-10-17 20:20:48.589 - INFO: Triples parsed: 10,000,000 [average speed 1.1 M/s, last batch 1.1 M/s, fastest 1.1 M/s, slowest 1.1 M/s]

hannahbast commented 1 month ago

@Stiksels Please set "parallel-parsing": true in your SETTINGS_JSON. The default has been changed to false. We had reasons for that, but it was still a bad idea, because most or all of the preconfigured Qleverfiles don't set parallel-parsing explicitly, and neither do most of the Qleverfiles out there.
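For reference, a sketch of where this setting would go, assuming the usual Qleverfile layout where `SETTINGS_JSON` lives in the `[index]` section (the section name and variable are taken from typical preconfigured Qleverfiles, not from this thread):

```
[index]
# Re-enable parallel parsing, which the new default turned off.
SETTINGS_JSON = { "parallel-parsing": true }
```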

Stiksels commented 1 month ago

Thanks @hannahbast, that was the problem. With parallel-parsing set to true, the parsing speed is back up at 1.0 M/s.

I see the following INFO and WARNING lines now:
2024-10-18 05:22:45.530 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files with a well-behaved use of newlines
2024-10-18 05:22:45.530 - WARN: Parallel parsing set to `true` in the `.settings.json` file; this is deprecated, please use the command-line option --parse-parallel or -p instead
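Since the warning above says the settings-file route is deprecated in favor of the `--parse-parallel` (or `-p`) command-line option, a hedged sketch of the direct invocation might look like this (the binary name and the other flags are assumptions based on a typical QLever index build, not taken from this thread; adjust to your setup):

```
# Hypothetical invocation: pass --parse-parallel directly instead of
# setting "parallel-parsing" in the settings JSON.
IndexBuilderMain -i my-index -f data.nq --parse-parallel true
```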

The issue from #1468 still remains, however: processing multiple large N-Quads files gets stuck on "merging partial vocabularies".