Open meido opened 4 months ago
Summary: Query involving scanning relationships takes a long time due to inefficient processing and lack of optimization.
Based on the provided information, the issue is related to the performance of a specific query on the Kùzu graph database. The query takes approximately 12 seconds to execute, which is considered slow given the context. The current execution plan reads all s
(resource) nodes, scans _rt
and _lt
relationship tables, joins p.iri
from the Resource node table, joins o.iri
and o.val
values, and finally applies a LIMIT 10
. The suggested optimization steps include pushing the LIMIT
to the scan of s
nodes and their relationships and using semijoin filters for joining nodes and relationships.
To address the issue, the following solution should be implemented:
LIMIT
clause as early as possible in the execution plan. This means that the LIMIT
should be pushed down to the initial scan of s
nodes and their relationships, effectively reducing the number of tuples processed in subsequent steps.LIMIT
can be pushed down and when semijoin filters can be used effectively.p.iri
, o.val
, and o.iri
) to measure the performance impact and ensure that the optimizer's decisions lead to actual performance improvements.The provided code snippets do not contain direct implementations of the query execution engine or the optimizer for the Kùzu graph database. Therefore, specific code changes cannot be recommended without access to the relevant parts of the codebase. However, the general approach outlined above should guide the necessary modifications to the query execution engine and optimizer components.
test/optimizer/optimizer_test.cpp
Contains tests for the optimizer, which is relevant to the issue of optimizing the query execution plan to push down the LIMIT clause and apply semi-join filters.
src/binder/bind/copy/bind_copy_rdf_graph.cpp
Contains logic for binding RDF graph copy operations, which may be relevant to the issue if the query execution involves scanning RDF triples and could benefit from optimization.
I am playing around with the wikidata-lexemes.ttl file, which contains ~60M triples (20M resource and 40M literal) and is a ~12GB database stored on disk. I am running the following query:
match (s:wklx_r)-[p]->(o) return s.iri, p.iri, o.iri, o.val limit 10; Which takes ~12 seconds. In an ideal world, this query should run instantaneously but it's not that trivial. This is an interesting case. The current plan does the following steps:
Read all s (resource) nodes and then scans their relationships, so both _rt and _lt relationship tables. That is 60 triples. Then join these with their p.iris from the Resource node table (which is a table with 20M nodes). The result table is also 60M in length. Then join these with the o.iri and o.val values. Then it applies LIMIT 10. In a very smart system, we could do the following:
First push the LIMIT to scan of s (resource) nodes and then scans their relationships. So scan only 10 relationships. Do the step 2 above but push the ids of the joining nodes as semijoin filters. Do the step 3 above but push the ids of the joining relationships as semijoin filters. I am not sure if we can get to this very intelligent optimization at least without very explicit rules just for this case. It seems to require a lot of reasoning but if we swapped the order of the joins, so first joined with o.iri and o.val, then we should be able to push the LIMIT down to after this first hash join and then use a semijoin filter when joining with p.iris. The reason is that we can reason that p.iri hash join cannot remove tuples. So every tuple after the 1st (swapped) hashjoin has to join with a successful p.iri. Therefore the LIMIT can be pushed.
Note: If I remove the p.iri from the projection the time reduces to 8.7s. If I remove o.val it's still around 8s. If I remove o.iri as well then it is instantenous.