Basic scan relationships limit query takes a very long time (as long as it takes to scan the whole database)

I am playing around with the wikidata-lexemes.ttl file, which contains ~60M triples (20M resource and 40M literal) and is a ~12GB database stored on disk. I am running the following query:

match (s:wklx_r)-[p]->(o) return s.iri, p.iri, o.iri, o.val limit 10; Which takes ~12 seconds. In an ideal world, this query should run instantaneously but it's not that trivial. This is an interesting case. The current plan does the following steps:

Read all s (resource) nodes and then scans their relationships, so both _rt and _lt relationship tables. That is 60 triples. Then join these with their p.iris from the Resource node table (which is a table with 20M nodes). The result table is also 60M in length. Then join these with the o.iri and o.val values. Then it applies LIMIT 10. In a very smart system, we could do the following:

First push the LIMIT to scan of s (resource) nodes and then scans their relationships. So scan only 10 relationships. Do the step 2 above but push the ids of the joining nodes as semijoin filters. Do the step 3 above but push the ids of the joining relationships as semijoin filters. I am not sure if we can get to this very intelligent optimization at least without very explicit rules just for this case. It seems to require a lot of reasoning but if we swapped the order of the joins, so first joined with o.iri and o.val, then we should be able to push the LIMIT down to after this first hash join and then use a semijoin filter when joining with p.iris. The reason is that we can reason that p.iri hash join cannot remove tuples. So every tuple after the 1st (swapped) hashjoin has to join with a successful p.iri. Therefore the LIMIT can be pushed.

Note: If I remove the p.iri from the projection the time reduces to 8.7s. If I remove o.val it's still around 8s. If I remove o.iri as well then it is instantenous.

Summary: Query involving scanning relationships takes a long time due to inefficient processing and lack of optimization.

Possible Solution

Based on the provided information, the issue is related to the performance of a specific query on the Kùzu graph database. The query takes approximately 12 seconds to execute, which is considered slow given the context. The current execution plan reads all s (resource) nodes, scans _rt and _lt relationship tables, joins p.iri from the Resource node table, joins o.iri and o.val values, and finally applies a LIMIT 10. The suggested optimization steps include pushing the LIMIT to the scan of s nodes and their relationships and using semijoin filters for joining nodes and relationships.

To address the issue, the following solution should be implemented:

Modify the query execution engine to apply the LIMIT clause as early as possible in the execution plan. This means that the LIMIT should be pushed down to the initial scan of s nodes and their relationships, effectively reducing the number of tuples processed in subsequent steps.
Implement semijoin filters to optimize the join operations. Semijoins can reduce the size of intermediate results by filtering out non-matching tuples early in the join process.
Review the query optimizer to ensure that it is capable of recognizing when these optimizations can be applied and to automatically adjust the execution plan accordingly.
If the query optimizer is not currently capable of these optimizations, enhance its capabilities by adding rules or heuristics that identify when a LIMIT can be pushed down and when semijoin filters can be used effectively.
Test the optimized query execution with various projections (removing p.iri, o.val, and o.iri) to measure the performance impact and ensure that the optimizer's decisions lead to actual performance improvements.

The provided code snippets do not contain direct implementations of the query execution engine or the optimizer for the Kùzu graph database. Therefore, specific code changes cannot be recommended without access to the relevant parts of the codebase. However, the general approach outlined above should guide the necessary modifications to the query execution engine and optimizer components.

Code snippets to check

test/optimizer/optimizer_test.cpp

Contains tests for the optimizer, which is relevant to the issue of optimizing the query execution plan to push down the LIMIT clause and apply semi-join filters.
src/binder/bind/copy/bind_copy_rdf_graph.cpp

Contains logic for binding RDF graph copy operations, which may be relevant to the issue if the query execution involves scanning RDF triples and could benefit from optimization.

Mayil-AI-Sandbox / kuzudb_jan15

Basic scan relationships limit query takes a very long time (as long as it takes to scan the whole database) #62

Possible Solution

Code snippets to check