Mayil-AI-Sandbox / kuzudb_jan15

MIT License
0 stars 0 forks source link

Basic scan relationships limit query takes a very long time (as long as it takes to scan the whole database) #62

Open meido opened 4 months ago

meido commented 4 months ago

I am playing around with the wikidata-lexemes.ttl file, which contains ~60M triples (20M resource and 40M literal) and is a ~12GB database stored on disk. I am running the following query:

match (s:wklx_r)-[p]->(o) return s.iri, p.iri, o.iri, o.val limit 10; Which takes ~12 seconds. In an ideal world, this query should run instantaneously but it's not that trivial. This is an interesting case. The current plan does the following steps:

Read all s (resource) nodes and then scans their relationships, so both _rt and _lt relationship tables. That is 60 triples. Then join these with their p.iris from the Resource node table (which is a table with 20M nodes). The result table is also 60M in length. Then join these with the o.iri and o.val values. Then it applies LIMIT 10. In a very smart system, we could do the following:

First push the LIMIT to scan of s (resource) nodes and then scans their relationships. So scan only 10 relationships. Do the step 2 above but push the ids of the joining nodes as semijoin filters. Do the step 3 above but push the ids of the joining relationships as semijoin filters. I am not sure if we can get to this very intelligent optimization at least without very explicit rules just for this case. It seems to require a lot of reasoning but if we swapped the order of the joins, so first joined with o.iri and o.val, then we should be able to push the LIMIT down to after this first hash join and then use a semijoin filter when joining with p.iris. The reason is that we can reason that p.iri hash join cannot remove tuples. So every tuple after the 1st (swapped) hashjoin has to join with a successful p.iri. Therefore the LIMIT can be pushed.

Note: If I remove the p.iri from the projection the time reduces to 8.7s. If I remove o.val it's still around 8s. If I remove o.iri as well then it is instantenous.

mayil-ai[bot] commented 4 months ago

Summary: Query involving scanning relationships takes a long time due to inefficient processing and lack of optimization.

Possible Solution

Based on the provided information, the issue is related to the performance of a specific query on the Kùzu graph database. The query takes approximately 12 seconds to execute, which is considered slow given the context. The current execution plan reads all s (resource) nodes, scans _rt and _lt relationship tables, joins p.iri from the Resource node table, joins o.iri and o.val values, and finally applies a LIMIT 10. The suggested optimization steps include pushing the LIMIT to the scan of s nodes and their relationships and using semijoin filters for joining nodes and relationships.

To address the issue, the following solution should be implemented:

The provided code snippets do not contain direct implementations of the query execution engine or the optimizer for the Kùzu graph database. Therefore, specific code changes cannot be recommended without access to the relevant parts of the codebase. However, the general approach outlined above should guide the necessary modifications to the query execution engine and optimizer components.

Code snippets to check