apache / age

Graph database optimized for fast analysis and real-time data processing. It is provided as an extension to PostgreSQL.
https://age.apache.org
Apache License 2.0
3.16k stars 413 forks source link

MATCH Query Always Using Sequential Scan, Ignoring Index Scan #2137

Open pritish-moharir opened 1 day ago

pritish-moharir commented 1 day ago

Hi everyone,

I'm encountering an issue with Apache AGE where a complex MATCH query always defaults to using a sequential scan, even though indexes exist on the queried columns. Disabling sequential scans via SET enable_seqscan=off has no effect, and the query plan i.e. explain analyze output continues to show a sequential scan.

Query Example:

Here's a simplified version of the query we are using:

SELECT * FROM cypher('graph_name', $$  
MATCH (n1:NodeType1)  
WHERE n1.attribute1 = '<value1>'  
  AND n1.attribute2 IN ('<value2>')  
WITH n1  
OPTIONAL MATCH (n1)-[:RelType1_NodeType1]-(n2:NodeType2)  
WITH n1, n2  
OPTIONAL MATCH (n1)-[:RelType2_NodeType1]-(n3:NodeType3)  
WITH n1, n2, n3  
OPTIONAL MATCH (n1)-[:RelType3_NodeType1]-(n4:NodeType4)  
RETURN DISTINCT n1 AS Node1, n2 AS Node2, n3 AS Node3, n4 AS Node4  
$$) AS (result_column agtype);

Data Setup:

We have populated the graph with data using queries like the following :

SELECT * FROM cypher('graph_name', $$  
MERGE (n:NodeType1 {key1: "value1"})  
SET n.property1 = "value1",  
    n.property2 = "value2",  
    n.property3 = "value3",   
    ....
$$) AS (result_column agtype); 

Problem:

The query plan indicates that a sequential scan is being used on NodeType1 and other nodes, despite indexes being present on attribute1 and attribute2. For performance, we expect the query to utilize the indexes for an index scan.

Observed Behavior:

The query consistently uses sequential scans. Setting _enableseqscan = off doesn't change the behavior.

Expected Behavior:

The query should leverage the indexes on NodeType1.attribute1 and NodeType1.attribute2 to perform an index scan.

Environment Details:

We are running a containerised apache age docker image on k8s. Apache AGE version: release_PG16_1.5.0 PostgreSQL version: 16 K8S Version: v1.29.6

What We've Tried:

Questions:

Why does the MATCH query ignore available indexes and use sequential scans? Are there any specific configurations or query optimizations required to enable index scans in Apache AGE for graph queries? Could this be a limitation or a bug in Apache AGE? Any help or insights from the community would be greatly appreciated!

If additional information, logs, or examples are needed, please let me know.

Thank you!

MuhammadTahaNaveed commented 1 day ago

@pritish-moharir

You have to create index on the expression used by age to access a certain property.

issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1)  
WHERE n1.name = 'node1'  
RETURN n1
$$) AS (result agtype);
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on "NodeType1" n1
   Filter: (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]) = '"node1"'::agtype)
(2 rows)
issue_2137=# CREATE INDEX idx_btree_name
ON test."NodeType1"
USING btree (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]));
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1)  
WHERE n1.name = 'node1'  
RETURN n1                                                                          
$$) AS (result agtype);
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_name on "NodeType1" n1
   Index Cond: (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]) = '"node1"'::agtype)
(2 rows)

If you want to utilize gin index, you need to use the filter like MATCH (n {att1: 'value1'}), since containment operator in where clause is not supported as of now. Below is an example:

issue_2137=# CREATE INDEX idx_gin                                                        
ON test."NodeType1"
USING gin (properties);
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1 {name: "Node1"}) 
RETURN n1
$$) AS (result agtype);
                           QUERY PLAN                            
-----------------------------------------------------------------
 Bitmap Heap Scan on "NodeType1" n1
   Recheck Cond: (properties @> '{"name": "Node1"}'::agtype)
   ->  Bitmap Index Scan on idx_gin
         Index Cond: (properties @> '{"name": "Node1"}'::agtype)
(4 rows)

Also, querying undirected paths can really slow down the performance, so consider using directed paths in MATCH clause wherever possible. I hope this helps.

neerajx86 commented 19 hours ago

@MuhammadTahaNaveed Thanks for the reply, I believe having this documented would help the community.

Also, is there a documentation for agtype_ functions as well?

pritish-moharir commented 15 hours ago

Hey @MuhammadTahaNaveed thank you for the quick response, we were able to successfully create and use index scan for one of the label properties for ex "NodeType1".name in this case.

A few questions we still have are :

  1. The other labels like "NodeType2", "NodeType3" are still running sequential scans, so we were wondering if there is a way to index on the joins that are happening?

  2. One of the queries we are using is like below:

SELECT * FROM cypher('graph_name', $$  
MATCH (n1:NodeType1)  
WHERE n1.attribute1 = '<value1>'  
  AND n1.attribute2 IN ['<value2>']  
WITH n1  
OPTIONAL MATCH (n1)-[:RelType1_NodeType1*1..1]-(n2:NodeType2)  
WITH n1, n2  
OPTIONAL MATCH (n1)-[:RelType2_NodeType1*1..1]-(n3:NodeType3)  
WITH n1, n2, n3  
OPTIONAL MATCH (n1)-[:RelType3_NodeType1*1..1]-(n4:NodeType4)  
WITH n1, n2, n3, n4  
OPTIONAL MATCH (n2)-[:RelType4_NodeType2*1..1]-(n5:NodeType5)  
WITH n1, n2, n3, n4, n5  
OPTIONAL MATCH (n3)-[:RelType5_NodeType3*1..1]-(n6:NodeType6)  
WITH n1, n2, n3, n4, n5, n6  
OPTIONAL MATCH (n5)-[:RelType6_NodeType5*1..1]-(n7:NodeType7)  
WITH n1, n2, n3, n4, n5, n6, n7  
WHERE n7.attribute3 = '<value3>'  
  AND n6.attribute4 = '<value4>'  
  AND n4.attribute5 = '<value5>'  
WITH DISTINCT(n1) AS n1, n1.attributeKey AS n1_key  
ORDER BY n1.attributeKey SKIP 0 LIMIT 10  
OPTIONAL MATCH (n1)-[:RelType7_NodeType1*1..1]-(n8:NodeType8)  
WITH DISTINCT n8 AS n8, n1, n1.attributeKey AS n1_key, n8.attributeKey AS n8_key  
ORDER BY n1.attributeKey, n8.attributeKey SKIP 0 LIMIT 10  
OPTIONAL MATCH (n1)-[:RelType8_NodeType1*1..1]-(n9:NodeType9)  
WITH DISTINCT n9 AS n9, n1, n8, n1.attributeKey AS n1_key, n8.attributeKey AS n8_key, n9.attributeKey AS n9_key  
ORDER BY n1.attributeKey, n8.attributeKey, n9.attributeKey SKIP 0 LIMIT 10  
OPTIONAL MATCH (n1)-[:RelType9_NodeType1*1..1]-(n10:NodeType10)-[:RelType10_NodeType10*1..1]-(n11:NodeType11)  
WITH DISTINCT n11 AS n11, n1, n8, n9, n1.attributeKey AS n1_key, n8.attributeKey AS n8_key, n9.attributeKey AS n9_key, n11.attributeKey AS n11_key  
ORDER BY n1.attributeKey, n8.attributeKey, n9.attributeKey, n11.attributeKey SKIP 0 LIMIT 10  
RETURN DISTINCT n1 AS NodeType1, n8 AS NodeType8, n9 AS NodeType9, n11 AS NodeType11  
$$) AS (n1 agtype, n8 agtype, n9 agtype, n11 agtype);

Its a bit complex but we were able to run the index scan for NodeType1 attribute1 and attribute2, but even after creating index on the other NodeType attributes which are also queried, they are running sequential scan. Is there something we are missing, do you have any suggestions as to what fields can be indexed here? I apologise if this is a very straightforward question but any inputs will be really helpful!

Please let me know if you need any additional information. Thanks!

MuhammadTahaNaveed commented 7 hours ago

@pritish-moharir You need to create index on id column of node label, and index on start_id and end_id of edge label. PR #2117 does that by default btw.

issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH p=(:NodeType1)-[:RelType1_NodeType1]->(:NodeType2) 
RETURN p
$$) AS (result agtype);
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Hash Join
   Hash Cond: (_age_default_alias_1.end_id = _age_default_alias_2.id)
   ->  Hash Join
         Hash Cond: (_age_default_alias_1.start_id = _age_default_alias_0.id)
         ->  Seq Scan on "RelType1_NodeType1" _age_default_alias_1
         ->  Hash
               ->  Seq Scan on "NodeType1" _age_default_alias_0
   ->  Hash
         ->  Seq Scan on "NodeType2" _age_default_alias_2
(9 rows)
issue_2137=# CREATE UNIQUE INDEX idx_n1_id ON test."NodeType1" USING btree (id);
CREATE INDEX
issue_2137=# CREATE UNIQUE INDEX idx_n2_id ON test."NodeType2" USING btree (id);
CREATE INDEX
issue_2137=# CREATE INDEX idx_r1_id ON test."RelType1_NodeType1" USING btree (start_id, end_id);
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH p=(:NodeType1)-[:RelType1_NodeType1]->(:NodeType2) 
RETURN p
$$) AS (result agtype);
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Hash Join
   Hash Cond: (_age_default_alias_1.end_id = _age_default_alias_2.id)
   ->  Merge Join
         Merge Cond: (_age_default_alias_0.id = _age_default_alias_1.start_id)
         ->  Index Scan using idx_n1_id on "NodeType1" _age_default_alias_0
         ->  Index Scan using idx_r1_id on "RelType1_NodeType1" _age_default_alias_1
   ->  Hash
         ->  Index Scan using idx_n2_id on "NodeType2" _age_default_alias_2
(8 rows)

Also, I would like to reiterate that querying undirected paths can really slow down the performance, so consider using directed paths in MATCH clauses wherever possible.

@neerajx86 The agtype_ functions are primarily intended for internal use. You can find a list of available functions in the documentation. The documentation is a bit outdated but covers most of the available functions.