Closed vladiksun closed 1 year ago
There is a patch in the works that "may" address this. But, others are welcome to look into it.
@jrgemignani Is there a branch for the patch you mentioned ?
@vladiksun The patch I'm referring to isn't in a branch atm. Currently, it is under review for completeness.
@MuhammadTahaNaveed @Zainab-Saad Could you look into whether the patch (Josh's) you are reviewing/reworking might address this or not?
@MuhammadTahaNaveed @Zainab-Saad Could you look into whether the patch (Josh's) you are reviewing/reworking might address this or not?
We checked and it doesnt seem that patch solves this issue.
@vladiksun We are still looking into this, it's been a busy past month.
edited: removed. I realized it wasn't what I thought it was. The index, apparently isn't used until there is a threshold reached.
@vladiksun I was able to get an index working for WHERE -
psql-15.3-5432-pgsql=# select
any_profile
from ag_catalog.cypher('test_graph',$$
EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
WHERE any_profile.pet = 'dog'
RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on profile any_profile (cost=53.67..57.70 rows=1 width=32) (actual time=69.444..425.885 rows=33347 loops=1)
Recheck Cond: ((agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"
'::agtype]) = '"dog"'::agtype) AND (properties @> agtype_build_map('hidden'::text, 'false'::agtype)))
Heap Blocks: exact=2858
-> BitmapAnd (cost=53.67..53.67 rows=1 width=0) (actual time=68.563..68.564 rows=0 loops=1)
-> Bitmap Index Scan on profile_pet_btree_idx5 (cost=0.00..23.92 rows=1000 width=0) (actual time=38.343..38.343 rows=6666
6 loops=1)
Index Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties
), '"pet"'::agtype]) = '"dog"'::agtype)
-> Bitmap Index Scan on profile_gin_idx (cost=0.00..29.50 rows=200 width=0) (actual time=29.980..29.980 rows=99961 loops=
1)
Index Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
Planning Time: 1.457 ms
Execution Time: 430.522 ms
(10 rows)
I still need to do other tests to verify that the changes don't cause any side effects. There are 2 issues here -
profile_pet_btree_idx3
was not correct. But, without being able to create it, you wouldn't know.@jrgemignani thank you for taking a look. Indeed without creating a functional index there is no way to check if that actuality might work. If you were able to create such an index, one thing seems suspicious to me. With an index the actual execution time is longer than without it. But this probably depends on the setup. For basic queries like searching the vertex by its property we decided not to use Cypher calls, but instead usual SQL queries with functional indexes around properties values. It is much faster and we can use native FTS this way. I believe this is possible because agtype is a superset of a jsonb hence we can use it like jsonb. But for traversal queries the index should help reduce the filtering time of incoming or outgoing vertices. For example queries with scenarios - find all outgoing vertices with label 'A' from the start vertex label 'B' where some 'A''s property IN [value1, value2]. I assume with more hops involved ('A's could be chained in a hierarchy) such queries take longer without index on a specific property.
@vladiksun There are 2 things that will cause a difference in execution time. The first is that I modified your for loop to 100000. The second is that I ran it locally on a small server of mine. Tomorrow I will try to give you before and after times with your original query.
@vladiksun Here are the results using your original query.
Before adding 3rd index
psql-15.3-5432-pgsql=# select
any_profile
from ag_catalog.cypher('test_graph',$$
EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
WHERE any_profile.pet = 'dog'
RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
------------------------
Bitmap Heap Scan on profile any_profile (cost=20.08..52.40 rows=1 width=32) (actual time=1.763..47.451 rows=1721 loops=1)
Recheck Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
Filter: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agty
pe]) = '"dog"'::agtype)
Rows Removed by Filter: 3308
Heap Blocks: exact=143
-> Bitmap Index Scan on profile_gin_idx (cost=0.00..20.08 rows=10 width=0) (actual time=1.653..1.653 rows=5029 loops=1)
Index Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
Planning Time: 2.109 ms
Execution Time: 47.805 ms
(9 rows)
Adding 3rd index
psql-15.3-5432-pgsql=# CREATE INDEX profile_pet_btree_idx3 ON test_graph."profile" USING BTREE (
agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agtype]));
CREATE INDEX
After adding 3rd index
psql-15.3-5432-pgsql=# select
any_profile
from ag_catalog.cypher('test_graph',$$
EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
WHERE any_profile.pet = 'dog'
RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on profile any_profile (cost=24.99..29.02 rows=1 width=32) (actual time=5.618..24.466 rows=1721 loops=1)
Recheck Cond: ((agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"
'::agtype]) = '"dog"'::agtype) AND (properties @> agtype_build_map('hidden'::text, 'false'::agtype)))
Heap Blocks: exact=143
-> BitmapAnd (cost=24.99..24.99 rows=1 width=0) (actual time=5.534..5.534 rows=0 loops=1)
-> Bitmap Index Scan on profile_pet_btree_idx3 (cost=0.00..4.66 rows=50 width=0) (actual time=4.081..4.081 rows=3416 loop
s=1)
Index Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties
), '"pet"'::agtype]) = '"dog"'::agtype)
-> Bitmap Index Scan on profile_gin_idx (cost=0.00..20.08 rows=10 width=0) (actual time=1.439..1.439 rows=5029 loops=1)
Index Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
Planning Time: 1.436 ms
Execution Time: 24.784 ms
(10 rows)
psql-15.3-5432-pgsql=#
@jrgemignani this looks promising, thank you. By the way can you get rid of using label id while creating the index definition? As I recall we've got this whole function call from the execution plan as it is. I assume this might involve changing the main logic so the code doesn't use label ID either. But when used with liquebase in the kubernetes environment this means we should precalculate the label id before we use it which is not convinient for schema initialization scripts. Ideally the index creation scripts should use predefined data if possible.
@vladiksun In the following query -
select
any_profile
from ag_catalog.cypher('test_graph',$$
EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
WHERE any_profile.pet = 'dog'
RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
The cypher component -
MATCH (any_profile:`profile` { hidden: false })
WHERE any_profile.pet = 'dog'
RETURN any_profile;
any_profile is a vertex
In your first index below, I'm not sure if this has been set up. We will need to debug this specific index to find out how it is resolved.
-- does not work in the where clause CREATE INDEX profile_pet_btree_idx1 ON test_graph."profile" USING BTREE ((properties -> 'pet'));
In your second index below,
agtype_access_operator
aka.
understands how to work with whatever value is sent to it. Theproperties()
function isn't used asagtype_access_operator
has to be able to accept any object or array. So, any.
operation just takes the variable used.-- does not work in the where clause CREATE INDEX profile_pet_btree_idx2 ON test_graph."profile" USING BTREE (ag_catalog.agtype_access_operator(properties, '"pet"'::ag_catalog.agtype));
That is why this index worked -
CREATE INDEX profile_pet_btree_idx3 ON test_graph."profile" USING BTREE (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agtype]));
I should note, this is after adjusting the volatility flag for the functions used in the index.
Our team needs to look into changing the volatility flag for the functions used in this index and similar ones for edges. Part of the problem originally was that some wanted to blanketly change that flag without bothering to truly understand if it can be changed. Changing the flag obviously works but does it pose other issues?
For you, as a temporary fix, all you need to do is change the volatility flag of the functions involved in age--1.3.0.sql
and then rebuild the extension. Or, possibly, alter their entries through sql.
We will be looking into the issue with the volatility flag to see if it can be updated with a patch, sometime this week.
Hopefully, this is helpful.
By the way can you get rid of using label id while creating the index definition?
((agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet" '::agtype]) = '"dog"'::agtype) AND (properties @> agtype_build_map('hidden'::text, 'false'::agtype)))
It might be possible to reduce the first into something similar to the second. However, we would have to look into the logic of how this is being resolved and passed.
Right now, our main focus has been on completing the cypher specification, making the logic more consistent, fixing bugs, and bringing Apache AGE to PostgreSQL version 15. Once we get to PostgreSQL version 15 we will have more resources to look more deeply into other issues, like indexes. Btw, I'm not trying to sound dismissive of your issue(s), we just have very limited resources qualified to focus on issues like these, at this moment.
15 should be ready in about a week and hopefully we can add in fixes for the volatility flag with it.
@vladiksun updated comment above
@vladiksun I created PR #1133 to address the volatility flags that blocks the index from being created on the WHERE clause. When this patch is merged, it will only be in the master branch - until it is pushed to the other branches.
I am also looking at removing _label_name
function call, but that may not be possible or worthwhile.
@vladiksun The PR has been applied to the master branch. So, your indexes on the WHERE should work now, provided you use the master branch.
@vladiksun For the _label_name
function. While the function could be eliminated, the numeric value passed (which I'm assuming is what is problematic for you) wouldn't be. If I were to eliminate it altogether, then the index wouldn't likely have anything to key on.
Btw, the way a label name is stored and retrieved is via the graph's OID (which graph the vertex or edge belongs to) and a part of the vertex or edge ID (the ID contains a unique ID plus the ID of the label table it belongs to). A vertex or edge is not stored with its label name. That is why it needs to be constructed to build in memory representation.
@jrgemignani thank you. We will try this fix any time soon. As for the graph OID I believe we could get this OID by graph name from apache age tables and dynamically construct index creation scripts.
@jrgemignani unfortunately we were not able to test it because master branch does not compile for some reason. We usualy build release distributions inside docker based on postgres docker images.
Probably something was broken recently because about a month ago master was compilable.
Created a ticket for this: https://github.com/apache/age/issues/1140
@jrgemignani Looks like master moved to postgres 15, we compiled the master and started checking. https://github.com/apache/age/issues/1140 closed.
@vladiksun Yes, master just moved to PostgreSQL version 15 on Friday. If you prefer to use a different version, there are branches for each PG11, PG12, PG13, & PG14.
@jrgemignani indexes work fine for cypher equality operator but we were not able to use the index for Cypher IN operator.
For example:
ALTER FUNCTION agtype_in_operator IMMUTABLE;
CREATE INDEX profile_pet_btree_check_in_operator ON test_graph."profile" USING BTREE (
ag_catalog.agtype_in_operator('["dog", "cat"]'::agtype, agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('17580'::oid, id), properties), '"pet"'::agtype]))
);
The query:
select
any_profile
from ag_catalog.cypher('test_graph',$$
EXPLAIN ANALYZE
MATCH (any_profile:`profile` { hidden: false })
WHERE any_profile.pet IN ['dog', 'cat']
RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
For some reason the optimizer does not involve the index even if we try to multiple data ten times.
Is it safe to change the volatility flag for agtype_in_operator
function ?
QUERY PLAN |
---|
Gather (cost=1000.00..11311.91 rows=133 width=32) (actual time=0.616..345.966 rows=133994 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Seq Scan on profile any_profile (cost=0.00..10298.61 rows=55 width=32) (actual time=0.254..331.683 rows=44665 loops=3) |
Filter: ((properties @> agtype_build_map('hidden'::text, 'false'::agtype)) AND agtype_in_operator('["dog", "cat"]'::agtype, agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('17580'::oid, id), properties), '"pet"'::agtype]))) |
Rows Removed by Filter: 88669 |
Planning Time: 0.295 ms |
Execution Time: 350.271 ms |
@vladiksun I will look into this tomorrow.
@vladiksun Unfortunately, this one is a bit more complex and will take some time to understand why the index won't take. From what I've seen, the nodes are of two different types: FuncExpr and Var. Because of this, it won't match. I need to understand why they are that way.
edit: It may also be an issue with a support function not being available.
@vladiksun It looks like, in order to support an index on the IN operator, there needs to be a "support" function. This is in addition to the IMMUTABLE flag.
@jrgemignani Does it mean it requires additional development ?
@vladiksun From what I've found, I believe so. Every time I debug the index -
psql-15.3-5432-pgsql=# CREATE INDEX profile_pet_btree_check_in_operator ON test_graph."profile" USING BTREE (
agtype_in_operator('["dog", "cat"]'::agtype, agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agtype])));
the function that creates the index paths - create_index_paths
- returns NULL partly because there isn't a support function defined for it. These functions are defined in pg_proc
and assigned under the prosupport
column.
Assuming a support function would help here, we would need to research how to create one; the documentation is rather limited and so are the examples.
It is possible there is another path that could create the index. However, this particular path is the one that matches but returns NULL due to the lack of a support function.
Of course, this is all after changing the volatility flag to IMMUTABLE for agtype_in_operator
.
Right now, I have to turn my attention to the Apache release process for AGE for the next few weeks. So, I won't be able to give this issue as much time until after the releases are out.
Hope this is helpful.
@vladiksun I'll be looking into the implementation of the support function for b-tree indexing. I've been looking at how indexing works in AGE for quite a bit on and off. Hopefully we can fix this up soon. Thank you for communicating with us~
@dehowef thanks for taking a look at this.
@vladiksun I have been looking into how B-tree indexing works and indexing in general, and I came across documentation that shows that B-tree indexes rely on operators (=,<,>,<=,>=) to work. Since the IN operator filter as shown by EXPLAIN ANALYZE is a function call, I feel that this might contribute to why the index is not being used.
The agtype operator class is defined in the age--x.x.x.sql file, maybe looking there might offer further insight. Just thought I'd share what I've been looking into with you. Hope this might be helpful.
@dehowef Thank you for paying attention to this. Right now, the "==" operator also involves functional calls, so the only way to kick out the index is to reverse engineer the execution plan and extract the function itself in order to create the index.
Ideally, the WHERE should not use functional indexes at all, like MATCH does not for GIN indexes, for two reasons:
@vladiksun As it stands, a decent portion of AGE's logic is implemented with functions as you can see 😅
With this knowledge in mind, I think it's worth having a discussion with the community on this and how it relates to indexing and what considerations should be made in light of these discoveries.
Were there other instances where these functional indexes have been giving you a hard time?
@dehowef I believe for now only IN operator. But thanks to @jrgemignani the fix for "=" operator is already in version 1.4.0
@vladiksun I tried recreating the 'IN' query using the 'ANY' operator, but in SQL (instead of Cypher).
EXPLAIN ANALYZE
SELECT * FROM test_graph.profile
WHERE
agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('49745'::oid, id), properties), '"pet"'::agtype])
= ANY(ARRAY['"dog"'::agtype, '"cat"'::agtype])
AND properties @> agtype_build_map('hidden'::text, 'false'::agtype);
I get this query plan:
Bitmap Heap Scan on profile (cost=9.32..148.87 rows=1 width=40) (actual time=13.369..59.118 rows=3326 loops=1)
Recheck Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('49745'::oid, id), properties), '"pet"'::agtype]) = ANY ('{"\"dog\"","\"cat\""}'::agtype[]))
Filter: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
Rows Removed by Filter: 3296
Heap Blocks: exact=143
-> Bitmap Index Scan on profile_pet_btree_idx31 (cost=0.00..9.32 rows=100 width=0) (actual time=13.189..13.191 rows=6622 loops=1)
Index Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('49745'::oid, id), properties), '"pet"'::agtype]) = ANY ('{"\"dog\"","\"cat\""}'::agtype[]))
Planning Time: 0.667 ms
Execution Time: 59.693 ms
(9 rows)
Let me know if this SQL query is what you are looking for and if it uses index on your machine. We may have a patch that transforms IN into an ANY. We can work on integrating that into master.
@rafsun42 thank you. We will check this next week and let you know.
@rafsun42 we checked. It works on my machine. This could be a good solution
@vladiksun I have another PR to enhance performance and indexes #1232
Also, I want to point out that, for whatever reason, your particular dataset and query does best without using a GIN index. I've checked our GIN index components and they are largely the same as PG's JSONB GIN components, so I'm thinking it's your dataset that just doesn't work well with a GIN index. I did find that a BTREE index, by itself, works best for your dataset. Not a BitmapAnd of the GIN and BTREE, which fell between GIN and BTREE performance gains.
On my server, the above PR cut the query time, for your queries above, in half without using any indexes. Adding the GIN index only gave a very minor improvement. The BTREE index (*_idx2) cut the query time in half again.
@jrgemignani Thanks for sharing the PR. I have built that branch to see how it could affect our typical queries. It turns out that for our use cases, we do not see much difference. Usually, our typical queries are like this:
select properties from ag_catalog.cypher('test_graph',
$$
//EXPLAIN ANALYZE
MATCH (start_vertex:`scope` {hidden: false})
WHERE start_vertex.name = 'scope_1'
OR start_vertex.name = 'scope_2'
OR start_vertex.name = 'scope_3'
OR start_vertex.name = 'scope_4'
OR start_vertex.name = 'scope_5'
OR start_vertex.name = 'scope_6'
RETURN {
properties: properties(start_vertex),
___edge_reserved: properties(null)
}
UNION
MATCH (from:`scope` {hidden: false})
-[e1:`edges`*0..5 {hidden: false}]->(to1:`role` {hidden: false})
-[e2:`edges`*1..1 {hidden: false, fromCollection: 'role', toCollection: 'permission'}]->(to2:`permission` {hidden: false})
-[e3:`edges`*1..1 {hidden: false, fromCollection: 'permission', toCollection: 'resource'}]->(to3:`resource` {hidden: false})
WHERE from.name = 'scope_1'
OR from.name = 'scope_2'
OR from.name = 'scope_3'
OR from.name = 'scope_4'
OR from.name = 'scope_5'
OR from.name = 'scope_6'
WITH
collect({ properties: properties(to1), ___edge_reserved: properties(last(e1)) })
+ collect({ properties: properties(to2), ___edge_reserved: properties(last(e2)) })
+ collect({ properties: properties(to3), ___edge_reserved: properties(last(e3)) }) as result
UNWIND result as result_out
RETURN DISTINCT result_out
LIMIT 1000
$$) as (properties ag_catalog.agtype);
As you can see we use only one table (e_label) for edges and structures like
{hidden: false, fromCollection: 'role', toCollection: 'permission'}
improve the execution time because there is a call to a GIN index that cuts unnecessary edges. That also helps simplify backend logic.
So, such a query:
select properties from ag_catalog.cypher('test_graph',
$$
//EXPLAIN ANALYZE
MATCH (from:`domain` {hidden: false})-[any_edge:`edges`*1..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'managed'}]->(to:`domain` {hidden: false, id: 'domain_root_1'})
WITH from, any_edge, to, head(any_edge) AS edge_check_in, head(any_edge) AS edge_result_in
RETURN DISTINCT {
properties: properties(from),
___edge_reserved: properties(edge_result_in)
}
LIMIT 1000
$$) as (properties ag_catalog.agtype);
works even faster than the version with lots of edge tables.
select properties from ag_catalog.cypher('test_graph',
$$
//EXPLAIN ANALYZE
MATCH (from:`domain` {hidden: false})-[any_edge:`E_DOMAIN_MANAGED_DOMAIN`*1..10 {hidden: false}]->(to:`domain` {hidden: false, id: 'domain_root_1'})
WITH from, any_edge, to, head(any_edge) AS edge_check_in, head(any_edge) AS edge_result_in
RETURN DISTINCT {
properties: properties(from),
___edge_reserved: properties(edge_result_in)
}
LIMIT 1000
$$) as (properties ag_catalog.agtype);
```
@vladiksun We should probably look into these later queries separately. My tests were only against your query at the top of this thread and only on my server. If the issue with using WHERE (unable to create an index or index isn't working) is resolved? we should probably close this thread and create one for performance improvements and link this thread.
Thoughts?
@jrgemignani I believe @rafsun42 also suggested the solution for IN operator. I can move the issue with "IN" operator to another ticket. Let me know if it is convenient to do so or leave this issue open until the solution for "IN" operator is implemented.
@vladiksun We probably should keep the issues separate but link them. Really long threads can lose their readability, especially if they diverge too much. Also, having a long thread with many issues gives the impression that it is just one issue that is being worked on, when it can be many more.
@jrgemignani, @rafsun42 ok I will create a new issue for the proposed solution for IN operator and link it to the this one. Thank you.
@vladiksun Yes, @rafsun42 and @dehowef have both been working on PRs for improving WHERE performance. Hopefully, we'll have them all out by the end of the week.
Describe the bug
Index is not used in the WHERE clause.
How are you accessing AGE (Command line, driver, etc.)?
What data setup do we need to do?
What is the command that caused the error?
Expected behavior Either of two indexes should be used
Environment (please complete the following information):
Additional context
As we see from the plan the filter function is applied as: \
Filter: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('17463'::oid, id), properties), '"pet"'::agtype]) = '"dog"'::agtype)
We also tried to simulate the functional index like this with no luck because there is a mutable function involved: \
Could it be possible to add an index support for the WHERE clause at least via any functional index by not involving mutable functions ?