apache / age

Graph database optimized for fast analysis and real-time data processing. It is provided as an extension to PostgreSQL.
https://age.apache.org
Apache License 2.0
3.12k stars 411 forks source link

Index is not used in the WHERE clause #1000

Closed vladiksun closed 1 year ago

vladiksun commented 1 year ago

Describe the bug

Index is not used in the WHERE clause.

How are you accessing AGE (Command line, driver, etc.)?

What data setup do we need to do?

LOAD 'age';
SET search_path = ag_catalog, "$user", public;

select ag_catalog.create_graph('test_graph');
select ag_catalog.create_vlabel('test_graph','profile');

-- works only for MATCH because GIN only makes sense for pattern matching what MATCH clause is
CREATE INDEX profile_gin_idx ON test_graph."profile" USING GIN (properties);

-- does not work in the where clause
CREATE INDEX profile_pet_btree_idx1 ON test_graph."profile" USING BTREE ((properties -> 'pet'));

-- does not work in the where clause
CREATE INDEX profile_pet_btree_idx2 ON test_graph."profile" USING BTREE (ag_catalog.agtype_access_operator(properties, '"pet"'::ag_catalog.agtype));

------ generate more data -------
DO
$do$
    BEGIN
        FOR i IN 1..10000 LOOP
            EXECUTE format('
                select * from ag_catalog.cypher(''test_graph'',
                $$
                    CREATE (any_vertex: profile { `id`: "%s", `pet`: "%s", `hidden`: %s })
                    RETURN any_vertex
                $$
            ) as (any_vertex ag_catalog.agtype)',
                   (SELECT uuid_in(md5(random()::text || now()::text)::cstring)),
                   (SELECT ('[0:2]={dog,cat,bird}'::text[])[floor(random()*3)]),
                   (SELECT ('[0:1]={true,false}'::text[])[floor(random()*2)])
                );
        END LOOP;
    END
$do$;

What is the command that caused the error?

select
    any_profile
from ag_catalog.cypher('test_graph',$$
    EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
    WHERE any_profile.pet = 'dog'
    RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
QUERY PLAN
Bitmap Heap Scan on profile any_profile (cost=20.08..52.40 rows=1 width=32) (actual time=1.111..22.752 rows=1616 loops=1)
Recheck Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
Filter: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('17463'::oid, id), properties), '"pet"'::agtype]) = '"dog"'::agtype)
Rows Removed by Filter: 3384
Heap Blocks: exact=143
-> Bitmap Index Scan on profile_gin_idx (cost=0.00..20.08 rows=10 width=0) (actual time=1.010..1.010 rows=5000 loops=1)
Index Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
Planning Time: 0.733 ms
Execution Time: 22.869 ms

Expected behavior Either of two indexes should be used

Environment (please complete the following information):

Additional context

As we see from the plan the filter function is applied as: \ Filter: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('17463'::oid, id), properties), '"pet"'::agtype]) = '"dog"'::agtype)

We also tried to simulate the functional index like this with no luck because there is a mutable function involved: \

CREATE INDEX profile_pet_btree_idx3 ON test_graph."profile" USING BTREE (
ag_catalog.agtype_access_operator(VARIADIC ARRAY[age_properties(_agtype_build_vertex(id, _label_name('16945'::oid, id), properties)), '"pet"'::ag_catalog.agtype])
);

Could it be possible to add an index support for the WHERE clause at least via any functional index by not involving mutable functions ?

jrgemignani commented 1 year ago

There is a patch in the works that "may" address this. But, others are welcome to look into it.

vladiksun commented 1 year ago

@jrgemignani Is there a branch for the patch you mentioned ?

jrgemignani commented 1 year ago

@vladiksun The patch I'm referring to isn't in a branch atm. Currently, it is under review for completeness.

jrgemignani commented 1 year ago

@MuhammadTahaNaveed @Zainab-Saad Could you look into whether the patch (Josh's) you are reviewing/reworking might address this or not?

MuhammadTahaNaveed commented 1 year ago

@MuhammadTahaNaveed @Zainab-Saad Could you look into whether the patch (Josh's) you are reviewing/reworking might address this or not?

We checked and it doesnt seem that patch solves this issue.

jrgemignani commented 1 year ago

@vladiksun We are still looking into this, it's been a busy past month.

jrgemignani commented 1 year ago

edited: removed. I realized it wasn't what I thought it was. The index, apparently isn't used until there is a threshold reached.

jrgemignani commented 1 year ago

@vladiksun I was able to get an index working for WHERE -

psql-15.3-5432-pgsql=# select
    any_profile
from ag_catalog.cypher('test_graph',$$
    EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
    WHERE any_profile.pet = 'dog'
    RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
                                                                                                                QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on profile any_profile  (cost=53.67..57.70 rows=1 width=32) (actual time=69.444..425.885 rows=33347 loops=1)
   Recheck Cond: ((agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"
'::agtype]) = '"dog"'::agtype) AND (properties @> agtype_build_map('hidden'::text, 'false'::agtype)))
   Heap Blocks: exact=2858
   ->  BitmapAnd  (cost=53.67..53.67 rows=1 width=0) (actual time=68.563..68.564 rows=0 loops=1)
         ->  Bitmap Index Scan on profile_pet_btree_idx5  (cost=0.00..23.92 rows=1000 width=0) (actual time=38.343..38.343 rows=6666
6 loops=1)
               Index Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties
), '"pet"'::agtype]) = '"dog"'::agtype)
         ->  Bitmap Index Scan on profile_gin_idx  (cost=0.00..29.50 rows=200 width=0) (actual time=29.980..29.980 rows=99961 loops=
1)
               Index Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
 Planning Time: 1.457 ms
 Execution Time: 430.522 ms
(10 rows)

I still need to do other tests to verify that the changes don't cause any side effects. There are 2 issues here -

vladiksun commented 1 year ago

@jrgemignani thank you for taking a look. Indeed without creating a functional index there is no way to check if that actuality might work. If you were able to create such an index, one thing seems suspicious to me. With an index the actual execution time is longer than without it. But this probably depends on the setup. For basic queries like searching the vertex by its property we decided not to use Cypher calls, but instead usual SQL queries with functional indexes around properties values. It is much faster and we can use native FTS this way. I believe this is possible because agtype is a superset of a jsonb hence we can use it like jsonb. But for traversal queries the index should help reduce the filtering time of incoming or outgoing vertices. For example queries with scenarios - find all outgoing vertices with label 'A' from the start vertex label 'B' where some 'A''s property IN [value1, value2]. I assume with more hops involved ('A's could be chained in a hierarchy) such queries take longer without index on a specific property.

jrgemignani commented 1 year ago

@vladiksun There are 2 things that will cause a difference in execution time. The first is that I modified your for loop to 100000. The second is that I ran it locally on a small server of mine. Tomorrow I will try to give you before and after times with your original query.

jrgemignani commented 1 year ago

@vladiksun Here are the results using your original query.

Before adding 3rd index

psql-15.3-5432-pgsql=# select
    any_profile
from ag_catalog.cypher('test_graph',$$
    EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
    WHERE any_profile.pet = 'dog'
    RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
                                                                         QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
------------------------
 Bitmap Heap Scan on profile any_profile  (cost=20.08..52.40 rows=1 width=32) (actual time=1.763..47.451 rows=1721 loops=1)
   Recheck Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
   Filter: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agty
pe]) = '"dog"'::agtype)
   Rows Removed by Filter: 3308
   Heap Blocks: exact=143
   ->  Bitmap Index Scan on profile_gin_idx  (cost=0.00..20.08 rows=10 width=0) (actual time=1.653..1.653 rows=5029 loops=1)
         Index Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
 Planning Time: 2.109 ms
 Execution Time: 47.805 ms
(9 rows)

Adding 3rd index

psql-15.3-5432-pgsql=# CREATE INDEX profile_pet_btree_idx3 ON test_graph."profile" USING BTREE (
agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agtype]));
CREATE INDEX

After adding 3rd index

psql-15.3-5432-pgsql=# select
    any_profile
from ag_catalog.cypher('test_graph',$$
    EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
    WHERE any_profile.pet = 'dog'
    RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);
                                                                                                                QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on profile any_profile  (cost=24.99..29.02 rows=1 width=32) (actual time=5.618..24.466 rows=1721 loops=1)
   Recheck Cond: ((agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"
'::agtype]) = '"dog"'::agtype) AND (properties @> agtype_build_map('hidden'::text, 'false'::agtype)))
   Heap Blocks: exact=143
   ->  BitmapAnd  (cost=24.99..24.99 rows=1 width=0) (actual time=5.534..5.534 rows=0 loops=1)
         ->  Bitmap Index Scan on profile_pet_btree_idx3  (cost=0.00..4.66 rows=50 width=0) (actual time=4.081..4.081 rows=3416 loop
s=1)
               Index Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties
), '"pet"'::agtype]) = '"dog"'::agtype)
         ->  Bitmap Index Scan on profile_gin_idx  (cost=0.00..20.08 rows=10 width=0) (actual time=1.439..1.439 rows=5029 loops=1)
               Index Cond: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
 Planning Time: 1.436 ms
 Execution Time: 24.784 ms
(10 rows)

psql-15.3-5432-pgsql=#
vladiksun commented 1 year ago

@jrgemignani this looks promising, thank you. By the way can you get rid of using label id while creating the index definition? As I recall we've got this whole function call from the execution plan as it is. I assume this might involve changing the main logic so the code doesn't use label ID either. But when used with liquebase in the kubernetes environment this means we should precalculate the label id before we use it which is not convinient for schema initialization scripts. Ideally the index creation scripts should use predefined data if possible.

jrgemignani commented 1 year ago

@vladiksun In the following query -

select
    any_profile
from ag_catalog.cypher('test_graph',$$
    EXPLAIN ANALYZE MATCH (any_profile:`profile` { hidden: false })
    WHERE any_profile.pet = 'dog'
    RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);

The cypher component -

MATCH (any_profile:`profile` { hidden: false })
WHERE any_profile.pet = 'dog'
RETURN any_profile;

any_profile is a vertex

In your first index below, I'm not sure if this has been set up. We will need to debug this specific index to find out how it is resolved.

-- does not work in the where clause
CREATE INDEX profile_pet_btree_idx1 ON test_graph."profile" USING BTREE ((properties -> 'pet'));

In your second index below, agtype_access_operator aka . understands how to work with whatever value is sent to it. The properties() function isn't used as agtype_access_operator has to be able to accept any object or array. So, any . operation just takes the variable used.

-- does not work in the where clause
CREATE INDEX profile_pet_btree_idx2 ON test_graph."profile" USING BTREE (ag_catalog.agtype_access_operator(properties, '"pet"'::ag_catalog.agtype));

That is why this index worked -

CREATE INDEX profile_pet_btree_idx3 ON test_graph."profile" USING BTREE (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agtype]));

I should note, this is after adjusting the volatility flag for the functions used in the index.

Our team needs to look into changing the volatility flag for the functions used in this index and similar ones for edges. Part of the problem originally was that some wanted to blanketly change that flag without bothering to truly understand if it can be changed. Changing the flag obviously works but does it pose other issues?

For you, as a temporary fix, all you need to do is change the volatility flag of the functions involved in age--1.3.0.sql and then rebuild the extension. Or, possibly, alter their entries through sql.

We will be looking into the issue with the volatility flag to see if it can be updated with a patch, sometime this week.

Hopefully, this is helpful.

jrgemignani commented 1 year ago

By the way can you get rid of using label id while creating the index definition?

((agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"
'::agtype]) = '"dog"'::agtype) AND (properties @> agtype_build_map('hidden'::text, 'false'::agtype)))

It might be possible to reduce the first into something similar to the second. However, we would have to look into the logic of how this is being resolved and passed.

Right now, our main focus has been on completing the cypher specification, making the logic more consistent, fixing bugs, and bringing Apache AGE to PostgreSQL version 15. Once we get to PostgreSQL version 15 we will have more resources to look more deeply into other issues, like indexes. Btw, I'm not trying to sound dismissive of your issue(s), we just have very limited resources qualified to focus on issues like these, at this moment.

15 should be ready in about a week and hopefully we can add in fixes for the volatility flag with it.

jrgemignani commented 1 year ago

@vladiksun updated comment above

jrgemignani commented 1 year ago

@vladiksun I created PR #1133 to address the volatility flags that blocks the index from being created on the WHERE clause. When this patch is merged, it will only be in the master branch - until it is pushed to the other branches.

I am also looking at removing _label_name function call, but that may not be possible or worthwhile.

jrgemignani commented 1 year ago

@vladiksun The PR has been applied to the master branch. So, your indexes on the WHERE should work now, provided you use the master branch.

jrgemignani commented 1 year ago

@vladiksun For the _label_name function. While the function could be eliminated, the numeric value passed (which I'm assuming is what is problematic for you) wouldn't be. If I were to eliminate it altogether, then the index wouldn't likely have anything to key on.

Btw, the way a label name is stored and retrieved is via the graph's OID (which graph the vertex or edge belongs to) and a part of the vertex or edge ID (the ID contains a unique ID plus the ID of the label table it belongs to). A vertex or edge is not stored with its label name. That is why it needs to be constructed to build in memory representation.

vladiksun commented 1 year ago

@jrgemignani thank you. We will try this fix any time soon. As for the graph OID I believe we could get this OID by graph name from apache age tables and dynamically construct index creation scripts.

vladiksun commented 1 year ago

@jrgemignani unfortunately we were not able to test it because master branch does not compile for some reason. We usualy build release distributions inside docker based on postgres docker images.

Probably something was broken recently because about a month ago master was compilable.

Created a ticket for this: https://github.com/apache/age/issues/1140

vladiksun commented 1 year ago

@jrgemignani Looks like master moved to postgres 15, we compiled the master and started checking. https://github.com/apache/age/issues/1140 closed.

jrgemignani commented 1 year ago

@vladiksun Yes, master just moved to PostgreSQL version 15 on Friday. If you prefer to use a different version, there are branches for each PG11, PG12, PG13, & PG14.

vladiksun commented 1 year ago

@jrgemignani indexes work fine for cypher equality operator but we were not able to use the index for Cypher IN operator.

For example:

ALTER FUNCTION agtype_in_operator IMMUTABLE;
CREATE INDEX profile_pet_btree_check_in_operator ON test_graph."profile" USING BTREE (
    ag_catalog.agtype_in_operator('["dog", "cat"]'::agtype, agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('17580'::oid, id), properties), '"pet"'::agtype]))
);

The query:

select
    any_profile
from ag_catalog.cypher('test_graph',$$
    EXPLAIN ANALYZE
    MATCH (any_profile:`profile` { hidden: false })
    WHERE any_profile.pet IN ['dog', 'cat']
    RETURN any_profile
$$
) as (any_profile ag_catalog.agtype);

For some reason the optimizer does not involve the index even if we try to multiple data ten times. Is it safe to change the volatility flag for agtype_in_operator function ?

QUERY PLAN
Gather (cost=1000.00..11311.91 rows=133 width=32) (actual time=0.616..345.966 rows=133994 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on profile any_profile (cost=0.00..10298.61 rows=55 width=32) (actual time=0.254..331.683 rows=44665 loops=3)
Filter: ((properties @> agtype_build_map('hidden'::text, 'false'::agtype)) AND agtype_in_operator('["dog", "cat"]'::agtype, agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('17580'::oid, id), properties), '"pet"'::agtype])))
Rows Removed by Filter: 88669
Planning Time: 0.295 ms
Execution Time: 350.271 ms
jrgemignani commented 1 year ago

@vladiksun I will look into this tomorrow.

jrgemignani commented 1 year ago

@vladiksun Unfortunately, this one is a bit more complex and will take some time to understand why the index won't take. From what I've seen, the nodes are of two different types: FuncExpr and Var. Because of this, it won't match. I need to understand why they are that way.

edit: It may also be an issue with a support function not being available.

jrgemignani commented 1 year ago

@vladiksun It looks like, in order to support an index on the IN operator, there needs to be a "support" function. This is in addition to the IMMUTABLE flag.

vladiksun commented 1 year ago

@jrgemignani Does it mean it requires additional development ?

jrgemignani commented 1 year ago

@vladiksun From what I've found, I believe so. Every time I debug the index -

psql-15.3-5432-pgsql=# CREATE INDEX profile_pet_btree_check_in_operator ON test_graph."profile" USING BTREE (
agtype_in_operator('["dog", "cat"]'::agtype, agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('16954'::oid, id), properties), '"pet"'::agtype])));

the function that creates the index paths - create_index_paths - returns NULL partly because there isn't a support function defined for it. These functions are defined in pg_procand assigned under the prosupport column.

Assuming a support function would help here, we would need to research how to create one; the documentation is rather limited and so are the examples.

It is possible there is another path that could create the index. However, this particular path is the one that matches but returns NULL due to the lack of a support function.

Of course, this is all after changing the volatility flag to IMMUTABLE for agtype_in_operator.

Right now, I have to turn my attention to the Apache release process for AGE for the next few weeks. So, I won't be able to give this issue as much time until after the releases are out.

Hope this is helpful.

dehowef commented 1 year ago

@vladiksun I'll be looking into the implementation of the support function for b-tree indexing. I've been looking at how indexing works in AGE for quite a bit on and off. Hopefully we can fix this up soon. Thank you for communicating with us~

vladiksun commented 1 year ago

@dehowef thanks for taking a look at this.

dehowef commented 1 year ago

@vladiksun I have been looking into how B-tree indexing works and indexing in general, and I came across documentation that shows that B-tree indexes rely on operators (=,<,>,<=,>=) to work. Since the IN operator filter as shown by EXPLAIN ANALYZE is a function call, I feel that this might contribute to why the index is not being used.

The agtype operator class is defined in the age--x.x.x.sql file, maybe looking there might offer further insight. Just thought I'd share what I've been looking into with you. Hope this might be helpful.

vladiksun commented 1 year ago

@dehowef Thank you for paying attention to this. Right now, the "==" operator also involves functional calls, so the only way to kick out the index is to reverse engineer the execution plan and extract the function itself in order to create the index.

Ideally, the WHERE should not use functional indexes at all, like MATCH does not for GIN indexes, for two reasons:

dehowef commented 1 year ago

@vladiksun As it stands, a decent portion of AGE's logic is implemented with functions as you can see 😅

With this knowledge in mind, I think it's worth having a discussion with the community on this and how it relates to indexing and what considerations should be made in light of these discoveries.

Were there other instances where these functional indexes have been giving you a hard time?

vladiksun commented 1 year ago

@dehowef I believe for now only IN operator. But thanks to @jrgemignani the fix for "=" operator is already in version 1.4.0

rafsun42 commented 1 year ago

@vladiksun I tried recreating the 'IN' query using the 'ANY' operator, but in SQL (instead of Cypher).

EXPLAIN ANALYZE
SELECT * FROM test_graph.profile
WHERE
    agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('49745'::oid, id), properties), '"pet"'::agtype])
            = ANY(ARRAY['"dog"'::agtype, '"cat"'::agtype])
    AND properties @> agtype_build_map('hidden'::text, 'false'::agtype);

I get this query plan:

Bitmap Heap Scan on profile  (cost=9.32..148.87 rows=1 width=40) (actual time=13.369..59.118 rows=3326 loops=1)
   Recheck Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('49745'::oid, id), properties), '"pet"'::agtype]) = ANY ('{"\"dog\"","\"cat\""}'::agtype[]))
   Filter: (properties @> agtype_build_map('hidden'::text, 'false'::agtype))
   Rows Removed by Filter: 3296
   Heap Blocks: exact=143
   ->  Bitmap Index Scan on profile_pet_btree_idx31  (cost=0.00..9.32 rows=100 width=0) (actual time=13.189..13.191 rows=6622 loops=1)
         Index Cond: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(id, _label_name('49745'::oid, id), properties), '"pet"'::agtype]) = ANY ('{"\"dog\"","\"cat\""}'::agtype[]))
 Planning Time: 0.667 ms
 Execution Time: 59.693 ms
(9 rows)

Let me know if this SQL query is what you are looking for and if it uses index on your machine. We may have a patch that transforms IN into an ANY. We can work on integrating that into master.

vladiksun commented 1 year ago

@rafsun42 thank you. We will check this next week and let you know.

vladiksun commented 1 year ago

@rafsun42 we checked. It works on my machine. This could be a good solution

jrgemignani commented 1 year ago

@vladiksun I have another PR to enhance performance and indexes #1232

Also, I want to point out that, for whatever reason, your particular dataset and query does best without using a GIN index. I've checked our GIN index components and they are largely the same as PG's JSONB GIN components, so I'm thinking it's your dataset that just doesn't work well with a GIN index. I did find that a BTREE index, by itself, works best for your dataset. Not a BitmapAnd of the GIN and BTREE, which fell between GIN and BTREE performance gains.

On my server, the above PR cut the query time, for your queries above, in half without using any indexes. Adding the GIN index only gave a very minor improvement. The BTREE index (*_idx2) cut the query time in half again.

vladiksun commented 1 year ago

@jrgemignani Thanks for sharing the PR. I have built that branch to see how it could affect our typical queries. It turns out that for our use cases, we do not see much difference. Usually, our typical queries are like this:

select properties from ag_catalog.cypher('test_graph',
                                         $$
//EXPLAIN ANALYZE
MATCH (start_vertex:`scope` {hidden: false})
WHERE start_vertex.name = 'scope_1'
OR start_vertex.name = 'scope_2'
OR start_vertex.name = 'scope_3'
OR start_vertex.name = 'scope_4'
OR start_vertex.name = 'scope_5'
OR start_vertex.name = 'scope_6'

RETURN {
    properties: properties(start_vertex),
    ___edge_reserved: properties(null)
}

UNION

MATCH (from:`scope` {hidden: false})
-[e1:`edges`*0..5 {hidden: false}]->(to1:`role` {hidden: false})
-[e2:`edges`*1..1 {hidden: false, fromCollection: 'role', toCollection: 'permission'}]->(to2:`permission` {hidden: false})
-[e3:`edges`*1..1 {hidden: false, fromCollection: 'permission', toCollection: 'resource'}]->(to3:`resource` {hidden: false})

WHERE from.name = 'scope_1'
OR from.name = 'scope_2'
OR from.name = 'scope_3'
OR from.name = 'scope_4'
OR from.name = 'scope_5'
OR from.name = 'scope_6'

WITH
collect({ properties: properties(to1), ___edge_reserved: properties(last(e1)) })
+ collect({ properties: properties(to2), ___edge_reserved: properties(last(e2)) })
+ collect({ properties: properties(to3), ___edge_reserved: properties(last(e3)) }) as result

UNWIND result as result_out
RETURN DISTINCT result_out

LIMIT 1000
$$) as (properties ag_catalog.agtype);

As you can see we use only one table (e_label) for edges and structures like

{hidden: false, fromCollection: 'role', toCollection: 'permission'}

improve the execution time because there is a call to a GIN index that cuts unnecessary edges. That also helps simplify backend logic.

So, such a query:

select properties from ag_catalog.cypher('test_graph',
                                         $$
//EXPLAIN ANALYZE
MATCH (from:`domain` {hidden: false})-[any_edge:`edges`*1..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'managed'}]->(to:`domain` {hidden: false, id: 'domain_root_1'})
WITH from, any_edge, to, head(any_edge) AS edge_check_in, head(any_edge) AS edge_result_in

RETURN DISTINCT {
    properties: properties(from),
    ___edge_reserved: properties(edge_result_in)
}
LIMIT 1000

$$) as (properties ag_catalog.agtype);

works even faster than the version with lots of edge tables.


select properties from ag_catalog.cypher('test_graph',
                                         $$
//EXPLAIN ANALYZE
MATCH (from:`domain` {hidden: false})-[any_edge:`E_DOMAIN_MANAGED_DOMAIN`*1..10 {hidden: false}]->(to:`domain` {hidden: false, id: 'domain_root_1'})
WITH from, any_edge, to, head(any_edge) AS edge_check_in, head(any_edge) AS edge_result_in

RETURN DISTINCT {
    properties: properties(from),
    ___edge_reserved: properties(edge_result_in)
}
LIMIT 1000

$$) as (properties ag_catalog.agtype);
``` 
jrgemignani commented 1 year ago

@vladiksun We should probably look into these later queries separately. My tests were only against your query at the top of this thread and only on my server. If the issue with using WHERE (unable to create an index or index isn't working) is resolved? we should probably close this thread and create one for performance improvements and link this thread.

Thoughts?

vladiksun commented 1 year ago

@jrgemignani I believe @rafsun42 also suggested the solution for IN operator. I can move the issue with "IN" operator to another ticket. Let me know if it is convenient to do so or leave this issue open until the solution for "IN" operator is implemented.

jrgemignani commented 1 year ago

@vladiksun We probably should keep the issues separate but link them. Really long threads can lose their readability, especially if they diverge too much. Also, having a long thread with many issues gives the impression that it is just one issue that is being worked on, when it can be many more.

vladiksun commented 1 year ago

@jrgemignani, @rafsun42 ok I will create a new issue for the proposed solution for IN operator and link it to the this one. Thank you.

jrgemignani commented 1 year ago

@vladiksun Yes, @rafsun42 and @dehowef have both been working on PRs for improving WHERE performance. Hopefully, we'll have them all out by the end of the week.