apache / age

Graph database optimized for fast analysis and real-time data processing. It is provided as an extension to PostgreSQL.
https://age.apache.org
Apache License 2.0

Simple query takes long time to run #1974

Closed Kobaruon closed 4 months ago

Kobaruon commented 4 months ago

This simple query takes about 3 minutes to run. Is this normal behaviour? Is there room for improvement here in terms of indexing or rewriting the query?

SELECT * from cypher('passv2', $$
        MATCH (V)-[R:WORKED_ON]-(V2)
        RETURN V,R,V2
        LIMIT 100
$$) as (V agtype, R agtype, V2 agtype);
                                                                                Table "passv2.WORKED_ON"
   Column   |        Type        | Collation | Nullable |                                                            Default
------------+--------------------+-----------+----------+--------------------------------------------------------------------------------------------------------------------------------
 id         | ag_catalog.graphid |           | not null | ag_catalog._graphid(ag_catalog._label_id('passv2'::name, 'WORKED_ON'::name)::integer, nextval('"WORKED_ON_id_seq"'::regclass))
 start_id   | ag_catalog.graphid |           | not null |
 end_id     | ag_catalog.graphid |           | not null |
 properties | ag_catalog.agtype  |           | not null | ag_catalog.agtype_build_map()
Indexes:
    "WORKED_ON_end_id_start_id_idx" btree (end_id, start_id)
    "WORKED_ON_id_idx" btree (id)
    "WORKED_ON_properties_idx" gin (properties)
    "WORKED_ON_start_id_end_id_idx" btree (start_id, end_id)
    "idx_worked_on_end_id" btree (end_id)
    "idx_worked_on_start_id" btree (start_id)
Inherits: _ag_label_edge
Explain Analyze

```
                                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.57..30222613.91 rows=100 width=96) (actual time=180465.331..182428.507 rows=100 loops=1)
   Buffers: shared hit=202943877
   ->  Gather  (cost=1000.57..414641535.58 rows=1372 width=96) (actual time=180465.329..182428.494 rows=100 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=202943877
         ->  Nested Loop  (cost=0.57..414640398.38 rows=572 width=96) (actual time=61699.042..92998.472 rows=53 loops=3)
               Buffers: shared hit=202943877
               ->  Nested Loop  (cost=0.00..24546776.28 rows=224054050 width=472) (actual time=0.022..9394.517 rows=16767051 loops=3)
                     Buffers: shared hit=1738966
                     ->  Parallel Append  (cost=0.00..1355.14 rows=9663 width=236) (actual time=0.008..0.928 rows=724 loops=3)
                           Buffers: shared hit=49
                           ->  Parallel Seq Scan on pod "V_6"  (cost=0.00..997.72 rows=10791 width=257) (actual time=0.005..0.501 rows=681 loops=1)
                                 Buffers: shared hit=26
                           ->  Parallel Seq Scan on mars "V_3"  (cost=0.00..165.77 rows=1992 width=199) (actual time=0.007..0.020 rows=18 loops=2)
                                 Buffers: shared hit=2
                           ->  Parallel Seq Scan on "DNE" "V_2"  (cost=0.00..40.30 rows=810 width=52) (actual time=0.004..1.122 rows=1377 loops=1)
                                 Buffers: shared hit=16
                           ->  Parallel Seq Scan on earth "V_5"  (cost=0.00..2.78 rows=26 width=162) (actual time=0.002..0.034 rows=44 loops=1)
                                 Buffers: shared hit=2
                           ->  Parallel Seq Scan on venus "V_4"  (cost=0.00..1.55 rows=18 width=220) (actual time=0.003..0.026 rows=31 loops=1)
                                 Buffers: shared hit=1
                           ->  Parallel Seq Scan on "euCentral01" "V_7"  (cost=0.00..1.07 rows=2 width=59) (actual time=0.001..0.003 rows=4 loops=1)
                                 Buffers: shared hit=1
                           ->  Parallel Seq Scan on _ag_label_vertex "V_1"  (cost=0.00..1.02 rows=1 width=35) (actual time=0.003..0.004 rows=1 loops=1)
                                 Buffers: shared hit=1
                           ->  Parallel Seq Scan on host "V_8"  (cost=0.00..0.00 rows=1 width=79) (actual time=0.001..0.001 rows=0 loops=1)
                     ->  Append  (cost=0.00..1844.47 rows=23189 width=236) (actual time=0.002..7.452 rows=23148 loops=2173)
                           Buffers: shared hit=1738917
                           ->  Seq Scan on _ag_label_vertex "V2_1"  (cost=0.00..1.03 rows=1 width=35) (actual time=0.002..0.002 rows=1 loops=2173)
                                 Buffers: shared hit=2173
                           ->  Seq Scan on "DNE" "V2_2"  (cost=0.00..57.31 rows=1377 width=52) (actual time=0.002..0.191 rows=1377 loops=2173)
                                 Buffers: shared hit=34768
                           ->  Seq Scan on mars "V2_3"  (cost=0.00..207.61 rows=3387 width=199) (actual time=0.003..0.576 rows=3386 loops=2173)
                                 Buffers: shared hit=230245
                           ->  Seq Scan on venus "V2_4"  (cost=0.00..1.93 rows=31 width=220) (actual time=0.004..0.007 rows=31 loops=2172)
                                 Buffers: shared hit=2172
                           ->  Seq Scan on earth "V2_5"  (cost=0.00..3.32 rows=44 width=162) (actual time=0.002..0.008 rows=44 loops=2172)
                                 Buffers: shared hit=4344
                           ->  Seq Scan on pod "V2_6"  (cost=0.00..1224.32 rows=18344 width=257) (actual time=0.001..3.281 rows=18314 loops=2172)
                                 Buffers: shared hit=1463045
                           ->  Seq Scan on "euCentral01" "V2_7"  (cost=0.00..1.12 rows=4 width=59) (actual time=0.002..0.002 rows=4 loops=2170)
                                 Buffers: shared hit=2170
                           ->  Seq Scan on host "V2_8"  (cost=0.00..0.00 rows=1 width=79) (actual time=0.000..0.000 rows=0 loops=2170)
               ->  Bitmap Heap Scan on "WORKED_ON" "R"  (cost=0.57..1.71 rows=1 width=51) (actual time=0.003..0.003 rows=0 loops=50301154)
                     Recheck Cond: (((start_id = "V".id) AND (end_id = "V2".id)) OR ((start_id = "V2".id) AND (end_id = "V".id)))
                     Heap Blocks: exact=1
                     Buffers: shared hit=201204731
                     ->  BitmapOr  (cost=0.57..0.57 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=50301154)
                           Buffers: shared hit=201204624
                           ->  Bitmap Index Scan on "WORKED_ON_start_id_end_id_idx"  (cost=0.00..0.29 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=50301154)
                                 Index Cond: ((start_id = "V".id) AND (end_id = "V2".id))
                                 Buffers: shared hit=100602316
                           ->  Bitmap Index Scan on "WORKED_ON_start_id_end_id_idx"  (cost=0.00..0.29 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=50301154)
                                 Index Cond: ((start_id = "V2".id) AND (end_id = "V".id))
                                 Buffers: shared hit=100602308
 Planning:
   Buffers: shared hit=6
 Planning Time: 0.694 ms
 Execution Time: 182428.614 ms
(60 rows)
```
jrgemignani commented 4 months ago

@Kobaruon How long a query takes, or is expected to take, depends on the dataset, the query, and the machine it is running on. This particular query, while looking simple, searches for a match regardless of direction (-[]-); such queries are notoriously slow.
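For comparison, a directed form of the same match restricts the search to one traversal direction. This is a sketch only; which direction (if either) matches your data depends on how the WORKED_ON edges were created:

```sql
-- Sketch: directed variant, only matches edges stored as V -> V2.
SELECT * from cypher('passv2', $$
        MATCH (V)-[R:WORKED_ON]->(V2)
        RETURN V,R,V2
        LIMIT 100
$$) as (V agtype, R agtype, V2 agtype);
```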

Kobaruon commented 4 months ago

@jrgemignani It takes around 3 minutes to run; we expect around 1 second execution time. The DB VM has 8 cores and 32 GB of memory, and the data size is small (~600 MB).

We have a very similar query with the same table structure and the same indexes. On that table, the query below runs in under 1 second.

SELECT * from cypher('passv2', $$
        MATCH (V)-[R:TCP]-(V2)
        RETURN V,R,V2 LIMIT 100
$$) as (V agtype, R agtype, V2 agtype);

Yes, without LIMIT it takes ages to run, but with LIMIT we expect fast execution.

Edge label counts, in case that matters:

MironAtHome commented 4 months ago

@Kobaruon It would be very nice if you could share the EXPLAIN ANALYZE output for query 2, the fast one, along with the Postgres version :) Based on a reading of the first plan, the time was spent here:

Gather (cost=1000.57..414641535.58 rows=1372 width=96) (actual time=180465.329..182428.494 rows=100 loops=1)

I am looking into what this can actually mean and what options are available to tune this aspect of the query, if possible.

The first relevant article hinted at disabling this, but with the caution that it should only be applied locally to the specific query. There is a similar, but slightly different, take on forcing a single thread for the query here. Please note that this may not actually be a fix, and it needs to be tested first before attempting it in a production environment.

Kobaruon commented 4 months ago

@MironAtHome I am using PostgreSQL 16.3.

SELECT * from cypher('passv2', $$
        MATCH (V)-[R:TCP]-(V2)
        RETURN V,R,V2 LIMIT 100
$$) as (V agtype, R agtype, V2 agtype);
Explain analyze of the fast query

```
                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1001.71..11361.42 rows=100 width=96) (actual time=378.834..517.894 rows=100 loops=1)
   ->  Gather  (cost=1001.71..49215657.81 rows=475058 width=96) (actual time=378.832..517.876 rows=100 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Nested Loop  (cost=1.71..49167152.01 rows=197941 width=96) (actual time=232.618..504.085 rows=43 loops=3)
               ->  Parallel Append  (cost=0.00..2285.66 rows=2488 width=179) (actual time=0.012..0.022 rows=6 loops=3)
                     ->  Parallel Seq Scan on mars "V2_2"  (cost=0.00..1203.92 rows=2264 width=178) (actual time=0.003..0.005 rows=5 loops=1)
                     ->  Parallel Seq Scan on pod "V2_5"  (cost=0.00..980.36 rows=745 width=264) (actual time=0.024..0.031 rows=7 loops=1)
                     ->  Parallel Seq Scan on dne "V2_6"  (cost=0.00..45.92 rows=464 width=53) (never executed)
                     ->  Parallel Seq Scan on venus "V2_3"  (cost=0.00..11.58 rows=19 width=164) (never executed)
                     ->  Parallel Seq Scan on earth "V2_4"  (cost=0.00..3.49 rows=16 width=159) (actual time=0.002..0.003 rows=2 loops=1)
                     ->  Parallel Seq Scan on smars "V2_7"  (cost=0.00..2.04 rows=1 width=52) (actual time=0.001..0.003 rows=2 loops=1)
                     ->  Parallel Seq Scan on _ag_label_vertex "V2_1"  (cost=0.00..1.02 rows=1 width=35) (actual time=0.007..0.008 rows=1 loops=1)
                     ->  Parallel Seq Scan on abc "V2_8"  (cost=0.00..0.00 rows=1 width=51) (actual time=0.001..0.001 rows=0 loops=1)
               ->  Nested Loop  (cost=1.71..19580.50 rows=5970 width=244) (actual time=52.545..88.865 rows=8 loops=17)
                     ->  Append  (cost=0.00..2411.62 rows=5970 width=179) (actual time=0.002..3.528 rows=5287 loops=17)
                           ->  Seq Scan on _ag_label_vertex "V_1"  (cost=0.00..1.03 rows=1 width=35) (actual time=0.002..0.002 rows=1 loops=17)
                           ->  Seq Scan on mars "V_2"  (cost=0.00..1251.47 rows=3849 width=178) (actual time=0.004..1.836 rows=3498 loops=17)
                           ->  Seq Scan on venus "V_3"  (cost=0.00..11.99 rows=33 width=164) (actual time=0.008..0.022 rows=33 loops=15)
                           ->  Seq Scan on earth "V_4"  (cost=0.00..3.84 rows=28 width=159) (actual time=0.004..0.010 rows=28 loops=15)
                           ->  Seq Scan on pod "V_5"  (cost=0.00..996.01 rows=1267 width=264) (actual time=0.005..1.007 rows=1227 loops=15)
                           ->  Seq Scan on dne "V_6"  (cost=0.00..55.67 rows=789 width=53) (actual time=0.006..0.139 rows=789 loops=14)
                           ->  Seq Scan on smars "V_7"  (cost=0.00..2.06 rows=2 width=52) (actual time=0.016..0.018 rows=2 loops=14)
                           ->  Seq Scan on abc "V_8"  (cost=0.00..0.00 rows=1 width=51) (actual time=0.002..0.002 rows=0 loops=14)
                     ->  Bitmap Heap Scan on "TCP" "R"  (cost=1.71..2.85 rows=1 width=65) (actual time=0.014..0.014 rows=0 loops=89876)
                           Recheck Cond: (((end_id = "V2".id) AND (start_id = "V".id)) OR ((end_id = "V".id) AND (start_id = "V2".id)))
                           Heap Blocks: exact=7
                           ->  BitmapOr  (cost=1.71..1.71 rows=1 width=0) (actual time=0.014..0.014 rows=0 loops=89876)
                                 ->  Bitmap Index Scan on idx_tcp_end_start  (cost=0.00..0.85 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=89876)
                                       Index Cond: ((end_id = "V2".id) AND (start_id = "V".id))
                                 ->  Bitmap Index Scan on idx_tcp_end_start  (cost=0.00..0.85 rows=1 width=0) (actual time=0.012..0.012 rows=0 loops=89876)
                                       Index Cond: ((end_id = "V".id) AND (start_id = "V2".id))
 Planning Time: 3.619 ms
 Execution Time: 518.101 ms
(34 rows)
```

I tried your recommendation and it seems to make the query faster; however, for some reason the query itself already runs faster now without any changes. It's still slower than we expected, but it's a great improvement.

Original (takes 19.8 sec): https://explain.dalibo.com/plan/dhe464cb1579b734

Tweaked (takes 2.7 sec):

BEGIN;
SET LOCAL parallel_leader_participation = off;
explain analyze SELECT * from cypher('passv2', $$
        MATCH (V)-[R:WORKED_ON]-(V2)
        RETURN V,R,V2
LIMIT 100
$$) as (V agtype, R agtype, V2 agtype);

https://explain.dalibo.com/plan/5f29h6473626777e

MironAtHome commented 4 months ago

Please see the additional link on forcing single-threaded execution. And thank you for sharing your feedback.

Kobaruon commented 4 months ago

Sorry, I completely missed that, but it seems to increase execution time.

SET max_parallel_workers_per_gather = 0; https://explain.dalibo.com/plan/8a9475c9329f762f

MironAtHome commented 4 months ago

Ok, the experiment hints that SET LOCAL parallel_leader_participation = off; is the better-performing option. One thing that isn't 100% clear to me is the bi-directional query design. May I ask you to run the two queries below? If both produce identical results, then we can further tune the query by using a direction in the MATCH; otherwise, the improvement already found seems reasonable.

However, this needs to be confirmed (only on a test instance of the server). I am not sure how to clean up buffers on your running server. On my servers I am allowed to perform any administrative action, so I would just restart; please see if that is doable where you are. Usually an RDBMS has commands to clean up memory buffers so a query runs "anew", as if it had never been run before. Once the performance gains are confirmed, they count.

Please see if these two queries produce identical output. Query 1:

SELECT * from cypher('passv2', $$
        MATCH (V)-[R:WORKED_ON]->(V2)
        RETURN V,R,V2
        ORDER BY id(V) ASC, id(R) ASC, id(V2) ASC
LIMIT 100
$$) as (V agtype, R agtype, V2 agtype);

Query 2:

SELECT * from cypher('passv2', $$
        MATCH (V)<-[R:WORKED_ON]-(V2)
        RETURN V,R,V2
        ORDER BY id(V) ASC, id(R) ASC, id(V2) ASC
LIMIT 100
$$) as (V agtype, R agtype, V2 agtype);

Seeing row counts skewed in the query plan depending on the direction of the relationship provoked my curiosity, so I would like to confirm.
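For reference, one rough way to get a "cold" run on a Linux test box is sketched below. This is for test servers only, and the service name may differ on your distribution; Postgres itself has no command to flush its shared buffers, so a restart is the usual approach, combined with dropping the OS page cache:

```shell
# Test servers only: restart Postgres to clear shared buffers,
# then drop the kernel page cache so disk reads are cold too.
sudo systemctl restart postgresql
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
```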

Kobaruon commented 4 months ago

I restarted the server and tried again with SET LOCAL parallel_leader_participation = off; and the results are the same, so it clearly works.

If you give any direction, the query runs much, much faster, and yes, it seems these two produce identical output.

Query 1: https://explain.dalibo.com/plan/gc622ba38g91afb1

Query 2: https://explain.dalibo.com/plan/14bd391394c9758c

MironAtHome commented 4 months ago

Since the 100 rows do produce identical output, I would suggest running once without LIMIT (I would expect that to also produce identical output), and then applying the direction, either way, permanently as part of the query design.

One more thing to check, and this is for the virtual machine rather than the Postgres engine or query tuning: please see what memory configuration is applied on the virtual machine running Postgres. This isn't really about the OS; it's purely VM configuration. The feature to check is memory allocation. There are two main settings for this feature:

  1. dynamic memory allocation
  2. fixed memory allocation

It seems to me there is a possibility that this VM is configured with dynamic memory allocation. Here is a nice article describing what this feature is and how it works. The tuning option you were able to confirm, SET LOCAL parallel_leader_participation = off;, is friendly to this VM feature. Please note, I am not suggesting modifying the memory allocation policy, as that may impact other VMs running on the same host and so has potentially wide impact, since RDBMSs are very greedy with RAM. However, it should help explain why some queries run slower, in particular what causes the Gather process to wait: it needs a certain amount of memory to gather rows together, and how much it needs is driven by the rather arcane rules of the RDBMS engine (in this case Postgres), which may ask for more than just the number of bytes the rows would strictly fit into. What we did with the tuning option applied is promote what is called a streaming approach, where the RDBMS tries to make do with as little RAM as possible, just enough to copy rows from the row-reading component of the RDBMS to the row-producing component. If the direction change proves to work, it promotes further saving of resources.
jrgemignani commented 4 months ago

@Kobaruon As I mentioned earlier, -[]- (directionless) is a lot more intensive. You are trying to match any edge that has any connection to the currently selected vertices; directed edges will always run faster.

While you may have a similar query that runs faster and is also directionless, the runtime depends on the total number of vertices and edges that are selected and how connected they are. It could be that the faster query involves a less connected set than the slower one. Just a thought, without seeing the actual data.

Edit: I should mention that the match logic is likely inefficient for directionless edges as AGE only builds digraphs. This is something our team should look into, in the future, given time.

MironAtHome commented 4 months ago

> I restarted the server and tried again with SET LOCAL parallel_leader_participation = off; and the results are same. So, it clearly works.
>
> If you give any direction query runs much much faster and yes seems like these two produce idential output. Query 1: https://explain.dalibo.com/plan/gc622ba38g91afb1
>
> Query 2: https://explain.dalibo.com/plan/14bd391394c9758c

After studying closely the behavior of the operators ()-[]-() and ()-[]->(), it is evident that the data will be presented as follows. If, for example, one creates two vertices, v:Person {name:'Alice'} and v:Person {name:'Bob'}, and follows by creating two edges, Alice-[:knows]->Bob and Alice<-[:knows]-Bob, with the ids assigned as follows:

Alice: id 1
Bob: id 2
Edge :knows for the relationship Alice-[:knows]->Bob: id 3
Edge :knows for the relationship Alice<-[:knows]-Bob: id 4

then the query

MATCH (v1)-[:knows]-(v2)

will return the following combinations of ids:

Row #1: 1,3,2
Row #2: 1,4,2
Row #3: 2,3,1
Row #4: 2,4,1

However, if we were to present not the ids (which are really system values used by the machine internals) but just the distinct combinations of names:

MATCH (v1)-[:knows]-(v2)
WITH v1.name as Name1, v2.name as Name2
RETURN DISTINCT Name1, Name2
ORDER BY Name1 ASC, Name2 ASC

We will get back:

Alice, Bob
Bob, Alice

Now, if we were to issue the directed query (v1)-[:knows]->(v2), the id-producing query will return only half the rows:

Row #1: 1,3,2
Row #2: 2,4,1

However, for the name-producing query, there would be no difference.

In other words, (v1)-[:knows]-(v2)

Behaves as (v1)-[:knows]->(v2) UNION (v1)<-[:knows]-(v2)

To summarize: looking at the last query just above, the undirected form returns 2x the number of rows stored, with the same relationship listed twice, with the two sides of the data swapped.

If producing each row once, with the relationship going left to right, is acceptable, the query can be expressed as (v1)-[:]->(v2) without losing resolution. It also makes sense to test the performance of the UNION query, if not to get better performance, then to explore one more option. And finally, if query performance is essential, it could be useful to produce the various permutations of the data on the client side.
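As a sketch, the UNION rewrite applied to the WORKED_ON query from this thread might look like the following. This is untested here, and UNION support inside cypher() should be confirmed on your AGE version first, so treat it as an illustration rather than a drop-in replacement:

```sql
-- Sketch: rewriting the undirected MATCH as two directed MATCHes.
-- UNION (without ALL) also de-duplicates rows, mirroring the
-- "same relationship listed twice" behavior described above.
SELECT * from cypher('passv2', $$
        MATCH (V)-[R:WORKED_ON]->(V2)
        RETURN V, R, V2
        UNION
        MATCH (V)<-[R:WORKED_ON]-(V2)
        RETURN V, R, V2
$$) as (V agtype, R agtype, V2 agtype);
```

LIMIT is left off here because its placement relative to UNION varies across Cypher implementations; compare plans with EXPLAIN ANALYZE before adopting either form.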

Hope this helps.

Kobaruon commented 4 months ago

Thanks everyone for the quick help and detailed analysis.