Open laurenceisla opened 1 year ago
Some examples. I'm adding this to my fixtures/data.sql file:
INSERT INTO test.clients (id, name) SELECT n+2, 'Test ' || n FROM generate_series(1,1000) AS n;
INSERT INTO test.projects (id, name, client_id) SELECT (n-1)*100+m+5, 'Test ' || m, n+2 FROM generate_series(1,1000) AS n, LATERAL generate_series(1,100) AS m;
CREATE INDEX projects_client_id ON projects (client_id);
Not a lot of data, but it shows.
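The SQL behind the plans that follow is not included in this comment. As a rough sketch only, the two filter styles being compared might look like the queries below. The alias `projects_clients` and the use of `count(*)` in place of PostgREST's JSON aggregation are illustrative assumptions, not the generated query.

```sql
-- Assumed, simplified shape of the embedding; not the SQL PostgREST emits.

-- Variant 1: test the whole embedded row.
EXPLAIN ANALYZE
SELECT count(*)
FROM test.projects
LEFT JOIN LATERAL (
  SELECT clients_1.name
  FROM test.clients AS clients_1
  WHERE clients_1.id = projects.client_id
) AS projects_clients ON TRUE
WHERE projects_clients IS DISTINCT FROM NULL;

-- Variant 2: test the join column, which must then be selected in the subquery.
EXPLAIN ANALYZE
SELECT count(*)
FROM test.projects
LEFT JOIN LATERAL (
  SELECT clients_1.id, clients_1.name
  FROM test.clients AS clients_1
  WHERE clients_1.id = projects.client_id
) AS projects_clients ON TRUE
WHERE projects_clients.id IS NOT NULL;
```

For the single-row case, add `AND projects.id = 1000` to the outer `WHERE` clause. Variant 2 relies on `clients.id` being the primary key, so it is never NULL for a matched row.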
IS DISTINCT FROM NULL for a single row:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16.63..16.65 rows=1 width=112) (actual time=0.028..0.029 rows=1 loops=1)
  ->  Nested Loop (cost=0.57..16.62 rows=1 width=43) (actual time=0.014..0.016 rows=1 loops=1)
        ->  Index Scan using projects_pkey on projects (cost=0.29..8.31 rows=1 width=15) (actual time=0.007..0.008 rows=1 loops=1)
              Index Cond: (id = 1000)
        ->  Index Scan using clients_pkey on clients clients_1 (cost=0.28..8.29 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
              Index Cond: (id = projects.client_id)
              Filter: (ROW(name) IS DISTINCT FROM NULL)
Planning Time: 0.219 ms
Execution Time: 0.056 ms
(9 rows)
IS NOT NULL for a single row:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16.63..16.65 rows=1 width=112) (actual time=0.021..0.021 rows=1 loops=1)
  ->  Nested Loop (cost=0.57..16.62 rows=1 width=43) (actual time=0.013..0.013 rows=1 loops=1)
        ->  Index Scan using projects_pkey on projects (cost=0.29..8.31 rows=1 width=15) (actual time=0.005..0.006 rows=1 loops=1)
              Index Cond: (id = 1000)
        ->  Index Scan using clients_pkey on clients clients_1 (cost=0.28..8.30 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
              Index Cond: ((id = projects.client_id) AND (id IS NOT NULL))
Planning Time: 0.650 ms
Execution Time: 0.044 ms
(8 rows)
IS DISTINCT FROM NULL for all rows:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2829.22..2829.24 rows=1 width=112) (actual time=158.937..158.939 rows=1 loops=1)
  ->  Hash Join (cost=28.48..1834.16 rows=99506 width=43) (actual time=0.533..25.425 rows=100004 loops=1)
        Hash Cond: (projects.client_id = clients_1.id)
        ->  Seq Scan on projects (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..4.866 rows=100005 loops=1)
        ->  Hash (cost=16.02..16.02 rows=997 width=36) (actual time=0.515..0.516 rows=1002 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 73kB
              ->  Seq Scan on clients clients_1 (cost=0.00..16.02 rows=997 width=36) (actual time=0.007..0.337 rows=1002 loops=1)
                    Filter: (ROW(name) IS DISTINCT FROM NULL)
Planning Time: 0.287 ms
Execution Time: 158.977 ms
(10 rows)
IS NOT NULL for all rows:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2336.75..2336.77 rows=1 width=112) (actual time=75.254..75.255 rows=1 loops=1)
  ->  Hash Join (cost=31.05..1836.73 rows=100005 width=39) (actual time=0.493..21.285 rows=100004 loops=1)
        Hash Cond: (projects.client_id = clients_1.id)
        ->  Seq Scan on projects (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..4.369 rows=100005 loops=1)
        ->  Hash (cost=18.52..18.52 rows=1002 width=36) (actual time=0.485..0.486 rows=1002 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 63kB
              ->  Seq Scan on clients clients_1 (cost=0.00..18.52 rows=1002 width=36) (actual time=0.007..0.376 rows=1002 loops=1)
                    Filter: (id IS NOT NULL)
Planning Time: 0.241 ms
Execution Time: 75.278 ms
(10 rows)
That is about twice as fast as the IS DISTINCT FROM approach (roughly 75 ms vs. 159 ms execution time).
Note that until now, the plans for both approaches were basically the same. But consider this:
IS NOT DISTINCT FROM NULL for all rows (anti join):
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1839.22..1839.24 rows=1 width=112) (actual time=24.212..24.214 rows=1 loops=1)
  ->  Hash Left Join (cost=28.55..1834.22 rows=500 width=43) (actual time=0.347..24.203 rows=1 loops=1)
        Hash Cond: (projects.client_id = clients_1.id)
        Filter: ((ROW(clients_1.name)) IS NOT DISTINCT FROM NULL)
        Rows Removed by Filter: 100004
        ->  Seq Scan on projects (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..6.062 rows=100005 loops=1)
        ->  Hash (cost=16.02..16.02 rows=1002 width=36) (actual time=0.334..0.335 rows=1002 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 73kB
              ->  Seq Scan on clients clients_1 (cost=0.00..16.02 rows=1002 width=36) (actual time=0.006..0.191 rows=1002 loops=1)
Planning Time: 0.215 ms
Execution Time: 24.242 ms
(11 rows)
Note that we still have a left join here for the distinct approach...
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1948.13..1948.15 rows=1 width=112) (actual time=22.187..22.188 rows=1 loops=1)
  ->  Hash Anti Join (cost=31.05..1948.12 rows=1 width=39) (actual time=0.736..22.178 rows=1 loops=1)
        Hash Cond: (projects.client_id = clients_1.id)
        ->  Seq Scan on projects (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..6.207 rows=100005 loops=1)
        ->  Hash (cost=18.52..18.52 rows=1002 width=36) (actual time=0.726..0.726 rows=1002 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 63kB
              ->  Seq Scan on clients clients_1 (cost=0.00..18.52 rows=1002 width=36) (actual time=0.008..0.562 rows=1002 loops=1)
Planning Time: 0.217 ms
Execution Time: 22.217 ms
(9 rows)
... but a Hash Anti Join node for the IS NULL approach.
Of course, the performance of this query is roughly the same for both approaches, because there is only one project without a client. But this shows that PostgreSQL can rewrite and optimize the query better when the distinct approach is not used.
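As a hedged sketch of the anti-join comparison above, the two filters might be written as follows. The alias `projects_clients` and the `count(*)` aggregate are illustrative assumptions; the SQL that PostgREST actually generates is not shown in this thread.

```sql
-- Assumed, simplified shape of the embedding; not the SQL PostgREST emits.

-- Whole-row test: keep projects with no matching client.
EXPLAIN ANALYZE
SELECT count(*)
FROM test.projects
LEFT JOIN LATERAL (
  SELECT clients_1.name
  FROM test.clients AS clients_1
  WHERE clients_1.id = projects.client_id
) AS projects_clients ON TRUE
WHERE projects_clients IS NOT DISTINCT FROM NULL;

-- Join-column test: same intent, expressed on the non-nullable key.
EXPLAIN ANALYZE
SELECT count(*)
FROM test.projects
LEFT JOIN LATERAL (
  SELECT clients_1.id, clients_1.name
  FROM test.clients AS clients_1
  WHERE clients_1.id = projects.client_id
) AS projects_clients ON TRUE
WHERE projects_clients.id IS NULL;
```

The point of comparison is the join node in each EXPLAIN output: a left join followed by a filter in the first case, versus a join type the planner can choose itself in the second.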
After the discussion that started in https://github.com/PostgREST/postgrest/pull/2951#issuecomment-1720780468, the conclusion is that changing table IS DISTINCT FROM NULL to table.join_column IS NOT NULL should be more performant, because PostgreSQL would treat the latter as an INNER JOIN.
Perhaps we need to change how we build the embedding subqueries to allow this. The IS DISTINCT FROM is specified here: https://github.com/PostgREST/postgrest/blob/add4dfeed5b1d5aa6028a053c921953f04af825d/src/PostgREST/Query/SqlFragment.hs#L340