PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License
23.27k stars 1.02k forks source link

Make the Null Filtering query more performant #2961

Open laurenceisla opened 1 year ago

laurenceisla commented 1 year ago

After the discussion that started in https://github.com/PostgREST/postgrest/pull/2951#issuecomment-1720780468, the conclusion is that changing table IS DISTINCT FROM NULL to table.join_column IS NOT NULL should be more performant because PostgreSQL would treat that last one as an INNER JOIN.

Perhaps we need to change how we build the embedding sub queries to allow this. The IS DISTINCT FROM is specified here:

https://github.com/PostgREST/postgrest/blob/add4dfeed5b1d5aa6028a053c921953f04af825d/src/PostgREST/Query/SqlFragment.hs#L340

wolfgangwalther commented 1 year ago

Some examples. I'm adding this to my fixtures/data.sql file:

INSERT INTO test.clients (id, name) SELECT n+2, 'Test ' || n FROM generate_series(1,1000) AS n;

INSERT INTO test.projects (id, name, client_id) SELECT (n-1)*100+m+5, 'Test ' || m, n+2 FROM generate_series(1,1000) AS n, LATERAL generate_series(1,100) AS m;

CREATE INDEX projects_client_id ON projects (client_id);

Not a lot of data, but it shows.

IS DISTINCT FROM NULL for a single row

Query ```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", row_to_json("projects_clients_1".*) AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT "clients_1"."name" FROM "test"."clients" AS "clients_1" WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE WHERE "projects_clients_1" IS DISTINCT FROM NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source WHERE id=1000) _postgrest_t; ```
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.028..0.029 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.014..0.016 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.007..0.008 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.29 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: (id = projects.client_id)
               Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.219 ms
 Execution Time: 0.056 ms
(9 rows)

"join column is not null" for a single row

Query ```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", "projects_clients_1"."clients_1_json" AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT row_to_json("clients_1_row".*) AS "clients_1_json", "clients_1"."id" FROM "test"."clients" AS "clients_1", LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1" ON "projects_clients_1"."id" = "test"."projects"."client_id" WHERE "projects_clients_1"."id" IS NOT NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source WHERE id=1000) _postgrest_t; ```
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.021..0.021 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.013..0.013 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.005..0.006 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.30 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: ((id = projects.client_id) AND (id IS NOT NULL))
 Planning Time: 0.650 ms
 Execution Time: 0.044 ms
(8 rows)

IS DISTINCT FROM NULL for all rows

Query ```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", row_to_json("projects_clients_1".*) AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT "clients_1"."name" FROM "test"."clients" AS "clients_1" WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE WHERE "projects_clients_1" IS DISTINCT FROM NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t; ```
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2829.22..2829.24 rows=1 width=112) (actual time=158.937..158.939 rows=1 loops=1)
   ->  Hash Join  (cost=28.48..1834.16 rows=99506 width=43) (actual time=0.533..25.425 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..4.866 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=997 width=36) (actual time=0.515..0.516 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=997 width=36) (actual time=0.007..0.337 rows=1002 loops=1)
                     Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.287 ms
 Execution Time: 158.977 ms
(10 rows)

"join column is not null" for all rows

Query ```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", "projects_clients_1"."clients_1_json" AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT row_to_json("clients_1_row".*) AS "clients_1_json", "clients_1"."id" FROM "test"."clients" AS "clients_1", LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1" ON "projects_clients_1"."id" = "test"."projects"."client_id" WHERE "projects_clients_1"."id" IS NOT NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t; ```
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2336.75..2336.77 rows=1 width=112) (actual time=75.254..75.255 rows=1 loops=1)
   ->  Hash Join  (cost=31.05..1836.73 rows=100005 width=39) (actual time=0.493..21.285 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..4.369 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.485..0.486 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.007..0.376 rows=1002 loops=1)
                     Filter: (id IS NOT NULL)
 Planning Time: 0.241 ms
 Execution Time: 75.278 ms
(10 rows)

Twice as fast compared to the IS DISTINCT approach.

Note that until now, all plans where basically the same. But consider this:

IS NOT DISTINCT FROM NULL for all rows (anti join)

Query ```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", row_to_json("projects_clients_1".*) AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT "clients_1"."name" FROM "test"."clients" AS "clients_1" WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE WHERE "projects_clients_1" IS NOT DISTINCT FROM NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t; ```
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1839.22..1839.24 rows=1 width=112) (actual time=24.212..24.214 rows=1 loops=1)
   ->  Hash Left Join  (cost=28.55..1834.22 rows=500 width=43) (actual time=0.347..24.203 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         Filter: ((ROW(clients_1.name)) IS NOT DISTINCT FROM NULL)
         Rows Removed by Filter: 100004
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..6.062 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=1002 width=36) (actual time=0.334..0.335 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=1002 width=36) (actual time=0.006..0.191 rows=1002 loops=1)
 Planning Time: 0.215 ms
 Execution Time: 24.242 ms
(11 rows)

Note that we still have a left join here for the distinct approach...

"join column is null" for all rows (anti join)

Query ```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", "projects_clients_1"."clients_1_json" AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT row_to_json("clients_1_row".*) AS "clients_1_json", "clients_1"."id" FROM "test"."clients" AS "clients_1", LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1" ON "projects_clients_1"."id" = "test"."projects"."client_id" WHERE "projects_clients_1"."id" IS NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t; ```
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1948.13..1948.15 rows=1 width=112) (actual time=22.187..22.188 rows=1 loops=1)
   ->  Hash Anti Join  (cost=31.05..1948.12 rows=1 width=39) (actual time=0.736..22.178 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..6.207 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.726..0.726 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.008..0.562 rows=1002 loops=1)
 Planning Time: 0.217 ms
 Execution Time: 22.217 ms
(9 rows)

... but a hash anti join node for the is null approach.

Of course the performance of this query is roughly the same - because there is only one project without client. But this shows that postgresql can rewrite / optimize the query better when not using the distinct approach.