Make the Null Filtering query more performant

Some examples. I'm adding this to my fixtures/data.sql file:

INSERT INTO test.clients (id, name) SELECT n+2, 'Test ' || n FROM generate_series(1,1000) AS n;

INSERT INTO test.projects (id, name, client_id) SELECT (n-1)*100+m+5, 'Test ' || m, n+2 FROM generate_series(1,1000) AS n, LATERAL generate_series(1,100) AS m;

CREATE INDEX projects_client_id ON projects (client_id);

Not a lot of data, but it shows.

`IS DISTINCT FROM NULL` for a single row

Query

```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", row_to_json("projects_clients_1".*) AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT "clients_1"."name" FROM "test"."clients" AS "clients_1" WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE WHERE "projects_clients_1" IS DISTINCT FROM NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source WHERE id=1000) _postgrest_t; ```

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.028..0.029 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.014..0.016 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.007..0.008 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.29 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: (id = projects.client_id)
               Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.219 ms
 Execution Time: 0.056 ms
(9 rows)

"join column is not null" for a single row

Query

```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", "projects_clients_1"."clients_1_json" AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT row_to_json("clients_1_row".*) AS "clients_1_json", "clients_1"."id" FROM "test"."clients" AS "clients_1", LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1" ON "projects_clients_1"."id" = "test"."projects"."client_id" WHERE "projects_clients_1"."id" IS NOT NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source WHERE id=1000) _postgrest_t; ```

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.021..0.021 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.013..0.013 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.005..0.006 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.30 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: ((id = projects.client_id) AND (id IS NOT NULL))
 Planning Time: 0.650 ms
 Execution Time: 0.044 ms
(8 rows)

`IS DISTINCT FROM NULL` for all rows

Query

                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2829.22..2829.24 rows=1 width=112) (actual time=158.937..158.939 rows=1 loops=1)
   ->  Hash Join  (cost=28.48..1834.16 rows=99506 width=43) (actual time=0.533..25.425 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..4.866 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=997 width=36) (actual time=0.515..0.516 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=997 width=36) (actual time=0.007..0.337 rows=1002 loops=1)
                     Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.287 ms
 Execution Time: 158.977 ms
(10 rows)

"join column is not null" for all rows

Query

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2336.75..2336.77 rows=1 width=112) (actual time=75.254..75.255 rows=1 loops=1)
   ->  Hash Join  (cost=31.05..1836.73 rows=100005 width=39) (actual time=0.493..21.285 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..4.369 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.485..0.486 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.007..0.376 rows=1002 loops=1)
                     Filter: (id IS NOT NULL)
 Planning Time: 0.241 ms
 Execution Time: 75.278 ms
(10 rows)

Twice as fast compared to the IS DISTINCT approach.

Note that until now, all plans where basically the same. But consider this:

`IS NOT DISTINCT FROM NULL` for all rows (anti join)

Query

```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", row_to_json("projects_clients_1".*) AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT "clients_1"."name" FROM "test"."clients" AS "clients_1" WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE WHERE "projects_clients_1" IS NOT DISTINCT FROM NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t; ```

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1839.22..1839.24 rows=1 width=112) (actual time=24.212..24.214 rows=1 loops=1)
   ->  Hash Left Join  (cost=28.55..1834.22 rows=500 width=43) (actual time=0.347..24.203 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         Filter: ((ROW(clients_1.name)) IS NOT DISTINCT FROM NULL)
         Rows Removed by Filter: 100004
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..6.062 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=1002 width=36) (actual time=0.334..0.335 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=1002 width=36) (actual time=0.006..0.191 rows=1002 loops=1)
 Planning Time: 0.215 ms
 Execution Time: 24.242 ms
(11 rows)

Note that we still have a left join here for the distinct approach...

"join column is null" for all rows (anti join)

Query

```sql EXPLAIN ANALYZE WITH pgrst_source AS (SELECT projects.id, "test"."projects"."name", "projects_clients_1"."clients_1_json" AS "clients" FROM "test"."projects" LEFT JOIN LATERAL (SELECT row_to_json("clients_1_row".*) AS "clients_1_json", "clients_1"."id" FROM "test"."clients" AS "clients_1", LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1" ON "projects_clients_1"."id" = "test"."projects"."client_id" WHERE "projects_clients_1"."id" IS NULL) SELECT NULL::bigint AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, coalesce(json_agg(_postgrest_t), '[]') AS body, nullif(current_setting('response.headers', TRUE), '') AS response_headers, nullif(current_setting('response.status', TRUE), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t; ```

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1948.13..1948.15 rows=1 width=112) (actual time=22.187..22.188 rows=1 loops=1)
   ->  Hash Anti Join  (cost=31.05..1948.12 rows=1 width=39) (actual time=0.736..22.178 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..6.207 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.726..0.726 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.008..0.562 rows=1002 loops=1)
 Planning Time: 0.217 ms
 Execution Time: 22.217 ms
(9 rows)

... but a hash anti join node for the is null approach.

Of course the performance of this query is roughly the same - because there is only one project without client. But this shows that postgresql can rewrite / optimize the query better when not using the distinct approach.

PostgREST / postgrest