diesel-rs / diesel

A safe, extensible ORM and Query Builder for Rust
https://diesel.rs
Apache License 2.0

Fixes #4349 #4350

Open weiznich opened 2 days ago

weiznich commented 2 days ago

This commit fixes an issue that allowed a runtime error to be triggered by passing an array of arrays to .eq_any(). On PostgreSQL we rewrite queries containing IN expressions to = ANY(), as that is more efficient (it allows caching and binding all values at once). That's not possible for arrays of arrays, as we do not support nested arrays yet.

The fix introduces an associated constant on the SqlType trait that tracks whether a SQL type is an Array<T>. This information is then used to conditionally generate the "right" SQL. By defaulting the constant to false we do not break existing code; the derive sets the right value based on the type name.
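The mechanism can be sketched in plain Rust (a simplified, hypothetical model, not Diesel's actual trait or query builder; names like eq_any_sql are illustrative):

```rust
use std::marker::PhantomData;

// Sketch of the idea: an associated const on the SQL type trait,
// defaulting to false so existing impls keep compiling.
trait SqlType {
    const IS_ARRAY: bool = false;
}

#[allow(dead_code)]
struct Integer;
impl SqlType for Integer {}

#[allow(dead_code)]
struct Array<T>(PhantomData<T>);
impl<T: SqlType> SqlType for Array<T> {
    const IS_ARRAY: bool = true;
}

// Conditionally generate the "right" SQL: flat values become a single
// `= ANY($1)` array bind, while values that are themselves arrays fall
// back to `IN (...)` with one bind per value, since nested arrays
// aren't supported.
fn eq_any_sql<T: SqlType>(column: &str, value_count: usize) -> String {
    if T::IS_ARRAY {
        let binds: Vec<String> = (1..=value_count).map(|i| format!("${i}")).collect();
        format!("{column} IN ({})", binds.join(", "))
    } else {
        format!("{column} = ANY($1)")
    }
}
```

In this sketch, eq_any_sql::<Integer>("id", 3) yields id = ANY($1), while eq_any_sql::<Array<Integer>>("tags", 3) yields tags IN ($1, $2, $3).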

cc @Ten0

(I would suggest including this PR in the patch release as well.)

Ten0 commented 2 days ago

Thanks a lot for working on this! ⚡️ It seems ok to go with the hidden associated const. 🙂 I had not considered making this not part of the public API, but it looks like that solves my concern: we might publicly release something that is not as "powerful" (or fundamentally appropriate, given that we don't have associated-const equality) as a type alias would be, and then want to change it later. I did mostly convince myself that additional array nesting levels can't compile for other reasons anyway, so I think I'm happy with not being able to do that. :)

However, regarding the approach of = ANY (VALUES (...), (...)) vs IN: IN seems to be mostly syntax sugar for expanding into multiple OR clauses, whereas = ANY (VALUES (...), (...)) generates the following query plan:

Nested Loop  (cost=0.19..16.41 rows=2 width=40) (actual time=0.021..0.023 rows=2 loops=1)
  ->  Unique  (cost=0.04..0.04 rows=2 width=32) (actual time=0.015..0.016 rows=2 loops=1)
        ->  Sort  (cost=0.04..0.04 rows=2 width=32) (actual time=0.014..0.015 rows=2 loops=1)
              Sort Key: "*VALUES*".column1
              Sort Method: quicksort  Memory: 25kB
              ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32) (actual time=0.001..0.001 rows=2 loops=1)
  ->  Index Scan using value_index_key on table  (cost=0.15..8.17 rows=1 width=40) (actual time=0.002..0.002 rows=1 loops=2)
        Index Cond: (value = "*VALUES*".column1)
Planning Time: 0.231 ms
Execution Time: 0.033 ms

which seems more appropriate when there are many values. That being said, I've also seen occurrences where Postgres would switch to a seq scan/bitmap heap scan with an O(n²) recheck on the array when doing = ANY (ARRAY), which, given this query plan, leaves me wondering whether doing = ANY (VALUES (...), (...)) wouldn't solve this issue even in the non-array case 😅

So overall it looks like we want the one that behaves the closest to what = ANY (ARRAY) does, and if relevant for other scenarios we might want to open a VALUES subquery API for smarter query plans on large arrays.

I'll investigate further on what query plans get generated for all 3 cases tomorrow. 🙂

I would suggest including this PR in the patch release as well

This seems edge-case enough (considering that, unless I missed an existing issue, nobody else has reported it since Diesel 2.0) that I would tend not to consider it a release blocker 🙂

Ten0 commented 2 days ago

So I have investigated query plans a bit more, and it looks like doing IN is probably the most consistent approach, because IN gets rewritten as = ANY in a lot of cases, and VALUES generates different query plans from ANY/IN even in the non-array case.

Ten0 commented 2 days ago

TL;DR: an argument for using = ANY(subquery), discarded in the next post.

So I found this comment: https://dba.stackexchange.com/a/335502/196014, which suggests that = ANY may actually be not-that-dumb:

When the length of the array is very large, the other queries simply do a parallel seq scan. This is very fast, because the "Filter where id=ANY(...)" isn't dumb, it uses some kind of fast search like hashing or bisect, it doesn't compare every row with every value of the array.

So it seems that = ANY ( ... ) may sometimes be more performant than IN.

But my experience was contrary to this: it was, in fact, dumb, and I would see n² CPU consumption profiles.

So I've performed new tests (see the code and query plans below).

Maybe I should send a message to the PG mailing list reporting this as a bug, because the two recheck-on-array scenarios are very inconsistent with regard to their algorithmic complexity...

So I don't know what to think... If = ANY is supposed to be smart but IN isn't when we pass in an array, maybe we want the "good worst case" of VALUES, to avoid surprising perf loss for users who don't expect any significant perf difference between eq_any(array_of_values) and eq_any(array_of_values_that_happen_to_also_be_arrays).

But if ANY is known to sometimes be dumb, and from the PG perspective it's the user's responsibility to use VALUES instead of = ANY(ARRAY) when relevant, then, as stated above, the more consistent approach of using IN seems to win.

There is also a case for doing IN anyway, because that's the closest to SQL. But if = ANY(VALUES) is really always the safest, maybe we want to expose that and recommend its usage over eq_any.

Tests:

BEGIN;
CREATE TABLE tmp_benchmark(
    id Int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    value text NOT NULL
);
-- Insert 1M values
INSERT INTO tmp_benchmark(value) SELECT DISTINCT (floor(random() * 1000000000000)::int8)::text FROM generate_series(1,1000000);

ALTER TABLE tmp_benchmark ADD CONSTRAINT tmp_benchmark_unique_value_and_corresponding_index UNIQUE(value);

CREATE INDEX tmp_benchmark_benchmark_value_hash ON tmp_benchmark USING HASH (value);

ANALYZE tmp_benchmark;

-- Randomly sample 50k of those values
CREATE TEMPORARY TABLE lookup_test ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 50000;

EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test);
/* Gives:
Hash Semi Join  (cost=1445.00..20996.25 rows=50000 width=20) (actual time=4.595..107.748 rows=50000 loops=1)
  Hash Cond: (tmp_benchmark.value = lookup_test.value)
  ->  Seq Scan on tmp_benchmark  (cost=0.00..16370.00 rows=1000000 width=20) (actual time=0.008..39.863 rows=1000000 loops=1)
  ->  Hash  (cost=820.00..820.00 rows=50000 width=12) (actual time=4.566..4.567 rows=50000 loops=1)
        Buckets: 65536  Batches: 1  Memory Usage: 2704kB
        ->  Seq Scan on lookup_test  (cost=0.00..820.00 rows=50000 width=12) (actual time=0.006..1.836 rows=50000 loops=1)
Planning Time: 0.247 ms
Execution Time: 108.723 ms
 */
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=985.09..1024.13 rows=10 width=20) (actual time=26.626..14440.230 rows=50000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Rows Removed by Index Recheck: 7
  Heap Blocks: exact=6369
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=945.00..945.01 rows=1 width=32) (actual time=4.606..4.607 rows=1 loops=1)
          ->  Seq Scan on lookup_test  (cost=0.00..820.00 rows=50000 width=12) (actual time=0.007..1.998 rows=50000 loops=1)
  ->  Bitmap Index Scan on tmp_benchmark_benchmark_value_hash  (cost=0.00..40.08 rows=10 width=0) (actual time=25.848..25.848 rows=50007 loops=1)
        Index Cond: (value = ANY ($0))
Planning Time: 0.098 ms
Execution Time: 14441.520 ms
 */

-- Randomly sample 100k of those values

CREATE TEMPORARY TABLE lookup_test_2 ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 100000;

EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test_2);
/* Gives:
Nested Loop  (cost=1555.20..3025.00 rows=78336 width=20) (actual time=15.002..101.907 rows=100000 loops=1)
  ->  HashAggregate  (cost=1555.20..1557.20 rows=200 width=32) (actual time=14.995..26.162 rows=100000 loops=1)
        Group Key: lookup_test_2.value
        Batches: 5  Memory Usage: 6977kB  Disk Usage: 232kB
        ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.006..4.192 rows=100000 loops=1)
  ->  Index Scan using tmp_benchmark_benchmark_value_hash on tmp_benchmark  (cost=0.00..7.88 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=100000)
        Index Cond: (value = lookup_test_2.value)
        Rows Removed by Index Recheck: 0
Planning Time: 0.148 ms
Execution Time: 103.924 ms
 */
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=1595.29..1634.33 rows=10 width=20) (actual time=44.862..58360.269 rows=100000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Rows Removed by Index Recheck: 16
  Heap Blocks: exact=6370
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.109..9.110 rows=1 loops=1)
          ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..4.002 rows=100000 loops=1)
  ->  Bitmap Index Scan on tmp_benchmark_benchmark_value_hash  (cost=0.00..40.08 rows=10 width=0) (actual time=44.242..44.243 rows=100018 loops=1)
        Index Cond: (value = ANY ($0))
Planning Time: 0.089 ms
Execution Time: 58364.590 ms
 */

-- Note how with 2x the number of values, execution time is 4x the above. Monitoring CPU while this query runs shows 100% CPU, unlike the subselect-based variant.

DROP INDEX tmp_benchmark_benchmark_value_hash;
-- Once we remove the hash index, PG is not stupid anymore: even with the same 100k sample and a bitmap heap scan plan, it somehow doesn't go n²

-- On the same 100k sample:
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=1599.54..1638.58 rows=10 width=20) (actual time=236.604..252.188 rows=100000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Heap Blocks: exact=6370
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.091..9.093 rows=1 loops=1)
          ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.009..3.946 rows=100000 loops=1)
  ->  Bitmap Index Scan on tmp_benchmark_unique_value_and_corresponding_index  (cost=0.00..44.33 rows=10 width=0) (actual time=236.169..236.169 rows=100000 loops=1)
        Index Cond: (value = ANY ($0))
Planning Time: 0.161 ms
Execution Time: 254.145 ms
*/

-- So despite doing a bitmap heap scan on the same 100k-element array as above, this one seemingly doesn't go n².

ROLLBACK;

Ten0 commented 2 hours ago

For completeness: while the "reproducer" above is indeed n² with expressions like = ANY(ARRAY(subselect)), I can't seem to reproduce the same n² with an actual = ANY($1) anymore, so that in fact probably behaves fine (https://github.com/postgres/postgres/commit/50e17ad281b8d1c1b410c9833955bc80fbad4078). This in turn means that the array case of array_value IN (arr1, arr2, arr3) being expanded as array_value = arr1 OR array_value = arr2 OR array_value = arr3 may be:

  1. Way slower than people might expect, because the ORs are rechecked linearly (n²)
  2. Crashing faster, because Postgres will return a "stack depth limit exceeded" error when there are more than 10k values.
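To make point 1 concrete, here is a toy sketch in plain Rust (nothing Diesel- or Postgres-specific; the functions just model comparison counts under the stated assumptions):

```rust
use std::collections::HashSet;

// Models `row = arr1 OR row = arr2 OR ...`: each row is compared against
// the values one by one, so n matching rows against n values costs
// O(n²) comparisons overall.
fn or_chain_comparisons(rows: &[&str], values: &[&str]) -> usize {
    let mut comparisons = 0;
    for row in rows {
        for v in values {
            comparisons += 1;
            if v == row {
                break; // short-circuit once a disjunct matches
            }
        }
    }
    comparisons
}

// Models a hashed `= ANY($1)`: one set probe per row, O(n) overall.
fn hashed_matches(rows: &[&str], values: &HashSet<&str>) -> usize {
    rows.iter().filter(|row| values.contains(*row)).count()
}
```

With rows = ["a", "b", "c"] and the OR chain listing values in the order ["c", "b", "a"], or_chain_comparisons performs 3 + 2 + 1 = 6 comparisons (the n(n+1)/2 worst case), while hashed_matches probes the set only once per row.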