cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.11k stars 3.81k forks source link

sql: warn/guard again misuse of `INSERT .. ON CONFLICT` and partial indexes #119117

Open sunshowers opened 9 months ago

sunshowers commented 9 months ago

Describe the problem

Postgres offers two ways to express conditional upserts with INSERT INTO:

  1. ON CONFLICT ... WHERE ... DO UPDATE
  2. ON CONFLICT ... DO UPDATE ... WHERE

(Though the Postgres docs say in style 1, the WHERE only takes an index_predicate, which I'm not sure what the difference is)

Most users use style 2, which works fine. However, with 1, the condition in the WHERE clause is silently dropped.

We hit this because with the Rust Diesel ORM, the only way to express such a query appears to be with style 1.

We're tracking this in https://github.com/oxidecomputer/omicron/issues/5047.

To Reproduce

Start up a single-node cluster:

./cockroach start-single-node --insecure

Then, connect to it with cockroach sql in another terminal and issue the following commands:

> CREATE TABLE foo (id UUID PRIMARY KEY, data INT NOT NULL);

> INSERT INTO foo (id, data) VALUES ('fe0abe6b-f07a-43c4-a82f-9258a6ed89cc', 1);

> INSERT INTO foo (id, data) VALUES ('fe0abe6b-f07a-43c4-a82f-9258a6ed89cc', 1) ON CONFLICT (id) WHERE data = 2 DO UPDATE SET data = 3;

> SELECT * FROM foo;

The output of the SELECT command is:

                   id                  | data
---------------------------------------+-------
  fe0abe6b-f07a-43c4-a82f-9258a6ed89cc |    3

It is even possible to pass in a column that does not exist:

> INSERT INTO foo (id, data) VALUES ('fe0abe6b-f07a-43c4-a82f-9258a6ed89cc', 1) ON CONFLICT (id) WHERE non_existent = 1 DO UPDATE SET data = 4;

> SELECT * FROM foo;

                   id                  | data
---------------------------------------+-------
  fe0abe6b-f07a-43c4-a82f-9258a6ed89cc |    4

Expected behavior

I expected to either see data = 1 for the row, or for the query to be rejected.

Environment:

Jira issue: CRDB-36038

blathers-crl[bot] commented 9 months ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

sunshowers commented 9 months ago

We hit this because with the Rust Diesel ORM, the only way to express such a query appears to be with style 1.

For anyone searching for this in the future -- turns out this is not quite correct because Diesel's FilterDsl trait is implemented for these kinds of statements, even though it doesn't show up in the documentation.

sunshowers commented 9 months ago

Interestingly, Postgres itself has the same issue, where the WHERE clause is silently ignored with the above queries. However, Postgres does error out for non-existent columns.

michae2 commented 9 months ago

Thanks for opening an issue!

(Though the Postgres docs say in style 1, the WHERE only takes an index_predicate, which I'm not sure what the difference is)

My understanding of this WHERE before the DO UPDATE is that it is not the same as a normal WHERE, and is only used to pick a partial unique index to use as the arbiter index. (I think it's supposed to match the WHERE that would be in the CREATE UNIQUE INDEX ... (...) WHERE ... statement of a partial unique index.)

Looking at the plan of your insert, we can see that the ON CONFLICT (id) WHERE data = 2 is causing us to pick the primary index as the arbiter, and we're ignoring the WHERE data = 2 since there is no partial unique index with this predicate.

demo@127.0.0.1:26257/demoapp/defaultdb> EXPLAIN INSERT INTO foo (id, data) VALUES ('fe0abe6b-f07a-43c4-a82f-9258a6ed89cc', 1) ON CONFLICT (id) WHERE data = 2 DO UPDATE SET data = 3;
                                                      info
----------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • upsert
  │ into: foo(id, data)
  │ auto commit
  │ arbiter indexes: foo_pkey    <- picking primary index as arbiter
  │
  └── • render
      │
      └── • render
          │
          └── • cross join (left outer)
              │ estimated row count: 1
              │
              ├── • values
              │     size: 2 columns, 1 row
              │
              └── • scan
                    estimated row count: 1 (100% of the table; stats collected 7 minutes ago)
                    table: foo@foo_pkey
                    spans: [/'fe0abe6b-f07a-43c4-a82f-9258a6ed89cc' - /'fe0abe6b-f07a-43c4-a82f-9258a6ed89cc']
                    locking strength: for update

If we add WHERE foo.data = 2 as a normal WHERE clause at the end, it does show up in the plan:

demo@127.0.0.1:26257/demoapp/defaultdb> EXPLAIN INSERT INTO foo (id, data) VALUES ('fe0abe6b-f07a-43c4-a82f-9258a6ed89cc', 1) ON CONFLICT (id) WHERE data = 2 DO UPDATE SET data = 3 WHERE foo.data = 2;
                                                        info
--------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • upsert
  │ into: foo(id, data)
  │ auto commit
  │ arbiter indexes: foo_pkey
  │
  └── • render
      │
      └── • render
          │
          └── • filter
              │ estimated row count: 1
              │ filter: (id IS NULL) OR (data = 2)    <- filter on WHERE foo.data = 2 predicate
              │
              └── • cross join (left outer)
                  │ estimated row count: 1
                  │
                  ├── • values
                  │     size: 2 columns, 1 row
                  │
                  └── • scan
                        estimated row count: 1 (100% of the table; stats collected 9 minutes ago)
                        table: foo@foo_pkey
                        spans: [/'fe0abe6b-f07a-43c4-a82f-9258a6ed89cc' - /'fe0abe6b-f07a-43c4-a82f-9258a6ed89cc']

If we want to see the first WHERE in action, we need to add a partial unique index that matches the predicate, and we need to change the row to match that predicate:

CREATE UNIQUE INDEX ON foo (data) WHERE data = 2;
UPDATE foo SET data = 2 WHERE id = 'fe0abe6b-f07a-43c4-a82f-9258a6ed89cc';

Now if we change the row we're inserting to have data=2, and change ON CONFLICT (id) WHERE data = 2 to ON CONFLICT (data) WHERE data = 2 we pick the partial unique index as the arbiter:

demo@127.0.0.1:26257/demoapp/defaultdb> EXPLAIN INSERT INTO foo (id, data) VALUES ('fe0abe6b-f07a-43c4-a82f-9258a6ed89cc', 2) ON CONFLICT (data) WHERE data = 2 DO UPDATE SET data = 3;
                                               info
--------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • upsert
  │ into: foo(id, data)
  │ auto commit
  │ arbiter indexes: foo_data_key    <- using the partial unique index as arbiter
  │
  └── • render
      │
      └── • render
          │
          └── • cross join (left outer)
              │ estimated row count: 1
              │
              ├── • values
              │     size: 2 columns, 1 row
              │
              └── • scan
                    estimated row count: 0 (<0.01% of the table; stats collected 52 seconds ago)
                    table: foo@foo_data_key (partial index)
                    spans: FULL SCAN
                    locking strength: for update
(23 rows)

demo@127.0.0.1:26257/demoapp/defaultdb> INSERT INTO foo (id, data) VALUES ('fe0abe6b-f07a-43c4-a82f-9258a6ed89cc', 2) ON CONFLICT (data) WHERE data = 2 DO UPDATE SET data = 3;
INSERT 0 1

demo@127.0.0.1:26257/demoapp/defaultdb> SELECT * FROM foo;
                   id                  | data
---------------------------------------+-------
  fe0abe6b-f07a-43c4-a82f-9258a6ed89cc |    3
(1 row)

It does seem like we should throw an error if we're not finding a partial unique index matching the predicate as an arbiter.

sunshowers commented 9 months ago

Thank you for explaining the difference!

Based on what you said, yeah, I think throwing an error would be ideal. (I wonder how many bugs around this are lurking in various systems.)

mgartner commented 9 months ago

It does seem like we should throw an error if we're not finding a partial unique index matching the predicate as an arbiter.

I believe the lack of an error is the intended behavior. The reason CRDB and Postgres don't throw errors is because the primary index is a valid arbiter that can detect conflicts for the INSERT .. ON CONFLICT .. WHERE statement. A unique, non-partial index holds columns unique over all subsets of rows, so it can be used as an arbiter for an INSERT .. ON CONFLICT .. with any WHERE clause (even WHERE false!).

So this is something to keep in mind with an INSERT .. ON CONFLICT .. WHERE statement. It won't look for arbiter indexes with partial index predicates that exactly match the insert's WHERE clause. Instead, it will look for arbiter indexes with partial index predicates that are implied by the insert's WHERE clause, or, in other words, that are guaranteed to contain all the rows matching the insert's WHERE clause. In this case, the primary index is essentially a "partial index" with a predicate WHERE true which is implied by all possible expressions.

I do agree that we should error in the case where a column in the insert's WHERE clause does not exist. That's a bug.

mgartner commented 9 months ago

Quick follow-up:

The Postgres docs make the behavior clear:

index_predicate — Used to allow inference of partial unique indexes. Any indexes that satisfy the predicate (which need not actually be partial indexes) can be inferred. Follows CREATE INDEX format. SELECT privilege on any column appearing within index_predicate is required.

Our docs are not clear on this. In fact, I don't even see a mention of this ON CONFLICT .. WHERE variant of INSERT. I've created a docs issue for this: https://github.com/cockroachdb/docs/issues/18292

michae2 commented 9 months ago

It won't look for arbiter indexes with partial index predicates that exactly match the insert's WHERE clause. Instead, it will look for arbiter indexes with partial index predicates that are implied by the insert's WHERE clause, or, in other words, that are guaranteed to contain all the rows matching the insert's WHERE clause. In this cause, the primary index is essentially a "partial index" with a predicate WHERE true which is implied by all possible expressions.

Ah, ok, this explains why we weren't picking an index created with CREATE UNIQUE INDEX ON foo (id) WHERE data = 2 in my experiments! Thanks!

mgartner commented 9 months ago

Ah, ok, this explains why we weren't picking an index created with CREATE UNIQUE INDEX ON foo (id) WHERE data = 2 in my experiments! Thanks!

Exactly. There's a special case "fast path" where if we have a non-partial index for the ON CONFLICT columns then we use only that index as the arbiter because it's all we need to catch conflicts across all subsets of rows. If we have only partial unique indexes, we use all that match to ensure that we catch all conflicts across all subsets that are held to uniqueness of the columns. That's explained more here: https://github.com/cockroachdb/cockroach/blob/fb58047a7315330ed844340b6889ab42445eb03e/pkg/sql/opt/optbuilder/mutation_builder_arbiter.go#L129-L153

sunshowers commented 9 months ago

I'm definitely not an expert around Postgres, but from my perspective, not erroring out or even producing a warning for an ignored clause is pretty rough. It's also been completely new and surprising to my coworkers. (But I'll defer to y'all of course.)

mgartner commented 8 months ago

I don't thinks it's accurate to say that the WHERE clause is ignored, but I agree this is dreadfully confusing. The WHERE clause here is never used to filter rows. It's only used to allow the system to infer arbiter indexes which can detect conflicts. This example shows how it's used:

CREATE TABLE t (
  i INT,
  b BOOL
);

CREATE UNIQUE INDEX i1 ON t(i) WHERE b;
CREATE UNIQUE INDEX i2 ON t(i) WHERE NOT b;

INSERT INTO t VALUES (1, true);

-- This INSERT succeeds. With the "WHERE b" clause the system infers i1 as an
-- arbiter to detect and avoid conflicts, so the conflict of (1, true) is
-- detected and an UPDATE is performed instead of an INSERT. The (1, false) row
-- does not conflict, so an INSERT is performed.
INSERT INTO t VALUES (1, true), (1, false) ON CONFLICT (i) WHERE b DO UPDATE SET i = 11;

SELECT * FROM t;
--   i  | b
-- -----+----
--   11 | t
--    1 | f
-- (2 rows)

-- If we perform a similar INSERT again, it fails. The conflict of (1, false) is
-- not detected because "WHERE b" does not allow i2 to be used as an arbiter
-- ("WHERE b" does not imply "WHERE NOT b"). As a result, the system attempts to
-- INSERT (1, false), which fails.
INSERT INTO t VALUES (1, true), (1, false) ON CONFLICT (i) WHERE b DO UPDATE SET i = 111;
-- ERROR: duplicate key value violates unique constraint "i2"
-- SQLSTATE: 23505
-- DETAIL: Key (i)=(1) already exists.
-- CONSTRAINT: i2

There's added strangeness when you use this WHERE clause and a non-partial unique index exists:

DROP TABLE IF EXISTS t;
CREATE TABLE t (
  i INT,
  b BOOL
);

CREATE UNIQUE INDEX i1 ON t(i);
CREATE UNIQUE INDEX i2 ON t(i) WHERE b;

INSERT INTO t VALUES (1, false);

-- This INSERT succeeds. The system infers the non-partial unique index i1 as
-- the arbiter, so a conflict is detected and an UPDATE is performed instead of
-- an INSERT. We might reasonably expect this to only detect conflicts when the
-- INSERT's value for b is true and for the INSERT to fail.
INSERT INTO t VALUES (1, false) ON CONFLICT (i) WHERE b DO UPDATE SET i = 11;

SELECT * FROM t;
--  i  | b
-- ----+---
--  11 | f
-- (1 row)

I can't provide a very reasonable argument for why the INSERT in the example above succeeds. The strongest argument I can see is that:

  1. An arbiter index is inferred if the WHERE clause in the INSERT implies the unique index's WHERE clause.
  2. i1 can be thought of as a partial index with the predicate WHERE true.
  3. All expressions, including WHERE b, imply WHERE true.
  4. So i1 can be inferred as an arbiter index.

This matches the quote from Postgres's documentation in my earlier response:

index_predicate — Used to allow inference of partial unique indexes. Any indexes that satisfy the predicate (which need not actually be partial indexes) can be inferred.

But this justification sounds like an implementation detail. It doesn't appear to match the declarative nature of SQL. It feels more like we're telling the system how to do something than telling the system what the logical outcome should be.

Perhaps this behavior is more accidental than intentional, and Postgres (and now CockroachDB) are bound to maintain it. I can imagine that with the addition of partial indexes there was a new need to be able to differentiate between conflicts of multiple partial indexes. This method of inference with a WHERE clause was probably chosen because it closely matches how the system determines if a partial index can be used to satisfy a query. And the behavior in the presence on non-partial index is perhaps an unintended side-effect.

We could add a WARNING notice if an INSERT .. ON CONFLICT has a WHERE clause and a non-partial arbiter is selected. I think in that case it's reasonable to assume that the user's expected behavior and the statement's actual behavior may not match.

sunshowers commented 8 months ago

Having spent more time experimenting, I agree with what you said. I think, at the very least, producing a warning if ON CONFLICT ... WHERE is specified with a full index would be great.

FWIW I wrote up my findings here. We've made it so that we no longer permit ON CONFLICT ... WHERE generally, and for partial indexes we use WHERE false to make crdb automatically determine the right bounds. We don't have any need for more specific bounds than the one specified on the index, so it works out.

https://github.com/oxidecomputer/omicron/blob/ae39f2633e177bae3c53bddf9e9fc196facdcb61/nexus/db-queries/src/db/on_conflict_ext.rs#L20-L307

mgartner commented 8 months ago

Thank you for sharing your notes. I learned quite a few things!

It's interesting that the WHERE false trick does not work in Postgres. My guess is that their "implication prover" doesn't understand that false implies everything because it's useless in practice when proving whether or not a partial index can be used in a SELECT. If the filter is false then there's no need to scan any index - the result set is always empty.

I do have some bad news, unfortunately. The WHERE false trick will not work in CockroachDB if:

  1. The INSERT has ON CONFLICT .. DO UPDATE form (DO NOTHING is fine).
  2. The table has two partial unique indexes on the same set of columns.
  3. And, the table does not have a non-partial unique index on that set of columns.

You'll run into the limitation described in https://github.com/cockroachdb/cockroach/issues/53170 and the insert will result in an error. For example:

CREATE TABLE t (
  i INT,
  j INT,
  b BOOL
);

CREATE UNIQUE INDEX i1 ON t(i, j) WHERE NOT b;
CREATE UNIQUE INDEX i2 ON t(j, i) WHERE b;

INSERT INTO t VALUES (1, 2) ON CONFLICT (i, j) WHERE false DO UPDATE SET i = 11;
-- ERROR: unimplemented: there are multiple unique or exclusion constraints matching the ON CONFLICT specification
-- SQLSTATE: 0A000
-- HINT: You have attempted to use a feature that is not yet implemented.
-- See: https://go.crdb.dev/issue-v/53170/v23.1

-- If we add a non-partial unique index, the INSERT succeeds because the
-- non-partial index is selected as the sole arbiter.
CREATE UNIQUE INDEX i0 ON t(j, i);

INSERT INTO t VALUES (1, 2) ON CONFLICT (i, j) WHERE false DO UPDATE SET i = 11;
-- INSERT 0 1

There is some comfort in the fact that you'll always run into this error given the conditions above, regardless of the insert values. So I think you'd likely catch this during development or testing, as long as you have coverage of these types of INSERTs.

sunshowers commented 8 months ago

Thank you so much for the further investigation -- this is great additional context that I should add to those notes.

I do consider the ON CONFLICT ... WHERE false thing, while clever, to be a hack that I'd like to get rid of long-term. A warning (maybe being able to opt-in to making it an error) if a ON CONFLICT ... WHERE clause is passed in with a full index would probably get us most of the way there. There could also be some kind of basic relevance check for partial indexes, though the exact details may have to be worked out.

We do test all of our SQL flows against a real CockroachDB instance, so hopefully future changes in the semantics will be caught early.

sunshowers commented 8 months ago

Out of curiosity, do you use a SAT or SMT solver for implication proving?

mgartner commented 8 months ago

I've adjusted the title of this issue to be focused on a warning or guard against misuse with ON CONFLICT and partial indexes so that all the context in this thread remains. I've created a new issue to track the bug that allows invalid references in WHERE expressions: #119389.


Out of curiosity, do you use a SAT or SMT solver for implication proving?

No, we do something simpler. You can find more details here.

I have little knowledge of SAT and SMT solvers, but I believe that they are more robust than our implication proving and have greater time/space complexity. We can tolerate the occasional false negative—in the worst case this results in a suboptimal query plan, not incorrect results. Speed is critical because we must prove implication N = num_partial_indexes_in_referenced_tables * num_query_plan_transformations_that_could_use_a_partial_index times during optimization-time of a query.

sunshowers commented 8 months ago

Thanks for the reference! That was an interesting read.

While SAT is theoretically NP-complete, and many of the theories of SMT solvers are undecidable -- for typical queries, they've been heavily optimized and tend to be reasonably fast. They also have a notion of "resource limits" (abstract instruction limits), so you can timeout (and produce a negative) if a query takes too long.

This is based on the SMT solver I've used the most, Z3. On my machine (Ryzen 7950x, Linux 6.6.10), this simple script using Z3Py:

#!/usr/bin/env python

from z3 import *
import time

start = time.monotonic()

x = Int('x')
y = Int('y')

s = Solver()
s.add(Not(Implies(And(x > 5, y > 2), x > 10)))
r = s.check()
print(r)

print(s.statistics())

end = time.monotonic()
print("Time:", end - start)

shows:

% ./foo.py   
sat
(:elim-unconstrained  2
 :max-memory          29.21
 :memory              20.51
 :num-allocs          1057327
 :rlimit-count        313
 :sat-decisions       4
 :sat-mk-clause-2ary  2
 :sat-mk-var          4
 :solve-eqs-elim-vars 1
 :solve-eqs-steps     1
 :time                0.01)
Time: 0.016529261134564877

So that took around 16ms to execute (and some of that would just be Python overhead).


Anyway, this is out of scope for the immediate issue, just thought it might be interesting. And thanks again for being open to suggestions here, it's really appreciated.