sql: multiple CTEs with mutations on same row can cause inconsistency

michae2 commented 3 years ago

In https://github.com/cockroachdb/cockroach/issues/44466 we found that an upsert statement modifying the same row twice could lead to inconsistencies, due to the upsert operator not reading its own writes. This was fixed in https://github.com/cockroachdb/cockroach/pull/45372 by checking that the input to the upsert is distinct on PKs. But now @florence-crl and @erikgrinaker have discovered another way to upsert the same row multiple times in a single statement: using CTEs.

For example:

CREATE TABLE a (i INT PRIMARY KEY, j INT, INDEX (j));
INSERT INTO a VALUES (0, 0);
WITH x AS (UPSERT INTO a VALUES (0, 1) RETURNING j), y AS (UPSERT INTO a VALUES (0, 2) RETURNING j) SELECT * FROM x;

Now, another execution of the last statement fails with a dupe key error (confusingly on a non-unique secondary index, but this is because the PK values are also the same, so it is a duplicate key from KV's perspective):

root@127.0.0.1:26257/defaultdb> WITH x AS (UPSERT INTO a VALUES (0, 1) RETURNING j), y AS (UPSERT INTO a VALUES (0, 2) RETURNING j) SELECT * FROM x;
ERROR: duplicate key value violates unique constraint "a_j_idx"
SQLSTATE: 23505
DETAIL: Key (j)=(1) already exists.
CONSTRAINT: a_j_idx

And furthermore we can see the inconsistency directly:

root@127.0.0.1:26257/defaultdb> SELECT i, j FROM a@primary;
  i | j
----+----
  0 | 2
(1 row)

Time: 2ms total (execution 2ms / network 0ms)

root@127.0.0.1:26257/defaultdb> SELECT i, j FROM a@a_j_idx;
  i | j
----+----
  0 | 1
  0 | 2
(2 rows)

Time: 2ms total (execution 1ms / network 0ms)

Jira issue: CRDB-10192

erikgrinaker commented 3 years ago

Now, another execution of the last statement fails with a dupe key error (confusingly on a non-unique secondary index, but this is because the PK values are also the same, so it is a duplicate key from KV's perspective)

I'm not sure if this is the explanation for the bogus error. I noticed that writing to secondary indexes will use e.g. InitPut and other conditional write operations. These operations will return ConditionFailedError whenever the condition is false (e.g. in the case of InitPut, when some value already exists for the key), but SQL will simply convert all ConditionFailedError into UniquenessConstraintViolationError regardless of what the actual cause of the error was:

https://github.com/cockroachdb/cockroach/blob/20c81ddc8684487a29f0ecb95636fb142f9a02a3/pkg/sql/row/errors.go#L59-L72

It seems more likely to me that this was e.g. caused by an InitPut encountering an existing key where it didn't expect one, or something similar. Or maybe it just got confused about which index was violated. May be worth looking at a trace to find out what's going on.

michae2 commented 3 years ago

Here's the KV trace:

INSERT INTO a VALUES (0, 0);

 querying next range at /NamespaceTable/30/1/50/29/"a"/4/1
 r26: sending batch 1 Get to (n1,s1):1
 CPut /Table/52/1/0/0 -> /TUPLE/2:2:Int/0
 InitPut /Table/52/2/0/0/0 -> /BYTES/
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 CPut, 1 EndTxn, 1 InitPut to (n1,s1):1
 fast path completed
 rows affected: 1

WITH x AS (UPSERT INTO a VALUES (0, 1) RETURNING j), y AS (UPSERT INTO a VALUES (0, 2) RETURNING j) SELECT * FROM x;

 Scan /Table/52/1/0/0
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 Get to (n1,s1):1
 fetched: /a/primary/0/j -> /0
 Put /Table/52/1/0/0 -> /TUPLE/2:2:Int/1
 Del /Table/52/2/0/0/0
 CPut /Table/52/2/1/0/0 -> /BYTES/ (expecting does not exist)
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 Put, 1 CPut, 1 Del to (n1,s1):1
 Scan /Table/52/1/0/0
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 Get, 1 QueryIntent to (n1,s1):1
 fetched: /a/primary/0/j -> /0                            <<<<<<<<<<<<< failing to read our own write
 Put /Table/52/1/0/0 -> /TUPLE/2:2:Int/2
 Del /Table/52/2/0/0/0                                    <<<<<<<<<<<<< should be 52/2/1/0/0
 CPut /Table/52/2/2/0/0 -> /BYTES/ (expecting does not exist)
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 Put, 1 CPut, 1 Del, 1 QueryIntent to (n1,s1):1
 rows affected: 1
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 EndTxn to (n1,s1):1

WITH x AS (UPSERT INTO a VALUES (0, 1) RETURNING j), y AS (UPSERT INTO a VALUES (0, 2) RETURNING j) SELECT * FROM x;

 Scan /Table/52/1/0/0
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 Get to (n1,s1):1
 fetched: /a/primary/0/j -> /2
 Put /Table/52/1/0/0 -> /TUPLE/2:2:Int/1
 Del /Table/52/2/2/0/0
 CPut /Table/52/2/1/0/0 -> /BYTES/ (expecting does not exist)         <<<<<<<<<<<<<< already exists
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 Put, 1 CPut, 1 Del to (n1,s1):1
 execution failed after 0 rows: duplicate key value violates unique constraint "a_j_idx"
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 EndTxn to (n1,s1):1

So I think you are right.

erikgrinaker commented 3 years ago

Thanks for checking! We may want to see if we can improve the error handling here while we're at it.

michae2 commented 3 years ago

@mgartner says this is related to a problem with INSERT ON CONFLICT DO UPDATE.

Edit: this issue

michae2 commented 3 years ago

This can also be caused with UPDATE / UPDATE:

CREATE TABLE a (i INT PRIMARY KEY, j INT, INDEX (j));
INSERT INTO a VALUES (0, 0);
WITH x AS (UPDATE a SET j = 1 WHERE i = 0 RETURNING j), y AS (UPDATE a SET j = 2 WHERE i = 0 RETURNING j) SELECT * FROM x;
SELECT i, j FROM a@primary;
SELECT i, j FROM a@a_j_idx;

as well as UPDATE / DELETE:

CREATE TABLE a (i INT PRIMARY KEY, j INT, INDEX (j));
INSERT INTO a VALUES (0, 0);
WITH x AS (UPDATE a SET j = 1 WHERE i = 0 RETURNING j), y AS (DELETE FROM a WHERE i = 0 RETURNING j) SELECT * FROM x;
SELECT i, j FROM a@primary;
SELECT i, j FROM a@a_j_idx;

and probably also INSERT ON CONFLICT and many other combinations thereof.

jordanlewis commented 3 years ago

I don't understand the mechanics of this error. Reading the KV trace that Michael sent, we see that the second Scan we send, which comes after the first subquery completes and has sent its Put batch to KV, reads the initial value for the primary key:

 Scan /Table/52/1/0/0
 querying next range at /Table/52/1/0/0
 r44: sending batch 1 Get, 1 QueryIntent to (n1,s1):1
 fetched: /a/primary/0/j -> /0                            <<<<<<<<<<<<< failing to read our own write

But why is this happening? Isn't it true that we've already executed the previous Puts in KV? I would expect at this point, we'd get back our own writes when we scanned them, just like we would if we opened a transaction, wrote a row, and then read the row back. What is different here?

I'm guessing I'm missing something obvious, but I don't really understand this. This anomaly is different from #45372 - in that case, it makes sense that the operator might want to cache information - it doesn't read its own writes, but that's because it doesn't try to by flushing its batches and re-reading, not because it can't... or so I thought, anyway.

jordanlewis commented 3 years ago

In other words, why does this query fail to read its own write:

jordan@free-tier7.gcp-us-central1.crdb.io:26257/tpcc> CREATE TABLE a (i INT PRIMARY KEY, j INT, INDEX (j));
jordan@free-tier7.gcp-us-central1.crdb.io:26257/tpcc> INSERT INTO a VALUES (0, 0);
jordan@free-tier7.gcp-us-central1.crdb.io:26257/tpcc> WITH x AS (UPDATE a SET j = 1 WHERE i = 0 RETURNING j) SELECT * FROM a;
  i | j
----+----
  0 | 0              -- OWN WRITE NOT READ! 0 0, not 0 1
(1 row)

But this query succeeds?

jordan@free-tier7.gcp-us-central1.crdb.io:26257/tpcc> BEGIN;
jordan@free-tier7.gcp-us-central1.crdb.io:26257/tpcc  OPEN> UPDATE a SET j = 2 WHERE i = 0 RETURNING j;
  j
-----
  2
(1 row)
jordan@free-tier7.gcp-us-central1.crdb.io:26257/tpcc  OPEN> SELECT * FROM a;
  i | j
----+----
  0 | 2         -- OWN WRITE READ - 0 2 correctly returned, not 0 1
(1 row)

jordanlewis commented 3 years ago

I see, according to these comments, it's expected that mutations can't read their own writes, and we've deliberately set things up so that this is maintained.

https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/conn_executor_exec.go#L640-L646

jordanlewis commented 3 years ago

Maybe we need a special read mode for scans that returns a special flag or condition if we see a write to a key at the same sequence number that we're reading at?

In the optimizer, mark CTE clauses that modify a table that's already been modified in the same statement with a special flag
In the optimizer and execbuilder, in each marked CTE clause, further mark any operators that scan the table that's going to get modified for a second time
In the execution engine, for each marked scan operator, emit the scans with the special read mode. Then, we can error out if we notice that the keys have been written to already (or ignore the rows somehow if we want to get fancy).

@knz this intersects closely with related prior work you've done on halloween problem and sequence numbers, so I am curious for your thoughts on this problem and also this proposed strategy.

michae2 commented 3 years ago

Maybe we need a special read mode for scans that returns a special flag or condition if we see a write to a key at the same sequence number that we're reading at?

I wonder if this would return false positives for something like:

CREATE TABLE b (m INT PRIMARY KEY, n INT);
INSERT INTO b VALUES (10, 1), (20, 1), (30, -1), (40, -1);
WITH x AS (UPDATE b SET n = n + 1 WHERE n > 0 RETURNING n), y AS (UPDATE b SET n = n - 1 WHERE n < 0 RETURNING n) SELECT * FROM x;

These two UPDATE parts should modify different rows, but they will both have to scan all rows, so the second update will scan rows modified by the first.

michae2 commented 3 years ago

Thinking along the same lines, what if we try to use a special mode to catch it during the modification instead of during the scan? Instead of Put into the primary index, we could use CPut with the value the scan returned. Would that work? Or another idea is to add a CDel that verifies we're deleting the kv we think we are.

knz commented 3 years ago

~The semantics in the SQL standard (and what is implemented in pg) on this is pretty clear: Mutation CTEs must be executed sequentially and the internal nested txn sequence number must be increased after each mutation CTE.~

(edit: I was wrong - see jordan's answer below)

knz commented 3 years ago

It's the process of increasing the seqnum tht makes subsequent CTEs / the main statement able to observe the previous writes.

knz commented 3 years ago

I'm also somewhat surprised -- when we implemented nested txns originally, I though we tested this case? At that time at least, I think this used to work transparently, because CTEs were executed as separate statements and the planner reset for each statement was in charge of increasing the seqnum.

jordanlewis commented 3 years ago

The semantics in the SQL standard (and what is implemented in pg) on this is pretty clear: Mutation CTEs must be executed sequentially and the internal nested txn sequence number must be increased after each mutation CTE.

This is what I thought too, but it's not the case. All CTEs in a single statement, in Postgres, observe a single snapshot of the database. They do not read their own writes. This is easily verifiable in Postgres:

postgres=# create table test (a int);
CREATE TABLE
postgres=# with x as (insert into test values(1)) select * from test;
 a
---
(0 rows)

Postgres also is documented this way (https://www.postgresql.org/docs/current/queries-with.html):

The sub-statements in WITH are executed concurrently with each other and with the main query. Therefore, when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable. All the statements are executed with the same snapshot (see Chapter 13), so they cannot “see” one another's effects on the target tables. This alleviates the effects of the unpredictability of the actual order of row updates, and means that RETURNING data is the only way to communicate changes between different WITH sub-statements and the main query. An example of this is that in

knz commented 3 years ago

oh wow TIL

knz commented 3 years ago

how does pg ensure consistency in this case?

jordanlewis commented 3 years ago

Thinking along the same lines, what if we try to use a special mode to catch it during the modification instead of during the scan? Instead of Put into the primary index, we could use CPut with the value the scan returned. Would that work? Or another idea is to add a CDel that verifies we're deleting the kv we think we are.

Yes, I think this would work, and it's a lot more straightforward than what I proposed. The main complication would be plumbing.

how does pg ensure consistency in this case?

Postgres uses MVCC information to ignore rows that have already been written in the same statement. I don't fully understand the details, but you can see the effects by searching for TM_Invisible and TM_SelfModified in their codebase. Here is a relevant snippet:

https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeModifyTable.c#L2024-L2040

knz commented 3 years ago

this looks like writing at the next seqnum and reading at the current one. Maybe we could implement this too.

jordanlewis commented 3 years ago

But I think the semantics that you are describing are the reason for the current anomaly. The issue is that if you read at seqnum X, and base your writes off of that data, but the data has been edited in seqnum X+1, you could be sending updates or dels that don't reflect reality - this situation is what causes the index corruption posted at the top of the issue.

knz commented 3 years ago

hm this will need some more thought on my side.

i do want to reflect on the solution you've proposed earlier:

In the execution engine, for each marked scan operator, emit the scans with the special read mode. Then, we can error out if we notice that the keys have been written to already (or ignore the rows somehow if we want to get fancy).

It's going to be particularly critical to ensure that any subsequent reader always executes on the gateway and does not get distributed. The KV layer does not support parallel operation using multiple nodes on a statement that interleaves reads and writes. (Unless @andreimatei can tell us this limitation has been lifted)

michae2 commented 3 years ago

Another version with DELETE / UPSERT, this one has the row missing from the secondary index:

CREATE TABLE a (i INT PRIMARY KEY, j INT, INDEX (j));
INSERT INTO a VALUES (0, 1);
WITH x AS (DELETE FROM a WHERE i = 0 RETURNING j), y AS (UPSERT INTO a VALUES (0, 1) RETURNING j) SELECT * FROM x;
SELECT i, j FROM a@primary;
SELECT i, j FROM a@a_j_idx;

andreimatei commented 3 years ago

In the execution engine, for each marked scan operator, emit the scans with the special read mode. Then, we can error out if we notice that the keys have been written to already (or ignore the rows somehow if we want to get fancy).

It's going to be particularly critical to ensure that any subsequent reader always executes on the gateway and does not get distributed. The KV layer does not support parallel operation using multiple nodes on a statement that interleaves reads and writes. (Unless @andreimatei can tell us this limitation has been lifted)

Right; writes cannot be distributed, and I think this implies that no part of a writing statement is distributed (because of how our planning works). FWIW, it's not clear to me that these distribution facts are relevant to whether or not a scan can check the seq nums of the same-txn writes that it sees.

michae2 commented 3 years ago

Thinking along the same lines, what if we try to use a special mode to catch it during the modification instead of during the scan? Instead of Put into the primary index, we could use CPut with the value the scan returned. Would that work? Or another idea is to add a CDel that verifies we're deleting the kv we think we are.

Yes, I think this would work, and it's a lot more straightforward than what I proposed. The main complication would be plumbing.

Oh, drat, I'm wrong, this wouldn't work. It's not enough to check that the values are what we expected, we can still run into the ABA problem. We actually need to check who wrote the current value.

knz commented 3 years ago

I still think we should understand what pg does for this.

In fact, I followed up on Jordan's suggestion above:

Postgres uses MVCC information to ignore rows that have already been written in the same statement. I don't fully understand the details, but you can see the effects by searching for TM_Invisible and TM_SelfModified in their codebase. Here is a relevant snippet: postgres/postgres@master/src/backend/executor/nodeModifyTable.c#L2024-L2040

I went to look there and discovered this: postgres returns an error if the same row is modified twice.

What would make it difficult to report an error in crdb for that case, instead of trying to invent new consistency semantics to perform the problematic kv operation?

In fact, I distinctly recall multiple iterations of a convo while we were designign UPSERT back two years ago, where we were pointing out that trying to be smarter than pg in that case was simply very difficult to implement correctly. It looks like we kind of forgot that conversation.

knz commented 3 years ago

Here's the historical convo:

https://docs.google.com/document/d/1EiHsFszQWuVkPPpw5YZns2lL4qdsWlicnm7Vht8e0E4/edit#heading=h.pb6dlf9aia4n

See "unanswered questions" near the top.

jordanlewis commented 3 years ago

I went to look there and discovered this: postgres returns an error if the same row is modified twice.

Not in the case of Common Table Expressions, in which case one of the mutations is accepted, and the others silently dropped. This is both documented and true from experimentation. But, the documentation also says that the results are "unpredictable" and "should be avoided", so presumably it might error in this case, and wouldn't be unreasonable for us to do so...

What would make it difficult to report an error in crdb for that case, instead of trying to invent new consistency semantics to perform the problematic kv operation?

Good question, what do you think? How could we use the seqnum abstractions to detect this case? Or would we have to do it from SQL?

knz commented 3 years ago

Not in the case of Common Table Expressions, in which case one of the mutations is accepted, and the others silently dropped.

Yes you've written this, but how does PG further ensures that the indexes remain consistent if some writes get dropped?

That's what I don't understand yet, and I think us understanding that will be key to deciding a sound approach.

ow could we use the seqnum abstractions to detect this case?

A flag on the KV batch to refuse a write if there's already an intent at the same seqnum. We'd set that flag for all SQL statements -- I can't see any current SQL statement that has legitimate business writing to the same key twice.

(we need a flag, instead of changing the default behavior, because we have other non-sql places in the code that rely on the ability to write to the same key twice at the same seqnum)

knz commented 3 years ago

Ok the idea to use a KV flag would not work in this case.

I had misread the trace and did not understand that we're seeing two separate KV InitPut side-by-side on different keys for the secondary index.

insert, initially:
1. primary put 0pk -> 0
2. secondary put 0 -> 0pk
upsert 1:
1. primary read 0pk -> 0
2. primary put 0pk -> 1
3. secondary del 0 -> 0pk
4. secondary put 1 -> 0pk
upsert 2:
1. primary read 0pk -> 0
2. primary put 0pk -> 2
3. secondary del 0 -> 0pk
4. secondary put 2 -> 0pk

The result:

primary and secondary del 0 is issued two times
two secondary puts for separate values (0,1) and (0,2) are issued side-by-side

I'm not sure we can do anything about the values (0,1) and (0,2), but I'm pretty sure we should not have idempotent value on Del: if the value has already been deleted once in the txn, we should not allow it to be deleted a 2nd time.

If the 2nd del (at step 6 or 10 above) got a KV error, we'd get the right behavior.

knz commented 3 years ago

@nvanbenschoten reminds us that we use the seqnums in two ways: reads read a the fixed seqnum at the start of the stmt, whereas differnet mutations inside the stmt increase the seqnum.

We do have already duplicate write detection inside KV at the same seqnum, however it's possible that it's not active here because the seqnum is increased for each mutation.

knz commented 3 years ago

@nvanbenschoten and I have formulated a similar idea:

knz: maybe the coondition to detect a duplicate kv write is not whether there's an intent at the same seqnum but rather whether there's one in the range [read seqnum, current write seqnum]

nathan: I was thinking along those lines as well. We could introduce some kind of API to reject writes that observe intents in the interval (read seqnum, current write seqnum) (notice the bounds). That would still preserve idempotency for a given mutation. we’d also need to ship the read seqnum on batches, which we don’t currently do.

knz commented 3 years ago

@jordanlewis how do you envision us to strategize the work on this?

nathan and I still think it would be worth doing a deep investigation into pg (probably using the sources) to understand exactly what happens there
if we need to do something about MVCC writes and the seqnums, this will need a cross-layer prototype at least to check the idea is sound and nothing else breaks

knz commented 3 years ago

More comments from Nathan

Part of what I mean by this is getting a deep enough understanding of PG to make the determination of whether this is necessary and sufficient and also what the base case of this should be and whether it should be enabled at all times. Are there ever legal reasons to touch the same row twice in a statement? Also, does the default case when write seqnum == read seqnum + 1 result in an empty “invalid seqnum interval”? Is it worth optimizing this case?

One thing that I liked about how you and Andrei originally worked on savepoints is that you started with a set of test cases. That would be equally useful here.

Also, does this obviate the need for duplicate row detection for upsert statements?

knz commented 3 years ago

For reference, this is an unexpected variant of #28842.

nvanbenschoten commented 3 years ago

We could introduce some kind of API

I'm wondering whether a new API exposed from KV to SQL is even needed. We already have the kv.SteppingMode API that SQL configures when it wants a consistent read snapshot across the execution of a single statement. If KV began rejecting all writes to the same key in the same SQL "step" when SteppingEnabled is set, would that be ok? Would this cause other problems? To this point, we've never been strict about not touching the same key twice when not strictly necessary during the execution of a single SQL statement.

RaduBerinde commented 3 years ago

This is the kind of thing I'd expect TLP or sqlsmith to be able to find. Perhaps we should make sqlsmith generate queries that check index consistency?

knz commented 3 years ago

Upon request by Nathan, I went to reverse engineer the pg sources.

My first finding somewhat invalidates the idea above to reject a 2nd write if it's in the interval [readSeqnum, writeSeqNum]. I found this around the same code that @jordanlewis has identified above.

We see for deletes:

                 * The target tuple was already updated or deleted by the
                 * current command, or by a later command in the current
                 * transaction.  The former case is possible in a join DELETE
                 * where multiple tuples join to the same target tuple. This
                 * is somewhat questionable, but Postgres has always allowed
                 * it: we just ignore additional deletion attempts.

and updates

                 * The target tuple was already updated or deleted by the
                 * current command, or by a later command in the current
                 * transaction.  The former case is possible in a join UPDATE
                 * where multiple tuples join to the same target tuple. This
                 * is pretty questionable, but Postgres has always allowed it:
                 * we just execute the first update action and ignore
                 * additional update attempts.

The code around this also teaches me how postgres does the detection of multiple MVCC writes.

For context, postgres associates a "command ID" (CID) to each "command", which approximately corresponds to statements (see below).

Then it stores, alongside each new tuple (== intent in our case), a high water mark of the CID that wrote the value.

A MVCC write is ignored without error if the CID of the current command is equal to the high water mark in the target tuple (intent). That is, the last write to the tuple was performed by the current command.

If the MVCC write finds another CID than the current command, the error "row already modified" is reported.

How is the CID assigned. There's at least one new CID for each "macro" step in a txn, so at least one per client statement. I am not yet able to find whether a CID is assigned per CTE. It's not clear there is one.

knz commented 3 years ago

So in the crdb situation, the (theoretical) behavior that would be equivalent to pg is this:

introduce a new "statement ID" alongside the write seqnum.
store that inside intents.
don't increment it automatically inside KV, unlike the write seqnum.
ignore duplicate KV ops at the same statement ID
have SQL populate the "statement ID" in KV requests
would be increased in-between statements.

However, before we actually do this, I'd like first to understand how pg prevents the problem above. If it ignores duplicate writes using the same CID, how does it prevent mistaken index tuples being added in the scenario above?

knz commented 3 years ago

notably, the behavior for INSERT ON CONFLICT is different from that for UPDATE and DELETE:

             * This is somewhat similar to the ExecUpdate() TM_SelfModified
             * case.  We do not want to proceed because it would lead to the
             * same row being updated a second time in some unspecified order,
             * and in contrast to plain UPDATEs there's no historical behavior
             * to break.
             *
             * It is the user's responsibility to prevent this situation from
             * occurring.  These problems are why SQL-2003 similarly specifies
             * that for SQL MERGE, an exception must be raised in the event of
             * an attempt to update the same row twice.

in that case, we get the error "ON CONFLICT DO UPDATE command cannot affect row a second time"

There is no check on the CID and no exception like for UPDATE and DELETE.

knz commented 3 years ago

HEre's an exception to the INSERT ON CONFLICT exception:

    /*
     * Note that it is possible that the target tuple has been modified in
     * this session, after the above table_tuple_lock. We choose to not error
     * out in that case, in line with ExecUpdate's treatment of similar cases.
     * This can happen if an UPDATE is triggered from within ExecQual(),
     * ExecWithCheckOptions() or ExecProject() above, e.g. by selecting from a
     * wCTE in the ON CONFLICT's SET.
     */

This comment clarifies how pg implements INSERT ON CONFLICT: by running the plain UPDATE code on the conflict row after it's been "locked".

Because it runs the same code as UPDATE, it is subject to the same transparent "no error if same CID" behavior described above. Therefore, it's possible for an ON CONFLICT clause to update a tuple that's been updated by another UPDATE in a CTE earlier, without error.

knz commented 3 years ago

I think I understand how pg prevents the index corruption:

postgres does not enable parallel executions when a statement contains mutations. So CTEs do not execute concurrently with each other, and do not interleave.
in delete and update, each row is processed in turn. The tuple modification and the corresponding index update are skipped if the "main" tuple is detected to have been written earlier by the same stmt.
in regular insert (NOT upsert / on conflict), also each row is processed in turn, and a regular conflict error is raised if the row already exists. The index update is skipped in that case.
in upsert (insert on conflict), each row is processed in turn, and a special check is implemented to ensure that an error is raised if the target row has already been modified inside the command (same CID). If the error is raised, the index update does not occur either.

knz commented 3 years ago

To come back to solution-making.

So in any case we'll need at least two modes, and have the mode set during SQL planning depending on statement type.

The idea nathan and I had earlier could still be applied to INSERT ON CONFLICT / UPSERT. However, it would not be applicable to DELETE and UPDATE.

For DELETE/UPDATE, we'd need two things:

a way to ask KV to return an extra bit of information in results when there's already an intent: whether the intent is for the same CID or not.
a way to "serialize" the decision to perform the KV Puts/Dels/CPuts for the secondary indexes as a function of the result of the Put/Del/CPut on the PK.

Functionally, we can achieve this by issueing the requests as two subsequent batches, and waiting for the result from the first before sending the second one. However this makes me sad, as it would make us lose the fast path.

So maybe we can have some kind of marker in the batch, "if this command encounters an intent at the same CID, ignore the remainder of the batch and return success". This would be a new type of MVCC operation.

michae2 commented 3 years ago

Here's one idea for a short-term mitigation: https://github.com/cockroachdb/cockroach/pull/70976

knz commented 3 years ago

Regarding my last comments. If we do something similar as PG it's possible we don't need to introduce a new concept of CID, and we can use the seqnums again. The condition "CID in intent == current CID" is equivalent to Nathan and my idea "read seqnum <= intent seqnum <= write seqnum"

knz commented 3 years ago

I worked a little today to find out what specific combination of SQL is affected by this bug.

The analysis goes as follows:

The problem exists when a statement issues a KV operation that does a blind write (Put, Del) and makes an assumption on the current state based on previous reads at the read seqnum, not taking into account other KV writes in the same SQL statement at a later seqnum.

What does this situation mean at the level of SQL?

In SQL, blind writes occur in DELETE, INSERT and in the index updates by UPDATE and UPSERT. Of these, only INSERT does not make an assumption on the current state, because it also issues a CPut for every row inserted.

All the other mutations are sensitive to prior state, and therefore all combinations thereof on shared indexes are affected. (It appears that two subsequent DELETEs cannot corrupt indexes, because they are idempotent on their writes, however the result of their RETURNING clauses would diverge from postgres because of the internal incorrect behavior.)

I am updating the title of this issue accordingly.

Here is an example using only UPDATE:

> create table t(x int primary key, y int, index(y));
> insert into t values (0,0);
> with 
  x as (update t set x = 1, y = 1 returning y), 
  y as (update t set x = 2, y = 2 returning y)
  select * from t;
  x | y
----+----
  0 | 0
(1 row)

>  select * from t@primary;
  x | y
----+----
  1 | 1
  2 | 2
(2 rows)

Oops! the two UPDATEs of 1 row created another row. This is because each UPDATE uses a blind Put on the PK, failing to see that another UPDATE operated on the same row before.

knz commented 3 years ago

Something we haven't done so far is look at the RETURNING clauses.

There's an "interesting" behavior in PostgreSQL that's an artifact of how pg creates query plans.

For example, the case with two UPDATEs in postgres:

kena=> create table t(x int primary key, y int); create index on t(y);
CREATE TABLE
CREATE INDEX

kena=> insert into t values (0,0);
INSERT 0 1

kena=> with
  a as (update t set x = 1, y = 1 returning y),
  b as (update t set x = 2, y = 2 returning y)
  table a union all table b;
 y
---
 1
(1 row)

kena=> select * from t;
 x | y
---+---
 1 | 1
(1 row)

Now, compare:

kena=> delete from t;
DELETE 1
kena=> insert into t values (0,0);
INSERT 0 1
kena=> with
  a as (update t set x = 1, y = 1 returning y),
  b as (update t set x = 2, y = 2 returning y)
  table t;
 x | y
---+---
 0 | 0
(1 row)

kena=> select * from t;
 x | y
---+---
 2 | 2
(1 row)

What's going on? The result seems to differ depending on the SELECT clause in the main part of the query?

So let's look at the 1st case. Even though both UPDATE statements read a snapshot, the 2nd UPDATE "sees" the changes by the first and doesn't do anything - we only get the first update effected.

How does it do this? As per my source code analysis above, pg looks at the MVCC result of the 2nd DEL (in the 2nd UPDATE). Because there was already a DEL inside the same stmt, the UPDATE of that row is short-circuited with no error (i.e. nothing happens - the subsequent PUT on PK and secondary indexes are not even issued).

In the 2nd case, the same happens, except that the 2nd UPDATE is executed first.

The reason for the re-ordering is that in the 1st case, the UNION ALL output creates a sequential dependency (scanning x, and then scanning y), which forces the 1st UPDATE in the syntax to also be executed 1st.

In the 2nd case, there is no UNION ALL and the two CTEs are not ordered. They are simply ordered in the linked list order in the plan, which happens to be back-to-front.

I am no sure how much of this behavior we'll want/need to replicate in crdb.

michae2 commented 2 years ago

The limitation here is that we have disallowed multiple modification subqueries of same table until this is fixed.

knz commented 2 years ago

I've found an interesting other problem, which happens even with the workaround in place.

This also needs to be documented as known limitation: if multiple mutations inside the same statement affect different tables with FK relations and ON CASCADE clauses between them, the results will be different between pg and crdb.

For example, take the following starting point:

=> create table t(x int primary key);
=> create table u(x int primary key references t(x) on delete cascade on update cascade, y int);
=> insert into t(x) values (1);
=> insert into u(x,y) values (1,2);

Now look what happens between crdb and pg.

pg:

=> with 
 a as (insert into u(x,y) values (1,1) on conflict(x) do update set y=excluded.y returning x), 
 b as (update t set x=2 where true returning x) 
table t;

 x
---
 1
(1 row)

=> table u;
 x | y
---+---
 2 | 1
(1 row)

Meanwhile, crdb:

=> with 
 a as (insert into u(x,y) values (1,1) on conflict(x) do update set y=excluded.y returning x), 
 b as (update t set x=2 where true returning x) 
table t;

  x
-----
  1
(1 row)
(error encountered after some results were delivered)
ERROR: insert on table "u" violates foreign key constraint "u_x_fkey"
SQLSTATE: 23503
DETAIL: Key (x)=(1) is not present in table "t".
CONSTRAINT: u_x_fkey

I'm not yet 100% sure that FK + ON CASCADE is not bypassing the workaround that Michael put in place. Could it be possible that FK + ON CASCADE could still cause corruption?

michae2 commented 2 years ago

Wow, good catch @knz! The guardrail I added does not take FK cascades into account.

I spent a while tonight trying to figure out if FK cascades could cause this kind of corruption. It looks like cascades always execute as postqueries, and they can observe the writes that came before them: https://github.com/cockroachdb/cockroach/blob/f032fe85e5ea4cc4c941a4ee8a704b3d2b1eae63/pkg/sql/distsql_running.go#L1478-L1488

So fortunately I don't think a cascade can cause this kind of corruption (we need the mutations to not see each other's writes).

FK checks also execute as postqueries, which I think explains the behavior you demonstrated. The insert on conflict queues up a FK check that t(x)=1, but does not execute it until after the update, and then it fails. Maybe this means postqueries aren't exactly the right time to execute FK checks? I'm curious how PG is handling this.

mgartner commented 2 years ago

Would relying on intents to detect previous writes work for multi-family tables? For example, consider the corruption in the example below:

statement ok
SET CLUSTER SETTING sql.multiple_modifications_of_table.enabled = true;

statement ok
CREATE TABLE t (
  k INT PRIMARY KEY,
  a INT,
  b INT,
  INDEX a_b_idx (a, b),
  FAMILY (k, a),
  FAMILY (b)
)

query T kvtrace
INSERT INTO t VALUES (333, 444, 555)
----
CPut /Table/108/1/333/0 -> /TUPLE/2:2:Int/444
CPut /Table/108/1/333/1/1 -> /INT/555
InitPut /Table/108/2/444/555/333/0 -> /BYTES/

query T kvtrace
WITH
  x AS (UPDATE t SET a = 666 RETURNING k),
  y AS (UPDATE t SET b = 777 RETURNING k)
SELECT * FROM t
----
Scan /Table/108/{1-2}
Put /Table/108/1/333/0 -> /TUPLE/2:2:Int/666
Del /Table/108/2/444/555/333/0
CPut /Table/108/2/666/555/333/0 -> /BYTES/ (expecting does not exist)
Scan /Table/108/{1-2}
Put /Table/108/1/333/1/1 -> /INT/777
Del /Table/108/2/444/555/333/0
CPut /Table/108/2/444/777/333/0 -> /BYTES/ (expecting does not exist)
Scan /Table/108/{1-2}

query III
SELECT * FROM t@primary
----
333  666  777

# NOTICE: a_b_idx has two keys instead of one, and neither are correct.
query II
SELECT a, b FROM t@a_b_idx
----
444  777
666  555

How could the Put /Table/108/1/333/1/1 detect an intent on the key Put /Table/108/1/333/0 to determine that the row had been written to previously?

I suppose the intent from the first Del /Table/108/2/444/555/333/0 could be seen by the second Del /Table/108/2/444/555/333/0?

cockroachdb / cockroach

sql: multiple CTEs with mutations on same row can cause inconsistency #70731