
sql: transactions containing certain DDL statements are not atomic #42061

Open knz opened 4 years ago

knz commented 4 years ago

Related to #26508, #24785. Filing this in reaction to a recent thread with @mberhault

Today, CockroachDB accepts DDL statements inside explicit transactions (BEGIN...COMMIT), but processes them only after the non-DDL statements have been committed successfully.

If the DDL subsequently fails, the transaction ends up both partially committed and partially rolled back, i.e. an atomicity violation.

If/when this happens, it is currently reported to the client upon COMMIT via the special error code XXA00 (TransactionCommittedWithSchemaChangeFailure).
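
A minimal sketch of the failure mode, assuming illustrative table and column names (the pre-existing duplicates are what make the async part fail):

```sql
CREATE TABLE t (k INT PRIMARY KEY, v INT);
INSERT INTO t VALUES (1, 10), (2, 10);  -- duplicate values in v

BEGIN;
INSERT INTO t VALUES (3, 30);           -- DML: committed with the transaction
CREATE UNIQUE INDEX v_idx ON t (v);     -- DDL: index backfill runs after commit
COMMIT;
-- The INSERT is durably committed, but the index backfill hits the
-- duplicate rows and is rolled back; COMMIT reports SQLSTATE XXA00.
```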

There is a discussion to be had about whether this situation should be allowed to occur at all; until that is resolved, this issue exists so it can be linked from the source code and docs for explanatory purposes.

Jira issue: CRDB-5394

knz commented 4 years ago

Proposing a session flag strict_ddl_atomicity to control this in PR #42063
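
A sketch of how the flag would be used, assuming the semantics proposed in that PR (reject DDL in an explicit transaction when it cannot be processed atomically):

```sql
SET strict_ddl_atomicity = 'on';
-- with the flag enabled, DDL that would otherwise risk an XXA00
-- partial commit would be refused up front
```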

knz commented 4 years ago

Perhaps related (but unlikely): #42064

knz commented 4 years ago

I created an inventory of all the currently recognized DDL statements and how each is affected by this error:

| DDL | Sync part | Async part | Async may fail and yield XXA00? | Constraint-like DDL which can be VALIDATEd |
| --- | --- | --- | --- | --- |
| CREATE USER | insert row | none | N/A | N/A |
| CREATE ROLE | insert row | none | N/A | N/A |
| CREATE VIEW | create desc, non-public | publish desc | no | N/A |
| CREATE SEQUENCE | create desc, non-public | publish desc | no | N/A |
| CREATE TABLE, no constraints | create desc, non-public | publish desc | no | N/A |
| CREATE TABLE, with constraints | create desc, non-public + constraints immediately active | publish desc | no | N/A |
| CREATE TABLE AS (see note below) | create desc, non-public + insert | publish desc | no | N/A |
| CREATE INDEX (non-unique) | create desc, non-public | index backfill, then publish desc | no | yes (index active/inactive) |
| CREATE UNIQUE INDEX | create desc, non-public | index backfill, then publish desc | yes (duplicate rows) | yes (index active/inactive + UNIQUE constraint) |
| CREATE STATISTICS | define a "create stats" job | none | no | N/A |
| TRUNCATE TABLE | mark desc as dropped, create new desc | unpublish old desc, publish new desc | no | N/A |
| DROP USER | delete row | none | N/A | N/A |
| DROP ROLE | delete row | none | N/A | N/A |
| DROP DATABASE RESTRICT | delete row | none | N/A | N/A |
| DROP VIEW RESTRICT | unpublish | delete desc | no | N/A |
| DROP TABLE RESTRICT | unpublish | delete desc | no | N/A |
| DROP INDEX RESTRICT | unpublish | delete desc | no | N/A |
| DROP SEQUENCE RESTRICT | unpublish | delete desc | no | N/A |
| DROP VIEW CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
| DROP TABLE CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
| DROP INDEX CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
| DROP SEQUENCE CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
| DROP DATABASE CASCADE | delete db row + drop dependencies | dependency async | same as deps | N/A |
| ALTER TABLE RENAME | update desc in-RAM | publish updated desc | no | N/A |
| ALTER TABLE RENAME COLUMN | update desc in-RAM | publish updated desc | no | N/A |
| ALTER TABLE RENAME CONSTRAINT | update desc in-RAM | publish updated desc | no | N/A |
| ALTER TABLE DROP COLUMN | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE DROP CONSTRAINT | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE ALTER DROP DEFAULT | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE ALTER DROP NOT NULL | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE ALTER DROP STORED | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE ALTER TYPE (no conv) | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE AUDIT SET | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE ADD COLUMN NULL | create mutation, non-public | publish mutation | no | N/A |
| ALTER TABLE ADD CONSTRAINT NOT NULL NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
| ALTER TABLE ADD CONSTRAINT DEFAULT NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
| ALTER TABLE ADD CONSTRAINT UNIQUE NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
| ALTER TABLE ADD CONSTRAINT CHECK NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
| ALTER TABLE ADD CONSTRAINT FK NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
| ALTER TABLE ADD COLUMN DEFAULT | create mutation, non-public | column backfill, then publish desc | yes (scalar eval fails) | NO (but see ALTER ADD CONSTRAINT SET DEFAULT) |
| ALTER TABLE ADD CONSTRAINT NOT NULL | create mutation, non-public | column validation, then publish desc | yes (null row) | yes |
| ALTER TABLE ADD CONSTRAINT DEFAULT | create mutation, non-public | column backfill, then publish desc | yes (scalar eval fails) | yes |
| ALTER TABLE ADD CONSTRAINT UNIQUE | create mutation + index, non-public | index backfill, then publish desc | yes (dup row) | yes |
| ALTER TABLE ADD CONSTRAINT CHECK | create mutation, non-public | check validation, then publish desc | yes (check fails) | yes |
| ALTER TABLE ADD CONSTRAINT FK | create mutation + index, non-public | index backfill + FK validation, then publish desc | yes (unique error, row not found) | yes |
| ALTER TABLE ALTER SET DEFAULT | create mutation, non-public | column backfill, then publish desc | yes (scalar eval fails) | NO (!!!) |
| ALTER TABLE ALTER TYPE (w/ data conv) | create mutation, non-public | background data conversion, then publish desc | yes (conversion fails) | NO (!!!) |
| ALTER TABLE PARTITION BY | create mutation, non-public | perform partitioning, then publish desc | no (?) | N/A (?) |
| ALTER TABLE VALIDATE CONSTRAINT | run validation in-line | none | N/A | N/A |
| ALTER TABLE SPLIT AT | issue admin split command | none | N/A | N/A |
| ALTER TABLE RELOCATE | issue admin relocate command | none | N/A | N/A |
| ALTER TABLE UNSPLIT AT | issue admin unsplit command | none | N/A | N/A |
| ALTER TABLE UNSPLIT ALL | issue admin unsplit command | none | N/A | N/A |
| ALTER TABLE SCATTER | issue admin scatter command | none | N/A | N/A |
| ALTER TABLE INJECT STATISTICS | delete + insert row in system table | none | N/A | N/A |
| ALTER TABLE CONFIGURE ZONE | update row in system table | none | N/A | N/A |
| ALTER SEQUENCE | update desc in-RAM | publish updated desc | no | N/A |
| ALTER INDEX RENAME | update desc in-RAM | publish updated desc | no | N/A |
| ALTER INDEX SPLIT AT | issue admin split command | none | N/A | N/A |
| ALTER INDEX RELOCATE | issue admin relocate command | none | N/A | N/A |
| ALTER INDEX UNSPLIT AT | issue admin unsplit command | none | N/A | N/A |
| ALTER INDEX UNSPLIT ALL | issue admin unsplit command | none | N/A | N/A |
| ALTER INDEX SCATTER | issue admin scatter command | none | N/A | N/A |
| ALTER INDEX CONFIGURE ZONE | update row in system table | none | N/A | N/A |

Note about CREATE TABLE AS: this statement processes the insert asynchronously only when executed outside an explicit txn.

knz commented 4 years ago

@andy-kimball in a separate thread suggests that every "long-running" DDL should be considered split into a sync part and an async part, with the understanding that the async part may fail.

A key part of this proposal is that a client must be able to verify whether the async part has completed successfully.

When the complex DDL is related to a constraint, this is trivially possible, as both pg and crdb have a VALIDATE CONSTRAINT statement for this specific purpose. This can even be extended to CREATE UNIQUE INDEX, which can be understood as CREATE INDEX + ADD CONSTRAINT UNIQUE.
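
A sketch of that verification step (constraint name illustrative): after COMMIT returns, the client re-runs the validation explicitly and treats an error as "the async part failed".

```sql
ALTER TABLE t VALIDATE CONSTRAINT t_v_unique;
-- success: the backfill/validation completed
-- error: the constraint is still unvalidated and needs attention
```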

However, as per the table in my previous comment above, the following statements (the rows marked NO in the last column) do not have a clear way for clients to verify completion: ALTER TABLE ADD COLUMN DEFAULT, ALTER TABLE ALTER SET DEFAULT, and ALTER TABLE ALTER TYPE (w/ data conv).

knz commented 4 years ago

For reference here is what pg 12 has to say about ALTER TABLE SET TYPE, ADD DEFAULT and other ALTER forms that require a table rewrite: https://www.postgresql.org/docs/12/mvcc-caveats.html

Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE, are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the DDL command committed. This will only be an issue for a transaction that did not access the table in question before the DDL command started — any transaction that has done so would hold at least an ACCESS SHARE table lock, which would block the DDL command until that transaction completes. So these commands will not cause any apparent inconsistency in the table contents for successive queries on the target table, but they could cause visible inconsistency between the contents of the target table and other tables in the database.

knz commented 4 years ago

For reference here is what MSSQL has to say about online changes, including type changes and new columns: https://docs.microsoft.com/en-us/sql/t-sql/statements/alter-table-transact-sql?view=sql-server-ver15

Online alter column allows user created and autostatistics to reference the altered column for the duration of the ALTER COLUMN operation, which allows queries to run as usual. At the end of the operation, autostats that reference the column are dropped and user-created stats are invalidated. The user must manually update user-generated statistics after the operation is completed. If the column is part of a filter expression for any statistics or indexes then you can't perform an alter column operation. While the online alter column operation is running, all operations that could take a dependency on the column (index, views, and so on.) block or fail with an appropriate error. This behavior guarantees that online alter column won't fail because of dependencies introduced while the operation was running.

The language is a bit obfuscated but this says (in other words) that ALTER COLUMN SET TYPE really adds a new column to perform the change, then converts, then drops the original column. Until/unless the conversion completes, concurrent txns either continue to observe the original column (until the DDL-containing txn commits) or get suspended while waiting on a lock for the new column (after the DDL-containing txn logically commits).
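
For intuition, the dance described above corresponds roughly to this manual sequence (a sketch in plain SQL; names illustrative):

```sql
ALTER TABLE t ADD COLUMN v_new BIGINT;    -- new column, hidden from readers of v
UPDATE t SET v_new = CAST(v AS BIGINT);   -- the conversion; this is the step that can fail
ALTER TABLE t DROP COLUMN v;
ALTER TABLE t RENAME COLUMN v_new TO v;
```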

Similarly for new columns, the new column remains invisible until the backfill completes.

(It's not clear how clients are meant to observe whether the backfill completed successfully or not.)

knz commented 4 years ago

As per today's SIG notes, @lucy-zhang and @dt to review my analysis above (including the complete DDL table).

knz commented 4 years ago

Separately, regarding Andy's proposal to avoid an error in COMMIT even when the async DDL fails, @rafiss contributes:

it may not be clear to the user or schema migration tool that a successful COMMIT can leave the constraint in a NOT VALIDATED state. The risk to the user is that they continue adding data to the table, and they think it satisfies the constraint, but in fact more constraint violations pile up, making it even harder to fix the data and validate the constraint later (and that's if anyone ever realizes they need to do that). I don't really think it would be easy to clarify this with user education alone -- there are so many tools that exist already that assume a successful commit means that the constraints added in the transaction were applied and validated successfully. The fundamental issue is that, to me, using unvalidated constraints by default seems like an extremely significant departure from what users may expect, especially those expecting lots of Postgres compatibility.

Correction: Radu pointed out to me that newly inserted data will actually be validated against the constraint even when it is in the NOT VALIDATED state. But, still I have concerns about leaving the constraint NOT VALIDATED in a way that would be hard for a user to ever notice.
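
For what it's worth, one way a user could notice the state today is via the constraint listing (a sketch; the exact output shape may vary):

```sql
SHOW CONSTRAINTS FROM t;
-- the "validated" column distinguishes validated constraints from
-- those left NOT VALIDATED
```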

knz commented 4 years ago

Example Go ORM affected: http://github.com/gobuffalo/pop especially https://github.com/gobuffalo/pop/blob/master/migrator.go#L93

This will yield invalid ORM migration behavior if error XXA00 occurs.

bdarnell commented 4 years ago

The fundamental issue is that, to me, using unvalidated constraints by default seems like an extremely significant departure from what users may expect, especially those expecting lots of Postgres compatibility.

Yeah, this is the key, and suggests another path forward: require the NOT VALID modifier on all constraint additions in transactions, turning all such changes into explicit two-step operations to add the invalid constraint and then validate it.
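
The explicit two-step shape this would require (a sketch; constraint name and predicate illustrative):

```sql
BEGIN;
ALTER TABLE t ADD CONSTRAINT v_positive CHECK (v > 0) NOT VALID;  -- sync, atomic
COMMIT;

-- later, as a separate step outside the transaction:
ALTER TABLE t VALIDATE CONSTRAINT v_positive;
```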

knz commented 4 years ago

 suggests another path forward: require the NOT VALID modifier on all constraint additions in transactions

This works for constraints but does not help with ALTER SET TYPE and ADD COLUMN DEFAULT (and adding computed columns).

For DEFAULT columns, maybe we can engineer a restriction so that only scalar expressions guaranteed never to fail are allowed. That would be hard, and of very dubious UX. It would be even harder, or impossible, to engineer this type of restriction for computed columns.
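
To illustrate why such a guarantee is hard (a sketch; names illustrative): whether this backfill fails depends on the data, not on the expression alone.

```sql
-- fails during the async backfill if and only if some existing row has v = 0
ALTER TABLE t ADD COLUMN inv DECIMAL AS (1000 / v) STORED;
```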

And we're still left in the dark for SET TYPE.


thoszhang commented 4 years ago

Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/NOT VALID state is for a unique constraint, basically because a unique constraint is global (whereas check/FK constraints are not).

bdarnell commented 4 years ago

Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/NOT VALID state is for a unique constraint, basically because a unique constraint is global (whereas check/FK constraints are not).

In postgres, a CREATE UNIQUE INDEX CONCURRENTLY command can leave behind an INVALID index: https://www.postgresql.org/docs/9.1/sql-createindex.html. It is forbidden to use CREATE (UNIQUE) INDEX CONCURRENTLY in an explicit transaction.
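
For reference, such leftover indexes are visible in the postgres catalog via pg_index.indisvalid:

```sql
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
```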

thoszhang commented 4 years ago

So an INVALID index in Postgres can either be (1) dropped and recreated, or (2) "rebuilt", during which the table is offline/locked. Since we can't support (2), what would be the advantage of us keeping an INVALID index in the logical schema at all, if all a user can do is drop it? (Also, unlike Postgres, since we presumably wouldn't support REBUILD, there would also be no point in continuing to keep the index updated on new writes. (edit: Actually, there would be a "point" in that we'd check for uniqueness violations on some subset of the rows. Is that useful?))

I guess the other question is whether we've considered supporting DDL statements (or mixed DDL/DML transactions) that do require the table to be offline, in order to solve some of these problems.

edit 2: I've been assuming throughout that unique constraints are always backed by unique indexes, and uniqueness violations manifest as key collisions in the index; if I'm wrong, please correct me.

andy-kimball commented 4 years ago

I think it's still somewhat useful to keep checking for uniqueness violations on some subset. At least things are not getting worse. That said, I'm not as opinionated on what to do with the index once the backfill job has failed. More important is being very deliberate and explicit about separating long-running O(# rows) operations into a sync and an async phase, and then ensuring that the sync phase is fully transactional and predictable (and documenting that).

Unique constraints don't always need to be enforced via index key collisions. See https://github.com/cockroachdb/cockroach/issues/41535 for an example where they're not.

knz commented 4 years ago

Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/NOT VALID state is for a unique constraint, basically because a unique constraint is global (whereas check/FK constraints are not).

It seems to me that it's clear what is going on: if it's unvalidated, that means it's inactive and there can still be duplicate rows.

Also UNIQUE can be handled via the constraint system (the SQL standard and pg's dialect have all unique constraints receive a name in the constraint table, so we can attach state to it).

thoszhang commented 4 years ago

Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/NOT VALID state is for a unique constraint, basically because a unique constraint is global (whereas check/FK constraints are not).

It seems to me that it's clear what is going on: if it's unvalidated, that means it's inactive and there can still be duplicate rows.

But the guarantee provided by "unvalidated" for FK/check constraints is stronger: it means that the constraint is enforced for all writes going forward. This is not the case for unvalidated/invalid unique constraints with a partially built index, because we can't guarantee that a new row is actually globally unique in the table, even if we can guarantee that the row is unique across all rows updated since the constraint was added. This distinction seems important to me, but maybe it's unimportant for users in practice, as long as we flag the constraint as unvalidated/invalid.

That's a relatively minor point. I'm mostly wondering about how well this plan (having the transaction succeed if the synchronous portion of the schema change succeeds, and alerting failures during the async phase via another channel) will actually work for users' most common use cases, especially when they're using migration tools and ORMs. Will they need to manually monitor their migration to see if it succeeded?

knz commented 4 years ago

I'm mostly wondering about how well this plan (having the transaction succeed if the synchronous portion of the schema change succeeds, and alerting failures during the async phase via another channel) will actually work for users' most common use cases, especially when they're using migration tools and ORMs.

I am wondering about this too. I developed this idea above at the urging of @andy-kimball (who pointed out this is the choice made in MSSQL, which, in contrast to pg, was designed for online schema changes, and so is a good example to follow), but @rafiss pointed out rightly that we advertise pg compatibility and pg is certainly different, so the migration story would be difficult.

At this point I don't know that we have a clear preference yet. Ben and Tobias are floating technical approaches that keep pg compatibility, and I suspect either you or David will have some opinions on this too.

I think the next steps forward are to:

1. Document/clarify the current behavior for users.
2. (Optionally) move forward with #42063 if there's enough interest.
3. Develop a tech note / RFC with the various alternative ways forward that have been discussed so far, and weigh the pros and cons of each.

ajwerner commented 4 years ago

This lack of atomicity has come up in several conversations today so I figured I'd dump what I recall of those conversations. The proposal here was formulated during a conversation with @nvanbenschoten. The conversations focused on adding and removing columns in transactions, i.e. the DDL statements which alter the physical layout of the primary index.

Problem as I'm seeing it:

When dropping a column, the column is logically dropped immediately when the transaction which is performing the drop commits; the column goes from PUBLIC->WRITE_ONLY->DELETE_ONLY->DROPPED. The important thing to note is that the writing transaction uses its provisional view of the table descriptor. The following is fine:

root@127.179.54.55:40591/movr> CREATE TABLE foo (k STRING PRIMARY KEY, c1 INT, c2 INT);
root@127.179.54.55:40591/movr> BEGIN;
root@127.179.54.55:40591/movr  OPEN> ALTER TABLE foo DROP COLUMN c2;
ALTER TABLE

root@127.179.54.55:40591/movr  OPEN> SELECT * FROM foo;
  k | c1
+---+----+
(0 rows)

root@127.179.54.55:40591/movr  OPEN> ALTER TABLE foo DROP COLUMN c1;
ALTER TABLE

root@127.179.54.55:40591/movr  OPEN> SELECT * FROM foo;
  k  
+---+
(0 rows)

root@127.179.54.55:40591/movr  OPEN> COMMIT;

When the above transaction commits, foo is at version 2 and has pending mutations to drop the columns c1 and c2. One thing to note is that transactions which happen after the above transaction may still observe those columns due to schema leasing. Another thing to note is that between versions 1 and 2 of foo, the physical layout has not changed, but the logical layout has. When version 3 (DELETE_ONLY) comes into effect, the physical layout may differ from version 1. It's critical that by the time the physical layout begins to change, we know that no transactions are using version 1.

Now let's look at the ADD COLUMN case.

root@127.179.54.55:40591/movr> CREATE TABLE foo (k STRING PRIMARY KEY);
root@127.179.54.55:40591/movr> BEGIN;
root@127.179.54.55:40591/movr  OPEN> ALTER TABLE foo ADD COLUMN c1 INT;
ALTER TABLE

Time: 2.285371ms

root@127.179.54.55:40591/movr  OPEN> SELECT * FROM foo;
  k
+---+
(0 rows)

Time: 526.967µs

root@127.179.54.55:40591/movr  OPEN> ALTER TABLE foo ADD COLUMN c2 INT;                                                                                                                                                                                                                                                     
ALTER TABLE

Time: 3.795652ms

root@127.179.54.55:40591/movr  OPEN> SELECT * FROM foo;
  k
+---+
(0 rows)

Time: 516.892µs

root@127.179.54.55:40591/movr  OPEN> COMMIT;

root@127.179.54.55:40591/movr> SELECT * FROM foo;

   k   |  c1  |  c2
+------+------+------+
(0 rows)

How bizarre is that! In the DROP COLUMN case we observe the effect immediately, within the transaction. In the ADD COLUMN case we don't observe the effect during the transaction.

The reason for this is relatively straightforward: in order to observe and act on the effect of the ADD COLUMN we would need the physical layout of the table at the timestamp at which this transaction commits to differ from the physical layout of the table when the transaction began. In order for the schema change protocol to maintain correctness, it must be the case that for any given point in time (MVCC time) that a table has exactly one physical layout. We achieve this through the use of the invariant that at most there are two versions of a table leased at a time: the latest version and the preceding version. Between two versions of a table descriptor either the logical layout or the physical layout may change but never both. If both the logical and physical layout changed in one version step, lessees of the older version may not know how to interpret table data. Put differently, by the time a physical layout changes, all lessees of the table are guaranteed to know how to interpret the new physical layout.
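
To make the invariant concrete, here is the version ladder for a single ADD COLUMN (my condensation of the above; state names per the F1-style protocol):

```
foo@v1: c1 absent       -- old physical layout
foo@v2: c1 DELETE_ONLY  -- physical layout extended; logically invisible
foo@v3: c1 WRITE_ONLY   -- still logically invisible; writes maintain it
foo@v4: c1 PUBLIC       -- logical change only
```

At any instant lessees hold at most two adjacent versions, so every lessee can interpret the physical data it encounters.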

Goal

We'd like to permit the following which works in postgres:

postgres=# BEGIN;                                                                                                                                                                                                                                                                                                           
BEGIN
postgres=# ALTER TABLE foo ADD COLUMN c1 INT NOT NULL DEFAULT 1;                                                                                                                                                                                                                                                            
ALTER TABLE
postgres=# INSERT INTO foo VALUES ('a', 1);                                                                                                                                                                                                                                                                                 
INSERT 0 1
postgres=# SELECT * FROM foo;                                                                                                                                                                                                                                                                                               
 k | c1
---+----
 a |  1
(1 row)

postgres=# ALTER TABLE foo DROP COLUMN c1;                                                                                                                                                                                                                                                                                  
ALTER TABLE
postgres=# ALTER TABLE foo ADD COLUMN c2 INT NOT NULL DEFAULT 1;
ALTER TABLE
postgres=# INSERT INTO foo VALUES ('b', 2);                                                                                                                                                                                                                                                                                 
INSERT 0 1
postgres=# SELECT * FROM foo;
 k | c2
---+----
 a |  1
 b |  2
(2 rows)

postgres=# COMMIT;
COMMIT
postgres=# SELECT * FROM foo;
 k | c2
---+----
 a |  1
 b |  2
(2 rows)

Proposal

In order to directly utilize the schema changes made during a transaction, the table descriptor must, at the time the transaction initiating these schema changes commits, have a physical layout which includes the newly added columns. To achieve this, all transactions which hold a lease on the table being mutated must hold a table descriptor which allows them to properly interpret the new physical layout. It also must be the case that schema changes do not block foreground traffic. It is okay if ongoing transactions block schema changes (they already do today).

The idea in the proposal is that transactions which seek to alter the physical layout of tables will create new versions of the table descriptor in question using separate transactions. These provisional table descriptors will retain the logical layout of the table but will inform lessees of upcoming physical layout changes. In this way, when the transaction performing the schema change commits, it can commit the table descriptor in a state such that the newly added columns are in WRITE_ONLY and the newly dropped columns are in DELETE_ONLY. The transaction itself can treat the newly added columns as PUBLIC and the newly dropped columns as fully dropped as that step is merely a logical change relative to the WRITE_ONLY/DELETE_ONLY state that the table descriptor will actually be in upon commit.

Example with exposition

Let's walk through the above example to try to fill in some of the details of the proposal.

postgres=# BEGIN;                                                                                                                                                                                                                                                                                                           
BEGIN
postgres=# ALTER TABLE foo ADD COLUMN c1 INT NOT NULL DEFAULT 1;
ALTER TABLE

Ideally at this point the ongoing transaction could treat c1 as a PUBLIC column and could perform reads and writes against it. Let's say that at this moment foo is at version 1 (foo@v1). In this proposal, during the execution of the above statement we'd perform a single-step schema change in a separate transaction. This single-step schema change would add the new column in the DELETE_ONLY state. In the delete-only state nobody could write to that column but other transactions would know how to interpret the column if they were to come across it. This table descriptor foo@v2 would additionally contain a reference to the ongoing transaction (perhaps this implies that the transaction needs to already have written a transaction record or something, tbd). No future schema changes on the table descriptor from other transactions will be allowed while the ongoing transaction remains live. When the side transaction commits it will then wait for all leases on the old version to be dropped.

One thing to note is that the ongoing transaction cannot lay down an intent on the table descriptor as that would hold off the side transactions. It needs to maintain the view of where it wants to move the table descriptor in-memory and only attempt to write to the table descriptor (or read from it) upon commit.

Once the side schema change commits and has evidence that all leases are up-to-date, the statement execution can return. At this point the transaction will hold in its in-memory state a version of the table descriptor which has the column in WRITE_ONLY with a pending mutation to perform a backfill that it intends to write upon COMMIT. It will also have an in-memory version of the table descriptor which has the new column as PUBLIC which it will use for the execution of the rest of the transaction.

Okay, now let's proceed with the example:

postgres=# INSERT INTO foo VALUES ('a', 1);                                                                                                                                                                                                                                                                                 
INSERT 0 1
postgres=# SELECT * FROM foo;                                                                                                                                                                                                                                                                                               
 k | c1
---+----
 a |  1
(1 row)

Here we see the ongoing transaction using the PUBLIC view of the new column.

postgres=# ALTER TABLE foo DROP COLUMN c1;                                                                                                                                                                                                                                                                                  
ALTER TABLE

To execute the above statement we'll need to go through the analogous side-transaction dance for the drop: publish a provisional descriptor informing lessees of the upcoming removal, while the ongoing transaction treats c1 as already dropped.

postgres=# ALTER TABLE foo ADD COLUMN c2 INT NOT NULL DEFAULT 1;
ALTER TABLE

When we do another mutation we need to go through the same exact dance. All rows written during this ongoing transaction to the table being modified will need to be backfilled such that they have the physical layout of the table as described by the descriptor which will be committed. The backfill process determines which rows to rewrite based on their MVCC timestamp. Specifically it will only backfill rows which have an MVCC timestamp earlier than the ModificationTime (i.e. commit timestamp as of #40581) which moved the newly added column to WRITE_ONLY.

postgres=# COMMIT;
COMMIT

At this point we'll need to write the table descriptor to a new version which has the newly added column in WRITE_ONLY, along with the pending mutation to perform the backfill.

Summary

For each DDL statement performing a physical change to the layout of the primary index in a transaction (see the sketch after this list):

1. In a side transaction, perform a separate schema change to inform concurrent transactions of the upcoming physical change.
2. Ensure that all writes in the transaction use the physical layout implied for the table at the moment that it commits.
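
Applied to a single ADD COLUMN, the protocol would look roughly like this (my condensation of the walkthrough above; heavily hand-waved):

```
T:  BEGIN
T:  ALTER TABLE foo ADD COLUMN c1 ...
S1:   side txn publishes foo@v2 with c1 DELETE_ONLY, referencing T;
      commits, then waits for all leases on foo@v1 to drop
T:  continues, treating c1 as PUBLIC in its in-memory descriptor
T:  COMMIT -> writes foo@v3 with c1 WRITE_ONLY plus a pending backfill mutation
async: backfill rows older than the WRITE_ONLY ModificationTime, then publish c1 as PUBLIC
```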

cc @jordanlewis I think this proposal is less crazy than some of the other things that were discussed, such as changing the key encoding format to refer to the version for its physical layout. Such an encoding change would allow a table to meaningfully contain multiple physical layouts at a given MVCC timestamp, but it also comes with a bunch of other complexity.

ajwerner commented 4 years ago

One thing the above comment glossed over is dropping columns which existed prior to the transaction. Today, the table descriptor version created by a transaction which drops a column logically removes the column but does not physically remove it. It is critical that the "provisional" table descriptor which is published while the transaction is still running does not logically change the table. I suspect this will require adding a new state for a column in the table descriptor, something like DROP_PENDING.

Also want to note that the above comment is much more of a blueprint for a design than an actual design; there is a large amount of hand-waving.

knz commented 4 years ago

I like this exposition. It is at least quite inspiring.

Can you walk us through some example of what happens when the txn that contains the DDL gets aborted?

ajwerner commented 4 years ago

Can you walk us through some example of what happens when the txn that contains the DDL gets aborted?

In my current imagining of the proposal, the "provisional" table descriptor acts in many ways like an intent. These table descriptors differ from intents insofar as they are not cleared upon abort or commit but rather remain a part of the durable history of the table descriptor. When a transaction which wrote such a table descriptor rolls back, the coordinator would attempt to write a new table descriptor which reflects the state of the descriptor prior to the first provisional table descriptor of the transaction. Like intents, any actor can "remove" a provisional table descriptor (overwrite it with a new version which logically rolls back the provisional change) so long as it is known that the transaction which wrote that descriptor has been aborted.

Another thing to keep in mind is rolling back to savepoints. For each savepoint we'll want to maintain the state of provisional table descriptors so that we can detect (1) tables for which newer provisional table descriptors have been written since the savepoint and (2) tables for which provisional table descriptors were written only after the savepoint. For (1) we'll need to publish a new provisional table descriptor which sets the logical state of the table to where it was at the point of the savepoint and (2) to remove the provisional table descriptor. An alternative for (2) would be to mark a sequence number of the provisional table descriptor and allow other transactions to note that the provisional table descriptor is not in use by an active transaction and can be "removed".

knz commented 4 years ago

If that's all the same to you, I would prefer that the v1 of savepoints get merged and adopted so that the next person in charge of the schema change improvement does the savepoint logic and I'm not involved :smile_cat:

bdarnell commented 4 years ago

I also like this proposal.

As for savepoints, it would be great if they worked, but I also wouldn't block this work on supporting them - it would be fine if transactions containing DDLs can only be aborted in their entirety instead of rolling back to a savepoint.

bdarnell commented 4 years ago

This proposal makes our DDL (especially ADD COLUMN) more pg-compatible. But does it help with the non-atomicity problem that prompted this issue?

This discussion started with the addition of a unique constraint, not a column. It's one thing to add a new column to the descriptor in DELETE_ONLY state; this is semantically invisible. But to solve the atomicity problem, we'd need to start enforcing the constraint before the transaction that added it commits. That feels dirty (almost like READ UNCOMMITTED), although I think we can also look at it as just another kind of transaction conflict.

If that's semantically acceptable, is it practically implementable? Adding a column can proceed one step at a time: side transaction to introduce the column as DELETE_ONLY, then the transaction commit changes it to WRITE_ONLY and starts the backfill. If we're adding a unique index, I think we'd need to wait through the entire backfill before we could allow the ddl transaction to commit.

ajwerner commented 4 years ago

This discussion started with the addition of a unique constraint, not a column.

Noted. I agree that my comment in no way addresses that issue. The title and some of the content made this seem like the most relevant issue to which to attach the discussion. I don't have anything to add regarding the constraint atomicity problems.

I created https://github.com/cockroachdb/cockroach/issues/43057 as a separate issue.

leafji commented 4 years ago

@ajwerner I'm trying to understand your proposal, but I have several questions:

  1. Why not add the new column to the uncommitted descriptor directly? This method is also mentioned in #26508. Reference: "in order to observe and act on the effect of the ADD COLUMN we would need the physical layout of the table at the timestamp at which this transaction commits to differ from the physical layout of the table when the transaction began."
  2. What is the difference between the logical layout and the physical layout, especially inside a transaction?

Waiting for your explanation, thanks a lot.

ajwerner commented 4 years ago

CockroachDB schema changes are online, meaning they don't block concurrent operations. Nodes lease their view of the table, which allows them to make assumptions about the format of the data they will read. If a transaction were able to commit a row with a different format, then some other concurrent transaction would have its invariants violated. Hopefully that explanation sufficiently contextualizes the above proposal. There will be a detailed RFC on this very topic in the coming month.

This online schema change scheme is taken from the F1 paper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf. You can read the original design as it pertained to cockroach here: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151014_online_schema_change.md

The leasing protocol is described here: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151009_table_descriptor_lease.md