Open knz opened 5 years ago
Proposing a session flag strict_ddl_atomicity to control this in PR #42063
Perhaps related (but unlikely): #42064
I created an inventory of all the currently recognized DDL and how they are affected by this error:
DDL | Sync part | Async part | Async may fail and yield XXA00? | Constraint-like DDL which can be VALIDATEd |
---|---|---|---|---|
CREATE USER | insert row | none | N/A | N/A |
CREATE ROLE | insert row | none | N/A | N/A |
CREATE VIEW | create desc, non-public | publish desc | no | N/A |
CREATE SEQUENCE | create desc, non-public | publish desc | no | N/A |
CREATE TABLE, no constraints | create desc, non-public | publish desc | no | N/A |
CREATE TABLE, with constraints | create desc, non-public + constraints immediately active | publish desc | no | N/A |
CREATE TABLE AS (see note below) | create desc, non-public + insert | publish desc | no | N/A |
CREATE INDEX (non-unique) | create desc, non-public | index backfill, then publish desc | no | yes (index active/inactive) |
CREATE UNIQUE INDEX | create desc, non-public | index backfill, then publish desc | yes (duplicate rows) | yes (index active/inactive + UNIQUE constraint) |
CREATE STATISTICS | define a "create stats" job | none | no | N/A |
TRUNCATE TABLE | mark desc as dropped, create new desc | unpublish old desc, publish new desc | no | N/A |
DROP USER | delete row | none | N/A | N/A |
DROP ROLE | delete row | none | N/A | N/A |
DROP DATABASE RESTRICT | delete row | none | N/A | N/A |
DROP VIEW RESTRICT | unpublish | delete desc | no | N/A |
DROP TABLE RESTRICT | unpublish | delete desc | no | N/A |
DROP INDEX RESTRICT | unpublish | delete desc | no | N/A |
DROP SEQUENCE RESTRICT | unpublish | delete desc | no | N/A |
DROP VIEW CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
DROP TABLE CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
DROP INDEX CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
DROP SEQUENCE CASCADE | unpublish + drop dependencies | delete desc + dependency async | same as deps | N/A |
DROP DATABASE CASCADE | delete db row + drop dependencies | dependency async | same as deps | N/A |
ALTER TABLE RENAME | update desc in-RAM | publish updated desc | no | N/A |
ALTER TABLE RENAME COLUMN | update desc in-RAM | publish updated desc | no | N/A |
ALTER TABLE RENAME CONSTRAINT | update desc in-RAM | publish updated desc | no | N/A |
ALTER TABLE DROP COLUMN | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE DROP CONSTRAINT | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE ALTER DROP DEFAULT | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE ALTER DROP NOT NULL | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE ALTER DROP STORED | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE ALTER TYPE (no conv) | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE AUDIT SET | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE ADD COLUMN NULL | create mutation, non-public | publish mutation | no | N/A |
ALTER TABLE ADD CONSTRAINT NOT NULL NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
ALTER TABLE ADD CONSTRAINT DEFAULT NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
ALTER TABLE ADD CONSTRAINT UNIQUE NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
ALTER TABLE ADD CONSTRAINT CHECK NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
ALTER TABLE ADD CONSTRAINT FK NOT VALID | create non-public mutation, mark constraint as invalid | publish desc | no | N/A |
ALTER TABLE ADD COLUMN DEFAULT | create mutation, non-public | column backfill, then publish desc | yes (scalar eval fails) | NO (but see ALTER ADD CONSTRAINT SET DEFAULT) |
ALTER TABLE ADD CONSTRAINT NOT NULL | create mutation, non-public | column validation, then publish desc | yes (null row) | yes |
ALTER TABLE ADD CONSTRAINT DEFAULT | create mutation, non-public | column backfill, then publish desc | yes (scalar eval fails) | yes |
ALTER TABLE ADD CONSTRAINT UNIQUE | create mutation + index, non-public | index backfill, then publish desc | yes (dup row) | yes |
ALTER TABLE ADD CONSTRAINT CHECK | create mutation, non-public | check validation, then publish desc | yes (check fails) | yes |
ALTER TABLE ADD CONSTRAINT FK | create mutation + index, non-public | index backfill + FK validation, then publish desc | yes (unique error, row not found) | yes |
ALTER TABLE ALTER SET DEFAULT | create mutation, non-public | column backfill, then publish desc | yes (scalar eval fails) | NO (!!!) |
ALTER TABLE ALTER TYPE (w/ data conv) | create mutation, non-public | background data conversion, then publish desc | yes (conversion fails) | NO (!!!) |
ALTER TABLE PARTITION BY | create mutation, non-public | perform partitioning, then publish desc | no (?) | N/A (?) |
ALTER TABLE VALIDATE CONSTRAINT | run validation in-line | none | N/A | N/A |
ALTER TABLE SPLIT AT | issue admin split command | none | N/A | N/A |
ALTER TABLE RELOCATE | issue admin relocate command | none | N/A | N/A |
ALTER TABLE UNSPLIT AT | issue admin unsplit command | none | N/A | N/A |
ALTER TABLE UNSPLIT ALL | issue admin unsplit command | none | N/A | N/A |
ALTER TABLE SCATTER | issue admin scatter command | none | N/A | N/A |
ALTER TABLE INJECT STATISTICS | delete + insert row in system table | none | N/A | N/A |
ALTER TABLE CONFIGURE ZONE | update row in system table | none | N/A | N/A |
ALTER SEQUENCE | update desc in-RAM | publish updated desc | no | N/A |
ALTER INDEX RENAME | update desc in-RAM | publish updated desc | no | N/A |
ALTER INDEX SPLIT AT | issue admin split command | none | N/A | N/A |
ALTER INDEX RELOCATE | issue admin relocate command | none | N/A | N/A |
ALTER INDEX UNSPLIT AT | issue admin unsplit command | none | N/A | N/A |
ALTER INDEX UNSPLIT ALL | issue admin unsplit command | none | N/A | N/A |
ALTER INDEX SCATTER | issue admin scatter command | none | N/A | N/A |
ALTER INDEX CONFIGURE ZONE | update row in system table | none | N/A | N/A |
Note about CREATE TABLE AS: this statement processes the insert asynchronously only when executed outside an explicit txn.
@andy-kimball in a separate thread suggests that every "long running" DDL should be considered as split between a sync part and an async part, with the understanding that the async part may fail.
A key part of this proposal is that a client must be able to verify whether the async part has completed successfully.
When the complex DDL is related to a constraint, this is trivially possible, as both pg and crdb have a VALIDATE CONSTRAINT statement for this specific purpose. This can be even extended to CREATE UNIQUE INDEX, which can be understood as CREATE INDEX + ADD CONSTRAINT UNIQUE.
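The two-step constraint lifecycle can be illustrated with a small Python model (a toy for illustration only, not CockroachDB code; all class and method names here are invented): an unvalidated ("NOT VALID") constraint is enforced against new writes immediately, while validation of pre-existing rows is a separate, retryable step, mirroring VALIDATE CONSTRAINT.

```python
# Toy model of ADD CONSTRAINT ... NOT VALID followed by VALIDATE CONSTRAINT.
# Names are hypothetical; this only sketches the semantics discussed above.

class CheckConstraint:
    def __init__(self, predicate):
        self.predicate = predicate
        self.validated = False  # starts in the NOT VALID state

class Table:
    def __init__(self, rows):
        self.rows = list(rows)
        self.constraints = []

    def add_constraint_not_valid(self, c):
        # Sync part: the constraint is active for new writes right away.
        self.constraints.append(c)

    def insert(self, row):
        for c in self.constraints:
            if not c.predicate(row):
                raise ValueError("constraint violated by new write")
        self.rows.append(row)

    def validate(self, c):
        # VALIDATE CONSTRAINT: scan existing rows; fail without data changes.
        if not all(c.predicate(r) for r in self.rows):
            raise ValueError("existing rows violate constraint")
        c.validated = True

t = Table(rows=[-1, 2, 3])            # -1 pre-dates the constraint
c = CheckConstraint(lambda x: x > 0)  # models CHECK (x > 0)
t.add_constraint_not_valid(c)
```

With this model, a new write of `-2` is rejected even before validation, while `validate` fails until the pre-existing `-1` is fixed, which is exactly the split between the sync and async guarantees discussed above.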
However as per the table in my previous comment above, the following statements do not have a clear way for clients to control completion:
For reference here is what pg 12 has to say about ALTER TABLE SET TYPE, ADD DEFAULT and other ALTER forms that require a table rewrite: https://www.postgresql.org/docs/12/mvcc-caveats.html
> Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE, are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the DDL command committed. This will only be an issue for a transaction that did not access the table in question before the DDL command started — any transaction that has done so would hold at least an ACCESS SHARE table lock, which would block the DDL command until that transaction completes. So these commands will not cause any apparent inconsistency in the table contents for successive queries on the target table, but they could cause visible inconsistency between the contents of the target table and other tables in the database.
For reference here is what MSSQL has to say about online changes, including type changes and new columns: https://docs.microsoft.com/en-us/sql/t-sql/statements/alter-table-transact-sql?view=sql-server-ver15
> Online alter column allows user created and autostatistics to reference the altered column for the duration of the ALTER COLUMN operation, which allows queries to run as usual. At the end of the operation, autostats that reference the column are dropped and user-created stats are invalidated. The user must manually update user-generated statistics after the operation is completed. If the column is part of a filter expression for any statistics or indexes then you can't perform an alter column operation. While the online alter column operation is running, all operations that could take a dependency on the column (index, views, and so on.) block or fail with an appropriate error. This behavior guarantees that online alter column won't fail because of dependencies introduced while the operation was running.
The language is a bit obfuscated but this says (in other words) that ALTER COLUMN SET TYPE really adds a new column to perform the change, then converts, then drops the original column. Until/unless the conversion completes, concurrent txns either continue to observe the original column (until the DDL-containing txn commits) or get suspended while waiting on a lock for the new column (after the DDL-containing txn logically commits).
Similarly for new columns, the new column remains invisible until the backfill completes.
(It's not clear how clients are meant to observe whether the backfill completed successfully or not.)
As per today's SIG notes, @lucy-zhang and @dt to review my analysis above (including the complete DDL table).
Separately, regarding Andy's proposal to avoid an error in COMMIT even when the async DDL fails, @rafiss contributes:
> it may not be clear to the user or schema migration tool that a successful COMMIT can leave the constraint in a NOT VALIDATED state. The risk to the user is that they continue adding data to the table, and they think it satisfies the constraint, but in fact more constraint violations pile up, making it even harder to fix the data and validate the constraint later (and that's if anyone ever realizes they need to do that). I don't really think it would be easy to clarify this with user education alone -- there are so many tools that exist already that assume a successful commit means that the constraints added in the transaction were applied and validated successfully. The fundamental issue is that, to me, using unvalidated constraints by default seems like an extremely significant departure from what users may expect, especially those expecting lots of Postgres compatibility.
>
> Correction: Radu pointed out to me that newly inserted data will actually be validated against the constraint even when it is in the NOT VALIDATED state. But, still I have concerns about leaving the constraint NOT VALIDATED in a way that would be hard for a user to ever notice.
Example Go ORM affected: http://github.com/gobuffalo/pop especially https://github.com/gobuffalo/pop/blob/master/migrator.go#L93
Will yield invalid ORM migration behavior if error XXA00 occurs.
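A migration tool that wants to handle this correctly has to distinguish three COMMIT outcomes, not two. A hedged sketch of that classification in Python (the helper and exception names here are hypothetical; only the SQLSTATE XXA00 comes from this issue):

```python
# Sketch of COMMIT handling for a migration tool: SQLSTATE XXA00
# (TransactionCommittedWithSchemaChangeFailure) means "committed, but the
# async schema-change part failed" -- neither a plain success nor a rollback.
# All names below are illustrative, not a real driver API.

SCHEMA_CHANGE_FAILURE = "XXA00"

class CommitError(Exception):
    def __init__(self, sqlstate, message):
        super().__init__(message)
        self.sqlstate = sqlstate

def classify_commit(err):
    """Return 'committed', 'committed-with-failed-ddl', or 'aborted'."""
    if err is None:
        return "committed"
    if getattr(err, "sqlstate", None) == SCHEMA_CHANGE_FAILURE:
        # The DML committed but e.g. a backfill or validation failed;
        # blindly retrying the whole migration here would be wrong.
        return "committed-with-failed-ddl"
    return "aborted"
```

A tool like the pop migrator linked above assumes only the first and last outcomes exist, which is exactly why XXA00 yields invalid migration behavior.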
> The fundamental issue is that, to me, using unvalidated constraints by default seems like an extremely significant departure from what users may expect, especially those expecting lots of Postgres compatibility.

Yeah, this is the key, and suggests another path forward: require the NOT VALID modifier on all constraint additions in transactions, turning all such changes into explicit two-step operations to add the invalid constraint and then validate it.
> suggests another path forward: require the NOT VALID modifier on all constraint additions in transactions
This works for constraints but does not help with ALTER SET TYPE and ADD COLUMN DEFAULT (and add computed column).
For DEFAULT columns maybe we can engineer a restriction that only scalar expressions that are guaranteed to never fail are allowed. That'd be hard and of very dubious UX. Also it would be much harder / impossible to engineer this type of restriction for computed columns.
And we're still left in the blue for SET TYPE.
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/NOT VALID state is for a unique constraint, basically because a unique constraint is global (whereas check/FK constraints are not).
In postgres, a CREATE UNIQUE INDEX CONCURRENTLY command can leave behind an INVALID index: https://www.postgresql.org/docs/9.1/sql-createindex.html. It is forbidden to use CREATE (UNIQUE) INDEX CONCURRENTLY in an explicit transaction.
So an INVALID index in Postgres can either be (1) dropped and recreated, or (2) "rebuilt", during which the table is offline/locked. Since we can't support (2), what would be the advantage of us keeping an INVALID index in the logical schema at all, if all a user can do is drop it? (Also, unlike Postgres, since we presumably wouldn't support REBUILD, there would also be no point in continuing to keep the index updated on new writes. (edit: Actually, there would be a "point" in that we'd check for uniqueness violations on some subset of the rows. Is that useful?))
I guess the other question is whether we've considered supporting DDL statements (or mixed DDL/DML transactions) that do require the table to be offline, in order to solve some of these problems.
edit 2: I've been assuming throughout that unique constraints are always backed by unique indexes, and uniqueness violations manifest as key collisions in the index; if I'm wrong, please correct me.
I think it's still somewhat useful to keep checking for uniqueness violations on some subset. At least things are not getting worse. That said, I'm not as opinionated on what to do with the index once the backfill job has failed. More important is being very deliberate and explicit about separating long-running O(# rows) operations into a sync and an async phase, and then ensuring that the sync phase is fully transactional and predictable (and documenting that).
Unique constraints don't always need to be enforced via index key collisions. See https://github.com/cockroachdb/cockroach/issues/41535 for an example where they're not.
> Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/NOT VALID state is for a unique constraint, basically because a unique constraint is global (whereas check/FK constraints are not).
It seems to me that it's clear what is going on: if it's unvalidated, that means it's inactive and there can still be duplicate rows.
Also UNIQUE can be handled via the constraint system (the SQL standard and pg's dialect have all unique constraints receive a name in the constraint table, so we can attach state to it).
> Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/NOT VALID state is for a unique constraint, basically because a unique constraint is global (whereas check/FK constraints are not).
>
> It seems to me that it's clear what is going on: if it's unvalidated, that means it's inactive and there can still be duplicate rows.
But the guarantee provided by "unvalidated" for FK/check constraints is stronger: it means that the constraint is enforced for all writes going forward. This is not the case for unvalidated/invalid unique constraints with a partially built index, because we can't guarantee that a new row is actually globally unique in the table, even if we can guarantee that the row is unique across all rows updated since the constraint was added. This distinction seems important to me, but maybe it's unimportant for users in practice, as long as we flag the constraint as unvalidated/invalid.
That's a relatively minor point. I'm mostly wondering about how well this plan (having the transaction succeed if the synchronous portion of the schema change succeeds, and alerting failures during the async phase via another channel) will actually work for users' most common use cases, especially when they're using migration tools and ORMs. Will they need to manually monitor their migration to see if it succeeded?
> I'm mostly wondering about how well this plan (having the transaction succeed if the synchronous portion of the schema change succeeds, and alerting failures during the async phase via another channel) will actually work for users' most common use cases, especially when they're using migration tools and ORMs.
I am wondering about this too. I have developed this idea above at the urging of @andy-kimball (who outlined that this is the choice made in MSSQL, which, in contrast to pg, was designed for online schema changes, and so is a good example to follow), but @rafiss pointed out rightly that we advertise pg compat and pg is certainly different, so the migration story would be difficult.
At this point I don't know that we have a clear preference yet. Ben and Tobias are floating technical approaches that keep pg compatibility, and I suspect either yourself or David will have some opinions on this too.
I think the next step forward now is to:
1. document/clarify the current behavior for users
2. (optionally) move forward with #42063 if there's enough interest
3. develop a tech note / RFC with the various alternative ways forward that have been discussed so far, and weigh the pros and cons for each.
This lack of atomicity has come up in several conversations today so I figured I'd dump what I recall of those conversations. The proposal here was formulated during a conversation with @nvanbenschoten. The conversations focused on adding and removing columns in transactions, the DDLs which alter the physical layout of the primary index.
When dropping a column, the column is logically dropped immediately when the transaction which is performing the drop commits; the column goes from PUBLIC -> WRITE_ONLY -> DELETE_ONLY -> DROPPED. The important thing to note is that the writing transaction uses its provisional view of the table descriptor. The following is fine:
root@127.179.54.55:40591/movr> CREATE TABLE foo (k STRING PRIMARY KEY, c1 INT, c2 INT);
root@127.179.54.55:40591/movr> BEGIN;
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo DROP COLUMN c2;
ALTER TABLE
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k | c1
+---+----+
(0 rows)
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo DROP COLUMN c1;
ALTER TABLE
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k
+---+
(0 rows)
root@127.179.54.55:40591/movr OPEN> COMMIT;
When the above transaction commits, foo is at version 2 and has pending mutations to drop the columns c1 and c2. One thing to note is that transactions which happen after the above transaction may still observe those columns due to schema leasing. Another thing to note is that between versions 1 and 2 of foo, the physical layout has not changed, only the logical layout has. When version 3 (DELETE_ONLY) comes into effect, the physical layout may differ from version 1. It's critical that by the time the physical layout begins to change we know that no transactions are in version 1.
Now let's look at the ADD COLUMN case.
root@127.179.54.55:40591/movr> CREATE TABLE foo (k STRING PRIMARY KEY);
root@127.179.54.55:40591/movr> BEGIN;
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo ADD COLUMN c1 INT;
ALTER TABLE
Time: 2.285371ms
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k
+---+
(0 rows)
Time: 526.967µs
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo ADD COLUMN c2 INT;
ALTER TABLE
Time: 3.795652ms
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k
+---+
(0 rows)
Time: 516.892µs
root@127.179.54.55:40591/movr OPEN> COMMIT;
root@127.179.54.55:40591/movr> SELECT * FROM foo;
k | c1 | c2
+------+------+------+
(0 rows)
How bizarre is that! In the DROP COLUMN case we observe the effect immediately, within the transaction. In the ADD COLUMN case we don't observe the effect during the transaction.
The reason for this is relatively straightforward: in order to observe and act on the effect of the ADD COLUMN, we would need the physical layout of the table at the timestamp at which this transaction commits to differ from the physical layout of the table when the transaction began. In order for the schema change protocol to maintain correctness, it must be the case that at any given point in (MVCC) time a table has exactly one physical layout. We achieve this through the use of the invariant that there are at most two versions of a table leased at a time: the latest version and the preceding version. Between two versions of a table descriptor either the logical layout or the physical layout may change but never both. If both the logical and physical layout changed in one version step, lessees of the older version may not know how to interpret table data. Put differently, by the time a physical layout changes, all lessees of the table are guaranteed to know how to interpret the new physical layout.
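The invariant in the last paragraph can be stated as a tiny executable check. This is a toy model, not the real descriptor code; the dict layout and "fingerprints" are invented purely to make the rule concrete:

```python
# Toy check of the two-version invariant discussed above: between two
# consecutive descriptor versions, the logical layout OR the physical
# layout may change, but never both in the same step.

def legal_version_step(old, new):
    """old/new are dicts with 'logical' and 'physical' layout fingerprints."""
    logical_changed = old["logical"] != new["logical"]
    physical_changed = old["physical"] != new["physical"]
    return not (logical_changed and physical_changed)

# Dropping c2 as two steps: first logically (v1 -> v2), then physically
# (v2 -> v3). Doing both at once (v1 -> bad) violates the invariant.
v1 = {"logical": ("k", "c1", "c2"), "physical": ("k", "c1", "c2")}
v2 = {"logical": ("k", "c1"), "physical": ("k", "c1", "c2")}
v3 = {"logical": ("k", "c1"), "physical": ("k", "c1")}
bad = {"logical": ("k", "c1"), "physical": ("k", "c1")}  # both changed vs v1
```

The DROP COLUMN walkthrough above is exactly the v1 -> v2 -> v3 path: the committing transaction publishes the logical change, and only a later version changes the physical layout.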
We'd like to permit the following which works in postgres:
postgres=# BEGIN;
BEGIN
postgres=# ALTER TABLE foo ADD COLUMN c1 INT NOT NULL DEFAULT 1;
ALTER TABLE
postgres=# INSERT INTO foo VALUES ('a', 1);
INSERT 0 1
postgres=# SELECT * FROM foo;
k | c1
---+----
a | 1
(1 row)
postgres=# ALTER TABLE foo DROP COLUMN c1;
ALTER TABLE
postgres=# ALTER TABLE foo ADD COLUMN c2 INT NOT NULL DEFAULT 1;
ALTER TABLE
postgres=# INSERT INTO foo VALUES ('b', 2);
INSERT 0 1
postgres=# SELECT * FROM foo;
k | c2
---+----
a | 1
b | 2
(2 rows)
postgres=# COMMIT;
COMMIT
postgres=# SELECT * FROM foo;
k | c2
---+----
a | 1
b | 2
(2 rows)
In order to directly utilize the schema changes made during a transaction, it must be the case that at the time the transaction initiating these schema changes commits, the table descriptor must have a physical layout which includes newly added columns. In order to achieve the above condition, we know that it must be the case that all transactions which hold a lease on the table being mutated must hold a table descriptor which allows those transactions to properly interpret the new physical layout. It also must be the case that schema changes do not block foreground traffic. It is okay if ongoing transactions block schema changes (they already do today).
The idea in the proposal is that transactions which seek to alter the physical layout of tables will create new versions of the table descriptor in question using separate transactions. These provisional table descriptors will retain the logical layout of the table but will inform lessees of upcoming physical layout changes. In this way, when the transaction performing the schema change commits, it can commit the table descriptor in a state such that the newly added columns are in WRITE_ONLY and the newly dropped columns are in DELETE_ONLY. The transaction itself can treat the newly added columns as PUBLIC and the newly dropped columns as fully dropped, as that step is merely a logical change relative to the WRITE_ONLY/DELETE_ONLY state that the table descriptor will actually be in upon commit.
Let's walk through the above example to try to fill in some of the details of the proposal.
postgres=# BEGIN;
BEGIN
postgres=# ALTER TABLE foo ADD COLUMN c1 INT NOT NULL DEFAULT 1;
ALTER TABLE
Ideally at this point the ongoing transaction could treat c1 as a PUBLIC column and could perform reads and writes against it. Let's say that at this moment foo is at version 1 (foo@v1). In this proposal, during the execution of the above statement we'd perform a single-step schema change in a separate transaction. This single-step schema change would add the new column in the DELETE_ONLY state. In the delete-only state nobody could write to that column but other transactions would know how to interpret the column if they were to come across it. This table descriptor foo@v2 would additionally contain a reference to the ongoing transaction (perhaps this implies that the transaction needs to already have written a transaction record or something, tbd). No future schema changes on the table descriptor from other transactions will be allowed while the ongoing transaction remains live. When the side transaction commits it will then wait for all leases on the old version to be dropped.
One thing to note is that the ongoing transaction cannot lay down an intent on the table descriptor as that would hold off the side transactions. It needs to maintain the view of where it wants to move the table descriptor in-memory and only attempt to write to the table descriptor (or read from it) upon commit.
Once the side schema change commits and has evidence that all leases are up-to-date, the statement execution can return. At this point the transaction will hold in its in-memory state a version of the table descriptor which has the column in WRITE_ONLY with a pending mutation to perform a backfill that it intends to write upon COMMIT. It will also have an in-memory version of the table descriptor which has the new column as PUBLIC which it will use for the execution of the rest of the transaction.
Okay, now let's proceed with the example:
postgres=# INSERT INTO foo VALUES ('a', 1);
INSERT 0 1
postgres=# SELECT * FROM foo;
k | c1
---+----
a | 1
(1 row)
Here we see the ongoing transaction using the PUBLIC view of the new column.
postgres=# ALTER TABLE foo DROP COLUMN c1;
ALTER TABLE
To execute the above statement we'll need to publish (in a side transaction) a new provisional table descriptor that retains the logical layout of foo@v1 and wait for it to be leased.
postgres=# ALTER TABLE foo ADD COLUMN c2 INT NOT NULL DEFAULT 1;
ALTER TABLE
When we do another mutation we need to go through the same exact dance. All rows written during this ongoing transaction to the table being modified will need to be backfilled such that they have the physical layout of the table as described by the descriptor which will be committed. The backfill process determines which rows to rewrite based on their MVCC timestamp. Specifically it will only backfill rows which have an MVCC timestamp earlier than the ModificationTime (i.e. commit timestamp as of #40581) which moved the newly added column to WRITE_ONLY.
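That selection rule is simple enough to write down. A minimal sketch, assuming rows carry their MVCC timestamp and using an invented function name (ModificationTime comes from the text above):

```python
# Sketch of the backfill selection rule: only rows whose MVCC timestamp is
# earlier than the ModificationTime that moved the column to WRITE_ONLY
# need rewriting; rows the ongoing transaction wrote after that point
# already use the new physical layout.

def rows_needing_backfill(rows, modification_time):
    """rows: iterable of (mvcc_ts, payload); returns payloads to rewrite."""
    return [payload for mvcc_ts, payload in rows if mvcc_ts < modification_time]
```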
postgres=# COMMIT;
COMMIT
At this point we'll need to write the table descriptor to a new version which has the newly added columns in WRITE_ONLY with their pending backfill mutations.
For each DDL statement performing a physical change to the layout of the primary index in a transaction:
1. In a side transaction perform a separate schema change to inform concurrent transactions of the upcoming physical change.
2. Ensure that all writes in the transaction use the physical layout implied for the table at the moment that it commits.
cc @jordanlewis I think this proposal is less crazy than some of the other things that were discussed. That other approach being to change the format of key encoding to refer to the version for its physical layout. Such an encoding change would allow a table to meaningfully contain multiple physical layouts at a given MVCC timestamp. It also comes with a bunch of other complexity.
One thing the above comment missed deals with dropping columns which existed prior to the transaction. Today the table descriptor version created by a transaction which drops a column logically removes the column but does not physically remove the column. It is critical that the "provisional" table descriptor which is published while the transaction is still running does not logically change the table. I suspect this will require adding a new state for a column in the table descriptor, something like DROP_PENDING.
Also want to note that the above comment is much more of a blueprint for a design than an actual design; there is a large amount of hand-waving.
I like this exposition. It is at least quite inspiring.
Can you walk us through some example of what happens when the txn that contains the DDL gets aborted?
> Can you walk us through some example of what happens when the txn that contains the DDL gets aborted?
In my current imagining of the proposal the "provisional" table descriptor acts in many ways like an intent. These table descriptors will differ from intents in so far as they are not cleared upon abort or commit but rather remain a part of the durable history for the table descriptor. When a transaction which wrote such a table descriptor rolls back the coordinator would attempt to write a new table descriptor which reflects the state of the descriptor prior to the first provisional table descriptor of the transaction. Like intents, any actor can "remove" a provisional table descriptor (overwrite with a new version which logically rolls back the provisional change) so long as it is known that the transaction which wrote that descriptor has been aborted.
Another thing to keep in mind is rolling back to savepoints. For each savepoint we'll want to maintain the state of provisional table descriptors so that we can detect (1) tables for which newer provisional table descriptors have been written since the savepoint and (2) tables for which provisional table descriptors were written only after the savepoint. For (1) we'll need to publish a new provisional table descriptor which sets the logical state of the table to where it was at the point of the savepoint and (2) to remove the provisional table descriptor. An alternative for (2) would be to mark a sequence number of the provisional table descriptor and allow other transactions to note that the provisional table descriptor is not in use by an active transaction and can be "removed".
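The savepoint bookkeeping above can be sketched as a small planning function. Everything here is hypothetical structure (tables mapped to their latest provisional descriptor version), meant only to make cases (1) and (2) concrete:

```python
# Sketch of rollback-to-savepoint planning for provisional descriptors:
# case (1) tables with a newer provisional descriptor since the savepoint
# must get a new provisional descriptor republishing the older logical
# state; case (2) tables whose first provisional descriptor was written
# after the savepoint just get it removed.

def plan_savepoint_rollback(at_savepoint, now):
    """Both args: dict table -> latest provisional descriptor version."""
    republish = {t: v for t, v in at_savepoint.items()
                 if now.get(t, v) != v}                  # case (1)
    remove = [t for t in now if t not in at_savepoint]   # case (2)
    return republish, remove
```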
If that's all the same to you, I would prefer that the v1 of savepoints get merged and adopted so that the next person in charge of the schema change improvement does the savepoint logic and I'm not involved :smile_cat:
I also like this proposal.
As for savepoints, it would be great if they worked, but I also wouldn't block this work on supporting them - it would be fine if transactions containing DDLs can only be aborted in their entirety instead of rolling back to a savepoint.
This proposal makes our DDL (especially ADD COLUMN) more pg-compatible. But does it help with the non-atomicity problem that prompted this issue?
This discussion started with the addition of a unique constraint, not a column. It's one thing to add a new column to the descriptor in DELETE_ONLY state; this is semantically invisible. But to solve the atomicity problem, we'd need to start enforcing the constraint before the transaction that added it commits. That feels dirty (almost like READ UNCOMMITTED), although I think we can also look at it as just another kind of transaction conflict.
If that's semantically acceptable, is it practically implementable? Adding a column can proceed one step at a time: side transaction to introduce the column as DELETE_ONLY, then the transaction commit changes it to WRITE_ONLY and starts the backfill. If we're adding a unique index, I think we'd need to wait through the entire backfill before we could allow the ddl transaction to commit.
> This discussion started with the addition of a unique constraint, not a column.
Noted. I agree that my comment in no way addresses that issue. The title and some of the content seemed like the most relevant issue to which to attach the discussion. I don't have anything to add regarding the constraint atomicity problems.
I created https://github.com/cockroachdb/cockroach/issues/43057 as a separate issue.
@ajwerner I'm trying to understand your proposal, but I have several questions:
CockroachDB schema changes are online, meaning they don't block concurrent operations. Nodes lease their view of the table. That allows nodes to make assumptions about the format of data they will read. If a transaction were able to commit a row with a different format, then some other concurrent transaction would have its invariants violated. Hopefully that explanation sufficiently contextualizes the above proposal. There will be a detailed RFC on this very topic in the coming month.
This online schema change scheme is taken from the F1 paper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf. You can read the original design as it pertained to cockroach here: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151014_online_schema_change.md
The leasing protocol is described here: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151009_table_descriptor_lease.md
Related to #26508, #24785. Filing this in reaction to a recent thread with @mberhault
Today, CockroachDB accepts DDL statements inside explicit txns (BEGIN...COMMIT), but processes them after the non-DDL statements have been committed successfully.
If the DDL subsequently fails, the transaction is then both partially committed and partially rolled back. i.e. an atomicity violation.
If/when this happens, it is currently reported to the client upon COMMIT via the special error code XXA00 (TransactionCommittedWithSchemaChangeFailure).
There is a discussion to be had about whether this situation should be allowed to occur at all, but until it is resolved this issue will exist so it can be linked from the source code and docs for explanatory purposes.
Jira issue: CRDB-5394