cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: `CREATE TABLE` inherits initial `num_replicas` from previous range #71977

Open erikgrinaker opened 2 years ago

erikgrinaker commented 2 years ago

As seen in #71377, a CREATE TABLE with num_replicas = 3 can start out with a lower replication factor if it is split off from a range that happens to have a lower replication factor. The split range essentially inherits the replication factor and replicas from the LHS of the split, and then up/downreplicates as appropriate. This can make the table vulnerable to quorum loss until it is fully upreplicated.

Granted, this may be a subtle point since quorum loss on any range can cause cluster unavailability, so the low RF on the LHS range is already precarious. However, one could imagine scenarios where this could cause problems, e.g. having scratch tables or other low-importance tables with RF=1 mixed in with production tables with higher RF, where one could nuke the RF=1 ranges in case of quorum loss.

To avoid surprises, a new table should always start out with the configured RF, e.g. by having standby ranges with the appropriate RF for creating new tables.

To reproduce:

  1. Start a three-node roachprod cluster:
    $ roachprod create local -n 3
    $ roachprod start local
  2. Wait for system ranges to finish upreplicating (e.g. look at DB Console underreplication stats).
  3. Create a table a with an explicit RF=1:
    ALTER DATABASE defaultdb CONFIGURE ZONE USING num_replicas = 1;
    CREATE TABLE a AS SELECT id, REPEAT('x', 1024) AS value FROM generate_series(1, 1e6) AS id;
    ALTER TABLE a CONFIGURE ZONE USING num_replicas = 1;
  4. Create a new table b that inherits RF=3 from the defaultdb zone config:
    ALTER DATABASE defaultdb CONFIGURE ZONE USING num_replicas = 3;
    CREATE TABLE b AS SELECT id, REPEAT('x', 1024) AS value FROM generate_series(1, 1e6) AS id;
  5. Observe the cluster logs to see the table starting out with a 1-replica range and then upreplicating:

    10:40:11.480122 2428 sql/table.go:165 ⋮ [n1,client=‹127.0.0.1:42620›,hostnossl,user=root] 695 queued new schema-change job 705083511538057217 for table 53, mutation 0
    10:40:11.488808 2428 jobs/registry.go:345 ⋮ [n1,client=‹127.0.0.1:42620›,hostnossl,user=root] 696 scheduled jobs [705083511538057217]
    10:40:11.489137 9129 kv/kvserver/replica_command.go:395 ⋮ [n1,split,s1,r67/1:‹/{Table/52/1/-…-Max}›] 697 initiating a split of this range at key ‹/Table/53› [r68] (‹span config›)‹›
    10:40:11.496153 9508 jobs/adopt.go:247 ⋮ [-] 698 job 705083511538057217: resuming execution
    10:40:11.498732 9511 jobs/registry.go:1168 ⋮ [n1] 699 SCHEMA CHANGE job 705083511538057217: stepping through state running with error: <nil>
    10:40:11.508079 9511 sql/schema_changer.go:592 ⋮ [n1,job=705083511538057217,scExec,id=53] 700 schema change on ‹"b"› (v1) starting execution...
    10:40:11.508121 9511 sql/schema_changer.go:383 ⋮ [n1,job=705083511538057217,scExec,id=53] 701 starting backfill for CREATE TABLE AS with query ‹"SELECT id, repeat('x':::STRING, 1024:::INT8) AS value FROM ROWS FROM (generate_series(1:::INT8, 1000000:::INT8)) AS id"›
    10:40:11.519922 9390 kv/kvserver/replica_command.go:2110 ⋮ [n1,replicate,s1,r68/1:‹/{Table/53-Max}›] 702 change replicas (add [(n2,s2):2LEARNER] remove []): existing descriptor r68:‹/{Table/53-Max}› [(n1,s1):1, next=2, gen=29]
    10:40:11.529347 9390 kv/kvserver/replica_raft.go:285 ⋮ [n1,s1,r68/1:‹/{Table/53-Max}›] 703 proposing SIMPLE(l2) [(n2,s2):2LEARNER]: after=[(n1,s1):1 (n2,s2):2LEARNER] next=3
    10:40:11.534480 9390 kv/kvserver/store_snapshot.go:1133 ⋮ [n1,replicate,s1,r68/1:‹/{Table/53-Max}›] 704 streamed INITIAL snapshot ‹f9b840cf› at applied index 14 to (n2,s2):2LEARNER in 0.00s @ ‹2.7 MiB›/s: ‹kv pairs: 7, log entries: 0›, rate-limit: ‹32 MiB›/s, queued: 0.00s
    10:40:11.547834 7416 kv/kvserver/replica_raftstorage.go:819 ⋮ [n2,s2,r68/2:{-}] 239 applying snapshot of type INITIAL [id=‹f9b840cf› index=14]
    10:40:11.590532 7416 kv/kvserver/replica_raftstorage.go:840 ⋮ [n2,s2,r68/2:‹/{Table/53-Max}›] 240 applied snapshot of type INITIAL [‹total=43ms›‹›‹ingestion=6@40ms›id=‹f9b840cf› index=14]
    10:40:11.591465 9390 kv/kvserver/replica_command.go:2110 ⋮ [n1,replicate,s1,r68/1:‹/{Table/53-Max}›] 706 change replicas (add [(n2,s2):2] remove []): existing descriptor r68:‹/{Table/53-Max}› [(n1,s1):1, (n2,s2):2LEARNER, next=3, gen=30]
    10:40:11.599840 9390 kv/kvserver/replica_raft.go:285 ⋮ [n1,s1,r68/1:‹/{Table/53-Max}›] 707 proposing SIMPLE(v2) [(n2,s2):2]: after=[(n1,s1):1 (n2,s2):2] next=3
    10:40:11.601521 9390 kv/kvserver/replica_command.go:2110 ⋮ [n1,replicate,s1,r68/1:‹/{Table/53-Max}›] 708 change replicas (add [(n3,s3):3LEARNER] remove []): existing descriptor r68:‹/{Table/53-Max}› [(n1,s1):1, (n2,s2):2, next=3, gen=31]
    10:40:11.610932 9390 kv/kvserver/replica_raft.go:285 ⋮ [n1,s1,r68/1:‹/{Table/53-Max}›] 709 proposing SIMPLE(l3) [(n3,s3):3LEARNER]: after=[(n1,s1):1 (n2,s2):2 (n3,s3):3LEARNER] next=4
    10:40:11.614270 9390 kv/kvserver/store_snapshot.go:1133 ⋮ [n1,replicate,s1,r68/1:‹/{Table/53-Max}›] 710 streamed INITIAL snapshot ‹73496069› at applied index 18 to (n3,s3):3LEARNER in 0.00s @ ‹16 MiB›/s: ‹kv pairs: 11, log entries: 0›, rate-limit: ‹32 MiB›/s, queued: 0.00s
    10:40:11.626665 6033 kv/kvserver/replica_raftstorage.go:819 ⋮ [n3,s3,r68/3:{-}] 245 applying snapshot of type INITIAL [id=‹73496069› index=18]
    10:40:11.663980 6033 kv/kvserver/replica_raftstorage.go:840 ⋮ [n3,s3,r68/3:‹/{Table/53-Max}›] 246 applied snapshot of type INITIAL [‹total=37ms›‹›‹ingestion=6@34ms›id=‹73496069› index=18]
    10:40:11.664987 9390 kv/kvserver/replica_command.go:2110 ⋮ [n1,replicate,s1,r68/1:‹/{Table/53-Max}›] 712 change replicas (add [(n3,s3):3] remove []): existing descriptor r68:‹/{Table/53-Max}› [(n1,s1):1, (n2,s2):2, (n3,s3):3LEARNER, next=4, gen=32]
    10:40:11.676850 9390 kv/kvserver/replica_raft.go:285 ⋮ [n1,s1,r68/1:‹/{Table/53-Max}›] 713 proposing SIMPLE(v3) [(n3,s3):3]: after=[(n1,s1):1 (n2,s2):2 (n3,s3):3] next=4

    This does not happen when a starts out with RF=3:

    11:07:29.958953 2547 sql/table.go:165 ⋮ [n1,client=‹127.0.0.1:42734›,hostnossl,user=root] 614 queued new schema-change job 705088880505487361 for table 53, mutation 0
    11:07:29.977774 2547 jobs/registry.go:345 ⋮ [n1,client=‹127.0.0.1:42734›,hostnossl,user=root] 615 scheduled jobs [705088880505487361]
    11:07:29.978699 6695 kv/kvserver/replica_command.go:395 ⋮ [n1,split,s1,r79/1:‹/{Table/52/1/-…-Max}›] 616 initiating a split of this range at key ‹/Table/53› [r49] (‹span config›)‹›
    11:07:29.987202 7134 jobs/adopt.go:247 ⋮ [-] 617 job 705088880505487361: resuming execution
    11:07:29.989529 7569 jobs/registry.go:1168 ⋮ [n1] 618 SCHEMA CHANGE job 705088880505487361: stepping through state running with error: <nil>
    11:07:30.003346 7569 sql/schema_changer.go:592 ⋮ [n1,job=705088880505487361,scExec,id=53] 619 schema change on ‹"b"› (v1) starting execution...
    11:07:30.003396 7569 sql/schema_changer.go:383 ⋮ [n1,job=705088880505487361,scExec,id=53] 620 starting backfill for CREATE TABLE AS with query ‹"SELECT id, repeat('x':::STRING, 1024:::INT8) AS value FROM ROWS FROM (generate_series(1:::INT8, 1000000:::INT8)) AS id"›
    11:07:30.569436 7635 kv/kvserver/replica_command.go:395 ⋮ [n1,s1,r49/1:‹/{Table/53-Max}›] 621 initiating a split of this range at key ‹/Table/53/1/-9222246136947933184› [r50] (‹manual›)‹›
    11:07:30.725553 7635 kv/kvserver/replica_command.go:395 ⋮ [n1,s1,r50/1:‹/{Table/53/1/-…-Max}›] 622 initiating a split of this range at key ‹/Table/53/1/-9222246136947885149› [r51] (‹manual›)‹›
    11:07:30.974574 1193 kv/kvserver/replica_command.go:395 ⋮ [n2,s2,r51/2:‹/{Table/53/1/-…-Max}›] 155 initiating a split of this range at key ‹/Table/53/1/-9222246136947837122› [r80] (‹manual›)‹; delayed by 0.0s to resolve: replica is raft follower (without success)›
    11:07:31.538326 5143 kv/kvserver/replica_command.go:395 ⋮ [n2,s2,r80/2:‹/{Table/53/1/-…-Max}›] 156 initiating a split of this range at key ‹/Table/53/1/-9222246136947759004› [r81] (‹manual›)‹›
    11:07:31.824042 5186 kv/kvserver/replica_command.go:395 ⋮ [n2,s2,r81/2:‹/{Table/53/1/-…-Max}›] 157 initiating a split of this range at key ‹/Table/53/1/-9222246136947710977› [r82] (‹manual›)‹›
    11:07:37.055773 7569 sql/schema_changer.go:406 ⋮ [n1,job=705088880505487361,scExec,id=53] 628 making table public
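The upreplication above can also be observed from SQL rather than the logs; a hedged sketch, assuming the `replicas` column of `SHOW RANGES` (an array of store IDs) and standard array builtins:

```sql
-- Right after CREATE TABLE b, list each of its ranges with the number
-- of replicas it currently has; a freshly split range may briefly
-- report fewer than the configured num_replicas = 3.
SELECT range_id, replicas, array_length(replicas, 1) AS replica_count
  FROM [SHOW RANGES FROM TABLE b];

-- For comparison, the replication factor the table is configured with:
SHOW ZONE CONFIGURATION FOR TABLE b;
```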

Jira issue: CRDB-10874

ajwerner commented 2 years ago

I'm going to move this to KV where we can then likely document it away. At the end of the day, all of these configurations are asynchronous. Perhaps we need some tooling or mode to allow customers to wait for up (or down) replication after creating tables but before using them in a production capacity. The tool one might lean on is the replication reports. I don't really feel this is a schema problem.

I do wonder if @irfansharif has feelings on what we should put in our documentation or whether we should invent new contracts. cc @mwang1026 for a rare KV-oriented product discussion regarding replication factors and their relationship to changing schema/zone configs from the customer perspective.

irfansharif commented 2 years ago

This issue isn't specific to num_replicas; it applies to every zone config attribute. If a table was split off from another that was pinned to regions A-C, and the new table was "created" pinned to regions D-F, we'd still observe its replicas start off in A-C. Inventing new contracts seems difficult (at least to this author) given configurations are propagated asynchronously. I agree it's very confusing to users, though. The example above holds true for our MR tables, where you can technically create them with specific regions in mind, quickly fire off writes to them, and have those writes (temporarily) land in regions other than those specified because of propagation delays.

I was imagining we'd introduce helpful primitives that would let you wait for a schema object's zone configs to be fully conformed to (WAIT UNTIL ...) and document the invariant that future writes to that table would conform to a zone config at least as recent as the one that was waited on. For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes. Whether the table creation itself should use this WAIT UNTIL primitive before returning to the caller, I don't know.
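In the meantime, the closest approximation is client-side polling of the replication reports; a minimal sketch, assuming the `system.replication_stats` report table and a `crdb_internal.zones` lookup by `table_name` (column names unverified here):

```sql
-- Poll until the new table's zone reports no under-replicated ranges.
-- The replication reports are refreshed periodically, so this lags by
-- up to the report interval.
SELECT s.total_ranges, s.under_replicated_ranges
  FROM system.replication_stats AS s
  JOIN crdb_internal.zones AS z ON z.zone_id = s.zone_id
 WHERE z.table_name = 'b';

-- Issue writes only once under_replicated_ranges has reached 0.
```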

ajwerner commented 2 years ago

A use case that comes to mind where this splitting may be important is CREATE TABLE AS. It seems plausible that somebody would like to wait for the table ranges to split off and get properly replicated before beginning to load it with data.

One thought that comes to mind is that (in the spanconfig world) we won't propagate the state until we get a closed timestamp. We won't get a closed timestamp potentially for a while. Perhaps something that would be valuable would be some way to request that a timestamp be closed eagerly at the present over the whole span upon commit. Then we could have the schema change job wait for span configs to be applied before doing anything. Even that may not be enough.

I guess the latency is independent of the waiting. I think we'll want to solve both problems at some point.

erikgrinaker commented 2 years ago

> A use case that comes to mind where this splitting may be important is CREATE TABLE AS. It seems plausible that somebody would like to wait for the table ranges to split off and get properly replicated before beginning to load it with data.

Backup restoration too. The issue that spawned this was a restore test failure where a node crash during restore caused quorum loss (the system range which the table split off from hadn't finished upreplicating yet following cluster creation). The upreplication activity would presumably also cause restoration to take longer.

> Perhaps we need some tooling or mode to allow customers to wait for up (or down) replication after creating tables but before using them in a production capacity. The tool one might lean on is the replication reports. I don't really feel this is a schema problem.

Maybe. My initial thought here was that, for a given zone config, we could create a standby-range with that configuration, and split off schema entities from the appropriate range which would then already be in the correct configuration. We could possibly implement that today with the existing KV primitives. But maybe that's too simplistic, I haven't given this much thought.

> This issue isn't just specific to num_replicas, it's for everything. If a table was split off from another that was pinned to regions A-C, and the new table was "created" pinned to regions D-F, we'd still observe its replicas start off at A-C. Inventing new contracts seems difficult (at least to this author) given configurations are propagated asynchronously. I agree it's very confusing to users though.

That's a good point, probably worth taking all of these considerations into account.

> I was imagining we'd introduce helpful primitives that would let you wait for a schema object's zone configs to be fully conformed to (WAIT UNTIL ...) and document the invariant that future writes to that table would conform to a zone config at least as recent as the one that was waited on. For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes. Whether the table creation itself should use this WAIT UNTIL primitive before returning to the caller, I don't know.

That might help. I'm coming at this from the loss-of-quorum angle though, where we'd really like to avoid being in a low-RF configuration at all, since node loss could wreck the cluster. But at least surfacing it to the user would be a good start.

mwang1026 commented 2 years ago

From a UX perspective, this is a "SURPRISE!" situation where users, not knowing enough about the guts of what's happening, come to the table saying "this should never happen" 🤷 So I don't think this can simply be documented away, especially given this inherits all zone configs.

> For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.

How long would something like this take? Said another way, what's the mechanism by which the new table's ranges would apply the zone configs that the user expects them to have?

irfansharif commented 2 years ago

> After creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.
>
> How long would something like this take? Said another way, what's the mechanism by which the new table's ranges would apply the zone configs that the user expects them to have?

I'm not sure, we haven't done it. I'm guessing we could get it down to under a second.

stevendanna commented 2 years ago

> For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.

It seems like we might want to do the moral equivalent of this for the RESTORE use case at least.

mwang1026 commented 2 years ago

> After creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.
>
> How long would something like this take? Said another way, what's the mechanism by which the new table's ranges would apply the zone configs that the user expects them to have?
>
> I'm not sure, we haven't done it. I'm guessing we could get it down to under a second.

How hard would it be to stub something out, pressure test it and time how long it takes? We could use RESTORE as a cheap way to get a lot of those CREATE TABLE in succession per Steven's comment.

I'm also just trying to think through the scenario. Would users be upset if a CREATE TABLE statement takes second(s) rather than ms? I don't think so, but that's something we can pressure test. In a RESTORE scenario I feel (without evidence) that data loading accounts for the bulk of the time rather than table creation, unless it's a very small restore.

irfansharif commented 2 years ago

To productionize any kind of polling, we'll want something like #70614 to be able to selectively use low closed timestamp target durations for the span configs infrastructure. It's not super difficult to generate some back of the envelope numbers, let's do it after we land + stabilize the last few pieces of #67679.
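For reference, the only knob available today is cluster-wide; a hedged example, assuming the setting keeps its current name and default (3s):

```sql
-- Lowering the closed timestamp target duration speeds up span config
-- propagation, but it affects the whole cluster (follower reads, etc.),
-- which is why #70614 proposes applying a low target selectively.
SET CLUSTER SETTING kv.closed_timestamp.target_duration = '200ms';
```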

mwang1026 commented 2 years ago

sg. thanks

andrewbaptist commented 9 months ago

Should we consider delaying splitting off ranges for out-of-conformance data? This would work well for the split queue, but I'm less sure about an AdminSplitRequest coming from other parts of the system. For internal operations it is more efficient to create ranges with the correct config rather than creating them and fixing them up later.

blathers-crl[bot] commented 8 months ago

cc @cockroachdb/disaster-recovery