erikgrinaker opened this issue 2 years ago
I'm going to move this to KV where we can then likely document it away. At the end of the day, all of these configurations are asynchronous. Perhaps we need some tooling or mode to allow customers to wait for up (or down) replication after creating tables but before using them in a production capacity. The tool one might lean on is the replication reports. I don't really feel this is a schema problem.
I do wonder if @irfansharif has feelings on what we should put in our documentation or whether we should invent new contracts. cc @mwang1026 for a rare KV-oriented product discussion regarding replication factors and their relationship to changing schema/zone configs from the customer perspective.
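For reference, "the replication reports" mentioned above can be polled today with something like the following. This is a rough sketch assuming the `system.replication_stats` report table populated by the replication reports feature; report freshness depends on the `kv.replication_reports.interval` cluster setting.

```sql
-- Sketch: poll the replication reports for zones that still have
-- under-replicated ranges. The reports are regenerated periodically
-- (per kv.replication_reports.interval), so a client would re-run
-- this until it returns no rows.
SELECT zone_id, total_ranges, under_replicated_ranges
FROM system.replication_stats
WHERE under_replicated_ranges > 0;
```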
This issue isn't just specific to `num_replicas`, it's for everything. If a table was split off from another that was pinned to regions A-C, and the new table was "created" pinned to regions D-F, we'd still observe its replicas start off at A-C. Inventing new contracts seems difficult (at least to this author) given configurations are propagated asynchronously. I agree it's very confusing to users though. The example above holds true for our MR tables, where you can technically create them with specific regions in mind, quickly fire off writes to them, and have them (temporarily) land in regions other than those specified because of propagation delays.
I was imagining we'd introduce helpful primitives that would let you wait for a schema object's zone configs to be fully conformed to (`WAIT UNTIL ...`) and document the invariant that future writes to that table would conform to a zone config at least as recent as the one that was waited on. For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes. Whether the table creation itself should use this `WAIT UNTIL` primitive before returning to the caller, I don't know.
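As a sketch of how such a primitive might look at the SQL level — the `WAIT UNTIL` statement below is entirely hypothetical and does not exist; only the surrounding statements are real CockroachDB syntax:

```sql
CREATE TABLE t (k INT PRIMARY KEY);
ALTER TABLE t CONFIGURE ZONE USING num_replicas = 5;

-- Hypothetical, not implemented: block until t's replicas conform to
-- the zone config that was current when the statement was issued.
WAIT UNTIL ZONE CONFIG OF TABLE t IS CONFORMED;

-- Invariant after the wait: subsequent writes to t conform to a zone
-- config at least as recent as the one waited on.
INSERT INTO t VALUES (1);
```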
A use case that comes to mind where this splitting may be important is `CREATE TABLE AS`. It seems plausible that somebody would like to wait for the table ranges to split off and get properly replicated before beginning to load it with data.
One thought that comes to mind is that (in the spanconfig world) we won't propagate the state until we get a closed timestamp, which potentially won't happen for a while. Perhaps it would be valuable to have some way to request that a timestamp be closed eagerly over the whole span upon commit. Then we could have the schema change job wait for span configs to be applied before doing anything. Even that may not be enough.
I guess the latency is independent of the waiting. I think we'll want to solve both problems at some point.
> A use case that comes to mind where this splitting may be important is `CREATE TABLE AS`. It seems plausible that somebody would like to wait for the table ranges to split off and get properly replicated before beginning to load it with data.
Backup restoration too. The issue that spawned this was a restore test failure where a node crash during restore caused quorum loss (the system range which the table split off from hadn't finished upreplicating yet following cluster creation). The upreplication activity would presumably also cause restoration to take longer.
> Perhaps we need some tooling or mode to allow customers to wait for up (or down) replication after creating tables but before using them in a production capacity. The tool one might lean on is the replication reports. I don't really feel this is a schema problem.
Maybe. My initial thought here was that, for a given zone config, we could create a standby-range with that configuration, and split off schema entities from the appropriate range which would then already be in the correct configuration. We could possibly implement that today with the existing KV primitives. But maybe that's too simplistic, I haven't given this much thought.
> This issue isn't just specific to `num_replicas`, it's for everything. If a table was split off from another that was pinned to regions A-C, and the new table was "created" pinned to regions D-F, we'd still observe its replicas start off at A-C. Inventing new contracts seems difficult (at least to this author) given configurations are propagated asynchronously. I agree it's very confusing to users though.
That's a good point, probably worth taking all of these considerations into account.
> I was imagining we'd introduce helpful primitives that would let you wait for a schema object's zone configs to be fully conformed to (`WAIT UNTIL ...`) and document the invariant that future writes to that table would conform to a zone config at least as recent as the one that was waited on. For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes. Whether the table creation itself should use this `WAIT UNTIL` primitive before returning to the caller, I don't know.
That might help. I'm coming at this from the loss-of-quorum angle though, where we'd really like to avoid ever being in a low-RF configuration, where node loss could wreck the cluster. But at least surfacing it to the user would be a good start.
From a UX perspective, this is a "SURPRISE!" situation where users, not knowing enough about the guts of what's happening, come to the table saying "this should never happen" 🤷 So I don't think that this can simply be documented away, especially given this inherits all zone configs.
re:

> For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.

How long would something like this take? Said another way, what's the mechanism by which the new table's ranges would apply the zone configs that the user expects them to have?
> After creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.

> How long would something like this take? Said another way, what's the mechanism by which the new table's ranges would apply the zone configs that the user expects them to have?

I'm not sure, we haven't done it. I'm guessing we could get it down to under a second.
> For our MR table example, after creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.

It seems like we might want to do the moral equivalent of this for the `RESTORE` use case at least.
> After creating the table a user could optionally wait for all the newly split off replicas of that table to be in conformance before issuing writes.

> How long would something like this take? Said another way, what's the mechanism by which the new table's ranges would apply the zone configs that the user expects them to have?

> I'm not sure, we haven't done it. I'm guessing we could get it down to under a second.
How hard would it be to stub something out, pressure test it, and time how long it takes? We could use `RESTORE` as a cheap way to get a lot of those `CREATE TABLE` statements in succession per Steven's comment.

I'm also just trying to think through the scenario. Would users be upset if a `CREATE TABLE` statement takes second(s) rather than ms? I don't think so, but something we can pressure test. In a `RESTORE` scenario I feel (without evidence) that data loading is the bulk of time spent rather than creating tables, unless it's a v. small restore.
To productionize any kind of polling, we'll want something like #70614 to be able to selectively use low closed timestamp target durations for the span configs infrastructure. It's not super difficult to generate some back of the envelope numbers, let's do it after we land + stabilize the last few pieces of #67679.
sg. thanks
Should we consider delaying splitting off ranges for out-of-conformance data? This would work well for the split queue, but I'm not as sure about an `AdminSplitRequest` coming from other parts of the system. For internal operations it is more efficient to create the ranges with the correct config rather than creating them and later fixing them.
cc @cockroachdb/disaster-recovery
As seen in #71377, a `CREATE TABLE` with `num_replicas = 3` can start out with a lower replication factor if it is split off from a range that happens to have a lower replication factor. The split range essentially inherits the replication factor and replicas from the LHS of the split, and then up/downreplicates as appropriate. This can make the table vulnerable to quorum loss until it is fully upreplicated.

Granted, this may be a subtle point since quorum loss on any range can cause cluster unavailability, so the low RF on the LHS range is already precarious. However, one could imagine scenarios where this could cause problems, e.g. having scratch tables or other low-importance tables with RF=1 mixed in with production tables with higher RF, where one could nuke the RF=1 ranges in case of quorum loss.

To avoid surprises, a new table should always start out with the configured RF, e.g. by having standby ranges with the appropriate RF for creating new tables.
To reproduce:

1. Create a `roachprod` cluster.
2. Create table `a` with an explicit RF=1.
3. Create table `b`, which inherits RF=3 from the `defaultdb` zone config.
4. Observe the cluster logs to see the table starting out with a 1-replica range and then upreplicating.
This does not happen when `a` starts out with RF=3.

Jira issue: CRDB-10874