@tbg Does this accurately reflect our conversation? @bdarnell This is still pretty handwave-y, but I'd love some initial feedback if you have a minute.
@danhhz yes it does! Not sure what your thinking here was, but it feels like there doesn't have to be anything special about "major" versions (those corresponding to releases) except that by convention the major version usually doesn't have a hook associated with it (because this hook won't have run by the time the major version is active, so it's better put in an earlier hook). For example, if we have the versions
19.1 (active) 19.1-1 19.1-2 19.1-3
(i.e. we're in the 19.2 cycle, and some migrations exist, but no 19.2 release version yet) then the data migration mechanism would move from 19.1-0 (the initial version) to 19.1-1, 19.1-2, 19.1-3 as you'd expect. Later, if that cluster got restarted into a binary that also has the release version 19.2 (=19.2-0), it'd move to that. I'm not sure whether the automatic upgrade mechanism does this today, but it should (and I hope it already does).
Were you planning to have an individual job for each transition (19.1-1 to 19.1-2 for example, and we only use jobs for transitions that have a hook)? This also would let us account for the duration of each step and make obvious to the user where we're at. This seems both most straightforward and conceptually easiest to me, better than trying to group multiple steps into one job.
As an aside, we're also planning to move to "negative" unstable versions at some point, so the above history would have taken the form
19.2-(-999) 19.2-(-998) 19.2-(-997) 19.2 (=19.2-0)
instead, so that a 19.2-alpha won't pretend to be compatible with a 19.1 release. I don't think that affects your plans here at all since in both you just move "step by step" through the unstable versions in the order in which they appear (and sort).
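To make the ordering concrete, here is a minimal sketch of how such versions compare and sort (illustrative only; the field names are stand-ins, not the real roachpb.Version definition):

```go
package versionorder

import (
	"fmt"
	"sort"
)

// version is an illustrative stand-in for a cluster version: a release
// (major.minor) plus an unstable component used for in-between versions.
type version struct {
	Major, Minor, Unstable int32
}

func (v version) Less(o version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Unstable < o.Unstable
}

func (v version) String() string {
	if v.Unstable < 0 {
		return fmt.Sprintf("%d.%d-(%d)", v.Major, v.Minor, v.Unstable)
	}
	return fmt.Sprintf("%d.%d-%d", v.Major, v.Minor, v.Unstable)
}

// Example prints [19.2-(-999) 19.2-(-998) 19.2-(-997) 19.2-0]: the unstable
// versions of the 19.2 cycle sort before the release version, so the
// migration still moves step by step in the order the versions appear.
func Example() {
	vs := []version{{19, 2, 0}, {19, 2, -997}, {19, 2, -999}, {19, 2, -998}}
	sort.Slice(vs, func(i, j int) bool { return vs[i].Less(vs[j]) })
	fmt.Println(vs)
}
```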
The hook, if given, runs after the feature gate is enabled.
What happens at the end of the hook? That is, how does the code and/or the admin learn that everything is finished and it's possible to move on to the next version? Do we now gossip two cluster versions, one tracking the feature gates and one for the migration completion?
The admin runs one command that triggers a job for all necessary migrations. I've been imagining that the job steps through each version. For each step, it first communicates the version to the code, unblocking feature gates (I'd like to have stronger guarantees about this than we've had in the past; how this works is the part that is currently most handwave-y), then runs the hook, then moves on to the next one. Progress would be communicated to the user via the job. I don't think two versions are necessary.
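For illustration, the coordinator loop described above could look roughly like the following (a sketch only; `versionStep`, `bumpClusterVersion`, and `recordJobProgress` are placeholder names, not an existing API):

```go
package upgradejob

import "context"

// versionStep pairs a cluster version with an optional long-running hook.
// This mirrors the proposal above; it is a sketch, not an existing API.
type versionStep struct {
	Version string
	Hook    func(context.Context) error
}

// runUpgradeJob walks the ordered list of versions between the currently
// active version and the target. For each step it first activates the
// feature gate everywhere, then runs the (possibly long-running) hook,
// then records progress on the job before moving on to the next version.
func runUpgradeJob(
	ctx context.Context,
	steps []versionStep,
	bumpClusterVersion func(context.Context, string) error, // placeholder for the gate push
	recordJobProgress func(context.Context, string) error, // placeholder for job bookkeeping
) error {
	for _, step := range steps {
		// 1. Unblock the feature gates for this version on every node.
		if err := bumpClusterVersion(ctx, step.Version); err != nil {
			return err
		}
		// 2. Run the associated migration hook, if any.
		if step.Hook != nil {
			if err := step.Hook(ctx); err != nil {
				return err
			}
		}
		// 3. Surface progress to the user via the job.
		if err := recordJobProgress(ctx, step.Version); err != nil {
			return err
		}
	}
	return nil
}
```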
So then the new step in the upgrade process will be something like "check the jobs page to make sure the migration job from the previous upgrade has finished before starting a new upgrade"? We should also consider scripting for this for embedded users who need to do upgrades in a hands-off way (and might bundle multiple upgrades in a relatively short time frame, and/or have pathological conditions that could cause migrations to get hung for a very long time).
Given that this unblocks the shiny new features as it goes, I was imagining the upgrade docs would have a step at the end directing the user to monitor the job until it finishes. But you're right that we should mention it at the beginning, too, just in case.
I'm in favor of making it scriptable. There's some bikeshedding to be done about whether it's one SQL command that blocks for the entire time or one to kick off the job and one to wait on it (IIRC the bulk io team has been favoring the latter recently).
One blocking command (strawman, ignore the syntax):
    > COMMIT TO UPGRADE;
    <waits a while>
Two commands (strawman, ignore the syntax):
    > COMMIT TO UPGRADE;
      job_id
      12345
    > BLOCK ON JOB 12345;
    <waits a while>
We could also wrap this SQL in a cli tool if it makes it easier for embedded users.
I share your concerns about multiple upgrades in a short time frame. In fact, this design was largely worked backward by starting with these assumptions: a) we'll need to build long-running migrations at some point b) we should make it really hard to mess up when upgrading one version and c) there should be some sane story for upgrading multiple versions.
I'm not entirely happy with how much this achieves (c) but at some point we fundamentally have to choose between blocking on startup if they're going too quickly or requiring long-running migrations to work while spanning multiple major versions (which makes them less useful for eliminating tech debt). I should make this more explicit above, but my current thinking is that if you start on major version X, roll to Y, then start rolling to Z without the Y upgrade being committed, then the Z nodes would block on startup.
Hung migrations are a good concern; I hadn't thought of that. I'm not sure what we could do besides putting something in whatever alerting we build and your idea to document checking in on it at the beginning of the next upgrade. Any opinions here?
Nathan pointed out in our 1:1 that my thinking around nodes rejoining after being partitioned off when a feature gate bump is pushed out is probably too harsh. @nvanbenschoten do you mind writing up your concern here so I'm sure I have it right?
I like the two command "block on job" variant because it's extensible to other kinds of jobs. (syntax strawman: SHOW UPGRADE STATUS instead of COMMIT TO UPGRADE. Don't use the verb COMMIT for something other than transaction commits)
then start rolling to Z without the Y upgrade being committed, then the Z nodes would block on startup.
Maybe the Z nodes should crash instead of blocking, which is more likely to trigger rolling downgrades in the deployment tooling. Either crashing or blocking is probably the best we can do here. Blocking is only better than crashing if we can be confident that some of the Y nodes will stick around to complete the process.
I'm not sure what we could do besides putting something in whatever alerting we build and your idea to document checking in on it at the beginning of the next upgrade. Any opinions here?
I don't think there's much we can do besides documenting it and providing tools to check the status and wait.
I whiteboarded the details of this with @tbg today and we realized that a lot of the scope I was hoping to cut to get this into 19.2 can't actually be cut. So, I now think this is a long shot. I'm going to keep working on the prototype and see where I get.
Note to self: another possible use of this from @ajkr
Also, I forgot to mention this, but Pebble might need to be compatible with multiple RocksDB versions, simply because upgrading to 20.1 doesn't necessarily mean the data was rewritten with Pebble by the time a user upgrades to 20.2. A full compaction would be needed to guarantee that. In fact, maybe they turn on Pebble in 20.1 and it sees data written by whatever RocksDB version we were using in 1.0. Is there anything preventing this?
Another possible use case for this would be the raft storage engine (https://github.com/cockroachdb/cockroach/issues/7807). If we're introducing a dedicated storage engine for "raft stuff", there has to be a cutover point for each node running a crdb version with this dedicated engine code to scoop up all existing raft data from the existing storage engine and funnel it into the new one. I had put down some thoughts two years ago here on how this could be done.
@irfansharif that can't just be done at startup? It's not clear from the RFC why this needs to be coordinated across replicas, given that the Raft log is currently stored in the unreplicated keyspace.
It can/should be, by "cutover point" I was looking at a node-centric view, not a cluster wide one. Mentioned it all here because the proposal above talks about unifying startup migrations with cluster versions.
Can we do it at startup though? Some of the other motivating examples could be done at startup, but we avoid long-running migrations at startup to protect against users rolling onto a version too fast. I'd be uncomfortable reading all raft data from one engine and writing it into another at startup, if that's what's being discussed here.
to protect against users rolling onto a version too fast.
Do you mind elaborating on this? I don't quite follow. We are discussing reading all raft data from one engine and writing it into another at startup.
If you're doing a vX to vX+1 migration, you roll each node onto the next version. Ideally, you roll one node, wait for it to health check, and then roll the next node, repeating until you've done them all.
The worrisome case is a user that rolls them without the health checks. If we do too much work when first starting a node in the new version, then rolling them too fast will result in a cluster that is down for as long as the migrations take. Obviously, the user should be doing the right thing here, but extended total unavailability when they mess up is pretty bad.
Another thing to consider here is that it's likely difficult to maintain the ability to roll back to vX with an at-startup migration to a new raft log engine.
Is there any more discussion about this? This would be very useful for the FK migration work right now (we want to upgrade all table descriptors as a migration).
I have it on my backburner to finish the prototype, but no one should block on me for anything. @irfansharif expressed an interest in this area at one point, dunno if anything has changed there.
The issue with table descriptors (for my own reference).
There's a new FK representation (as of 19.2) and we want to make sure all the table descriptors have it in 20.1. If old table descriptors (the 19.1 representation) are left lying around, we have to keep maintaining the "upgrade on read" code path introduced in 19.2. For a running cluster, there's currently the possibility that certain tables would not have been read since 19.1 and thus would still be using the old representation.
Having the migration story here ironed out would reduce this maintenance burden + build confidence in table descriptor version upgrades.
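For concreteness, the FK case could hypothetically be a hook shaped like the following: scan every descriptor, apply the same upgrade logic the 19.2 upgrade-on-read path already uses, and write the result back so no old-format descriptors can remain afterwards (a sketch; all names are placeholders, not existing APIs):

```go
package descmigration

import "context"

// descriptor and the function parameters below are placeholders for the real
// table descriptor type, the existing 19.2 "upgrade on read" FK logic, and a
// KV read/write layer.
type descriptor struct {
	ID    int64
	Bytes []byte
}

// upgradeAllDescriptors sketches the long-running hook for the FK migration:
// rewrite every descriptor once, so the upgrade-on-read fallback can be
// deleted in the following release.
func upgradeAllDescriptors(
	ctx context.Context,
	scanDescriptors func(context.Context) ([]descriptor, error),
	upgradeFKRepresentation func(descriptor) (descriptor, bool), // returns (upgraded, changed)
	writeDescriptor func(context.Context, descriptor) error,
) error {
	descs, err := scanDescriptors(ctx)
	if err != nil {
		return err
	}
	for _, d := range descs {
		if upgraded, changed := upgradeFKRepresentation(d); changed {
			if err := writeDescriptor(ctx, upgraded); err != nil {
				return err
			}
		}
	}
	return nil
}
```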
Handling jobs from 2.0 clusters may also be relevant: https://github.com/cockroachdb/cockroach/blob/d4166813ab419334d7c479367ba9d38b7212ed2e/pkg/jobs/registry.go#L766
It will be good to remove that case from the codebase.
One which @dt has brought up lately is that any migration which touches table descriptors must consider the fact that old table descriptors may be restored from an earlier backup. This can be handled by selectively re-applying the migration during RESTORE, but right now it's up to the developer to remember to implement this. It would be ideal to build this concept into the design of the long-running migration system so we don't have this liability.
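One hedged sketch of what "building that concept into the design" could look like: a descriptor-touching migration declares its per-descriptor transformation separately, so RESTORE can re-apply it to descriptors ingested from old backups without each developer having to remember to wire that up (all names here are hypothetical):

```go
package restoremigration

// descriptorMigration sketches how a migration that rewrites table
// descriptors could also declare its per-descriptor transformation, so that
// RESTORE can re-apply it to descriptors coming from old backups instead of
// relying on each developer remembering to handle that case.
type descriptorMigration struct {
	Name string
	// RunCluster performs the one-time, cluster-wide rewrite as part of the
	// long-running migration job.
	RunCluster func() error
	// UpgradeOne is applied by RESTORE to every descriptor it ingests.
	UpgradeOne func(desc []byte) ([]byte, error)
}
```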
Something @ajwerner and I discussed in #44007 was introducing some sort of version referencing system on table descriptors, which would also require a migration like this.
Given we're now working on this: wouldn't it be super nice if there was a way to write code where it was clear, by the structure, that code around xyz should be removed as soon as the major version is cut?
Consider "raft log entries being included in snapshots" for example. We no longer do this as of v19.2, and most recently did it in v19.1. 19.2 code then necessary had to deal with the possibility of log entries being included as part of snapshots, but we no longer need this in v20.1. I’ve only seen a random one-liner comment referring to the fact that this deletion work should be done before v20.1 is cut, but it’d be nice if it was more systematic than that.
RFC here: #48843.
the various Raft migrations (RangeAppliedState, unreplicated truncated state, etc) which all boil down to "run something through Raft until they're all active on all ranges but 99.99% sure they're all active already anyway"
https://github.com/cockroachdb/cockroach/pull/58088 uses the long running migrations infrastructure proposed in #48843 (x-ref linked PRs in #56107) to onboard exactly the above.
Backward compatibility is a necessary part of storage infrastructure, but we'd like to minimize the associated technical debt. Initially our migrations were one-off, required complicated reasoning, and were under-tested (example: the format version stuff that was added for column families and interleaved tables). Over time, we've added frameworks to help with this, but there's one notable gap in our story: long-running data migrations.
An example that's come up a number of times is rewriting all table descriptors in kv. An extreme example that we may have to do one day might be rewriting all data on disk.
Note that the format versions mentioned above do a mini-migration on each table descriptor as it enters the node (from kv, gossip, etc). Nothing guarantees that all table descriptors eventually get rewritten. So even though this has been around since before 1.0, the format version migrations have to stay around as permanent technical debt. The FK migration currently underway will have a similar problem.
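For reference, this is roughly the shape of the "upgrade on read" pattern that becomes permanent debt (illustrative stand-ins only, not the actual descriptor code):

```go
package formatversion

// formatVersion and tableDescriptor are illustrative stand-ins for the real
// descriptor protobuf and its FormatVersion field.
type formatVersion int

const (
	baseFormatVersion formatVersion = iota
	familyFormatVersion
	interleavedFormatVersion
)

type tableDescriptor struct {
	FormatVersion formatVersion
	// ... rest of the descriptor ...
}

// maybeUpgradeFormatVersion is the "upgrade on read" pattern: every place a
// descriptor enters the node (kv, gossip, ...) has to call this, forever,
// because nothing guarantees that every stored descriptor was ever rewritten.
// A one-time long-running migration would let this code be deleted.
func maybeUpgradeFormatVersion(desc *tableDescriptor) (changed bool) {
	if desc.FormatVersion >= interleavedFormatVersion {
		return false
	}
	// ... convert the old encoding in place ...
	desc.FormatVersion = interleavedFormatVersion
	return true
}
```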
The format version code is a small amount of debt, but it'd be nice to get rid of it. Other examples are not so simple. The learner replicas work in 19.2 replaces preemptive snapshots, but after we stop using preemptive snapshots, we need to completely flush them out of the system before the code can be deleted. One of these places is an interim state between when a preemptive snapshot has been written to rocksdb and when raft has caught up enough to make it into a real replica. To flush these out, after we stop sending preemptive snapshots, we'll need to iterate each range descriptor in kv and ensure that it is finished being added or is removed.
More examples from @tbg:
We currently have two main frameworks for migrations. They go by different names in various places, but I'll call them startup migrations and cluster versions.
Startup Migrations
Startup migrations (in package pkg/sqlmigrations and called "Cluster Upgrade Tool" by the RFC) are used to ensure that some backwards-compatible hook is run before the code that needs it. An example of this is adding a new system table. A node that doesn't know about a system table simply doesn't use it, so it's safe to add one in a mixed version cluster.
When the first node of the new version starts up and tries to join the cluster, it notices that the migration hasn't run, runs it, and saves the result durably in kv. Any nodes that start after see that the migration has been run and skip it.
If a second node of the new version tries to join the cluster before the migration has completed, it blocks until the migration finishes. This means that startup migrations need to be relatively quick. In an ideal world, every user would do version upgrades with a rolling restart, waiting for each node to be healthy and settled before moving on. But it's not Making Data Easy if a non-rolling restart blocks every node on some extremely long startup migration, causing a total outage.
On the other hand, by running at startup, all code in the new version can assume that startup migrations have been run. This means there doesn't need to be a fallback path and that any old code can be deleted immediately.
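Sketched out, the flow described above looks roughly like this (the real code lives in pkg/sqlmigrations and differs in detail; the bookkeeping callbacks here are placeholders):

```go
package startupmig

import "context"

// startupMigration and the bookkeeping callbacks below are placeholders for
// the migration descriptors and kv-backed records in pkg/sqlmigrations.
type startupMigration struct {
	Name string
	Run  func(context.Context) error
}

// runStartupMigrations sketches the behavior described above: each migration
// records completion durably in kv; a new-version node runs any missing ones
// before it starts serving, and other new-version nodes joining concurrently
// wait until the completion record appears.
func runStartupMigrations(
	ctx context.Context,
	migrations []startupMigration,
	alreadyRun func(ctx context.Context, name string) (bool, error),
	markDone func(ctx context.Context, name string) error,
) error {
	for _, m := range migrations {
		done, err := alreadyRun(ctx, m.Name)
		if err != nil {
			return err
		}
		if done {
			continue // some other node already ran and recorded it
		}
		// Every new-version node blocks on startup until this completes, so
		// these hooks need to stay quick; long-running work belongs in the
		// proposal below.
		if err := m.Run(ctx); err != nil {
			return err
		}
		if err := markDone(ctx, m.Name); err != nil {
			return err
		}
	}
	return nil
}
```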
Cluster Versions
Cluster versions are a special case of our gossip-based configuration settings. They allow for backward-incompatible behaviors to be gated by a promise from the user that they'll never downgrade. An example of this is adding a field to an existing rpc request protobuf. A node doesn't want to send such a request until it is sure that it will be handled correctly on the other end. Because of protobuf semantics, if it went to an old node, the field would be silently ignored.
Cluster versions tie together two concepts. First, a promise to the user that we don't have to worry about rpc or disk compatibility with old versions anymore. Second, a feature gate that is unlocked by that promise. There is an ordered list of these, one for each feature being gated.
Because the feature gate is initially off when a new version is rolled onto, each check of the gate needs to have a fallback. New features can return an error informing the user to finish the upgrade and bump the version, but other uses need to keep old code around for a release cycle.
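The resulting pattern at each call site looks something like this (illustrative; the gate name and helper functions are made up):

```go
package featuregate

// request, send, and computeNewField are stand-ins, and the gate name is made
// up. The point is the shape: every check of a gate needs a fallback until
// the gate (and the old code path) can be deleted a release cycle later.
type request struct {
	NewField string
}

func maybeUseNewField(
	isActive func(gate string) bool,
	send func(request) error,
	computeNewField func() string,
) error {
	var req request
	if isActive("VersionNewRPCField") {
		// The cluster version promises no node will downgrade, so every
		// recipient understands the new field.
		req.NewField = computeNewField()
	}
	// Otherwise the field is left unset; if it were sent to an old node it
	// would be silently ignored, which is exactly what the gate prevents us
	// from relying on.
	return send(req)
}
```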
Aside: To make it easier for users that don't want the control, the cluster version is automatically bumped some period of time after all nodes have been upgraded. A setting is available to disable this for users that want to keep the option to roll back until they've manually signed off on the new version.
Data Migrations
Summary: I propose a partial unification of these two systems. To avoid having three migration frameworks, the unified mechanism will be based on and replace Cluster Versions. The key change is to separate the two ClusterVersion concepts described above so that we can execute potentially long-running hooks in between.
The interface presented to the user is essentially the same, but slightly expanded. After rolling onto a new version of cockroach, some features are not available until the cluster version is bumped. Either the user does this manually or the auto-upgrade does. Instead of the version bump finishing immediately, a system job is created to do the upgrade, enabling the feature gates as it progresses. This is modeled as a system job both to make sure only one coordinator is running as well as exposing progress to the user. Once the job finishes (technically as each step finishes) the gated features become available to the user.
The interface presented to CockroachDB developers is also mostly the same. Each version in the list is now given the option of including a hook, which is allowed to take a while (TODO be more specific) to finish.
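Concretely, the developer-facing change is roughly that each entry in the existing ordered version list may now carry an optional hook (a sketch; these names are placeholders, not an existing API):

```go
package versionhooks

import "context"

// versionWithHook sketches the expanded developer-facing interface: the same
// ordered list of gated versions as today, where each entry may optionally
// carry a long-running hook that the upgrade job runs after that version's
// feature gate is enabled.
type versionWithHook struct {
	Key  string
	Hook func(ctx context.Context) error // optional; allowed to take a while
}

func rewriteAllDescriptors(ctx context.Context) error {
	// ... potentially long-running work, coordinated by the upgrade job ...
	return nil
}

var versions = []versionWithHook{
	{Key: "Version19_2"},                                            // release versions usually carry no hook
	{Key: "VersionSomeFeatureGate"},                                 // plain feature gate, as today
	{Key: "VersionRewriteDescriptors", Hook: rewriteAllDescriptors}, // gate plus migration hook
}
```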
Details: once a feature gate has been enabled, an IsActive(feature) check will always return true on every node from then on. This was not previously guaranteed, and the implementation is mostly tricky around handling nodes that are unavailable when the gate is pushed.
Side note: This is all very nearly the long-term vision laid out in https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20161024_cluster_upgrade_tool.md
Note to self: https://reviewable.io/reviews/cockroachdb/cockroach/38932#-LlaULyp9sd2JS_tyaKi:-LlaULyp9sd2JS_tyaKj:bfh2kib