Etcd cluster downgrade - Githubissues

YoyinZyc commented 4 years ago

Previous downgrade related PR(#11362, #11492) are both too big to review. Let's open this issue to track downgrade implementation progress this time. I'll break it down into small tasks and get them merged step by step. Design doc: https://docs.google.com/document/d/1mSihXRJz8ROhXf4r5WrBGc8aka-b8HKjo2VDllOc6ac/edit#heading=h.u32uhgogqjq2

TODO (EDIT by @serathius): Server side:

[x] Define downgrade rpc proto api. (https://github.com/etcd-io/etcd/pull/11715)
[x] Store downgrade info to backend (https://github.com/etcd-io/etcd/pull/11725)
[x] Implement downgrade validate, enable and cancel (https://github.com/etcd-io/etcd/pull/11801)
[x] Apply downgrade policy (https://github.com/etcd-io/etcd/pull/11919)
[x] Add http handler to enable downgrade info communication between each member (https://github.com/etcd-io/etcd/pull/12099)
[ ] Implement etcd storage versioning https://github.com/etcd-io/etcd/issues/13168
[x] Add downgrade api to client
[X] Add downgrade commands to etcdctl
[ ] Add e2e tests
- [x] Test simple downgrade upgrade scenario
- [ ] Test downgrade cancellation
[ ] Test linearizability during downgrade
[ ] Documentation

cc @jingyih @gyuho @wenjiaswe @jpbetz @lavalamp

jingyih commented 4 years ago

Thank you! I have reviewed the changes some time ago. I will re-review the changes.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

lavalamp commented 3 years ago

Are we still working on this?

YoyinZyc commented 3 years ago

Are we still working on this?

Sorry I dropped it for a long time. Basically, the server side is almost done except the integration test which I'll add it soon later. The client and etcdctl part could be done in v3.5 or v3.6 because v3.6 is the first version that will support downgrade to v3.5.

wenjiaswe commented 3 years ago

cc @ptabor

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

lilic commented 3 years ago

Seems like there are a couple of things missing from this issue, mainly the:

Add downgrade commands
Add downgrade api under Maintenance interface

@YoyinZyc Could you confirm this is the case and is this something you are going to continue working on? Thank you!

ptabor commented 3 years ago

@serathius might eventually continue work on this.

lilic commented 3 years ago

Nice, thanks for the update! I started experimenting a bit with this as we wanted to test it for 3.5 -> 3.4 downgrade. Will report my findings here.

I already patched the v3 client as I couldn't get it working otherwise if @serathius hasn't started yet, is it okay for me to open a PR for it?

serathius commented 3 years ago

I'm happy to finish work on cluster downgrade. @lilic thanks for the help, it would be great to collaborate. Feel free to send the PR just let's make sure to split the work to avoid conflicts.

lilic commented 3 years ago

Thanks, sounds great! Here is my PR https://github.com/etcd-io/etcd/pull/13083 to add Downgrade to the client, I had to do that as etcd doesn't have rpc reflection I couldn't use grpcurl to initiate the downgrade.

As mentioned there, the downgrade failed for me when I tried it, happy to provide further steps I did if you are interested.

lilic commented 3 years ago

I tried to perform a downgrade from v3.5.0 -> v3.4.16, but ran into a few issues:

I am not sure if this is meant to work for v3.5 -> v3.4, if yes we might need to fix a couple of things. Or was this intended for v3.5+. This is due to the cluster version is not updated to 3.4 but stays at 3.5 even after a downgrade is initiated. From what I can tell, the 3.4 doesn't know anything about downgrade so it tries to stay at 3.5 but it's running 3.4.16 at the time.

logs:

invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.16","determined-cluster-version":"3.5"

snapshot:

etcd-dump-db iterate-bucket snapshot-downgrade.db cluster
key="downgrade", value="{\"target-version\":\"3.4.0\",\"enabled\":true}"
key="clusterVersion", value="3.5.0"

Maybe I ran some steps incorrectly, I can also detail them here again if needed?

After some hacky modifications to v3.4.16 the cluster started but ran into issues when trying to read/write:

{"level":"warn","ts":"2021-06-09T13:44:56.234+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-bdf3dd55-977e-4fff-bddc-2117f67b81bd/xxxxx:2379","attempt":9,"error":"rpc error: code = Unavailable desc = etcdserver: not capable"}

I am wondering if the https://github.com/etcd-io/etcd/blob/main/CHANGELOG-3.5.md#storage-format-changes changes have anything to do with that, and we just don't support this kind of a downgrade from v3.5 -> v3.4?

serathius commented 3 years ago

Hey, thanks for catching this up. I will take a look.

serathius commented 3 years ago

I managed to reproduce same issues as @lilic on v3.5 -> v3.4 downgrade. Looks like downgrade policy changes https://github.com/etcd-io/etcd/pull/11919 were introduced in v3.5, but it shouldn't be an issue as cluster version should be lowered allowing older clusters to join.

Logic in v3.4 should allow to downgrade assuming that determines that cluster version is determined to be 3.4. From logs

invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.16","determined-cluster-version":"3.5"

Looks like this doesn't happen. Possibly because v3.4 runs from data created by v3.5 which saved to bolt that cluster version is v3.5. Will need to dig dipper here.

Whats is more worrying for me was the results of perspective v3.6 -> v3.5 downgrade. Downgrade went through, but the after replacing one cluster member, all other paniced. I recreated all members, cluster looked healthy, so I decided to repeat the procedure by upgrading back to v3.6. Unfortunatelly looks like the downgrade process was not marked as finished even though all members were in correct version.

I'm still learning the procedure and may not have done everything correctly, but looks like downgrade needs much more work and testing before releasing.

serathius commented 3 years ago

Ok, found original issue that was meant to fix v3.5 -> v3.4 downgrade. https://github.com/etcd-io/etcd/pull/12988 Will try to retry the steps based on it.

serathius commented 3 years ago

Analysed what exactly happends on v3.6 -> v3.5 downgrade:

M1, M2 - member 1 and 2 M3L - member 3 and a lead CV - cluster version DT - Downtgrade target version

Time each row it point in time. Time increases with lower rows. Cells store versions.

M1	M2	M3L	CV	DT	comment
3.6	3.6	3.6			init
3.6	3.6	3.6	3.6		Leader decides cluster version
3.6	3.6	3.6	3.6	3.5	Operator sends “downgrade enable 3.5“
3.5	3.6	3.6	3.6	3.5	Operator downgrades M1
3.5	3.6	3.6	3.5	3.5	Leader decides cluster version is “3.5”
3.5			3.5	3.5	When applying cluster version Leader and peer crashes as cluster with higher version is not allowed to join the version

Looks like there is a conflict between logic for selecting cluster version and downgrade. No matter the downgrade cluster version will be imminently changed when a new member with lower cluster version joins.

Proposal: It Members with higher version should be able to join during downgrade.

Reasons:

Cluster downgrade must happen to force everyone to use older etcd protocol
It can happen that members can go temporarily down during upgrade and need to rejoin.

hexfusion commented 3 years ago

great sleuthing @serathius! thank you :+1:

lilic commented 3 years ago

Thanks for the update!

Proposal: It Members with higher version should be able to join during downgrade.

+1

ptabor commented 3 years ago

1 minor version higher are OK. Do we plan to guard 2 minor versions difference ?

lilic commented 3 years ago

I would say yes, that should be a new downgrade then. I am not sure it would work smoothly from 3.5 -> 3.3, and we would have too much to keep in mind when developing, to not introduce breaking changes.

I guess we could add something like proposed in #13022 to allow more than 1 minor version.

serathius commented 3 years ago

@ptabor Yes, current guarding does that, but even more. We are not allowing joining members with version other then target version, so in case of downgrade from v3.5 -> v3.4, we only allow v3.4. Proposal is to expand this to allow both (v3.4 and v3.5) to handle disconnects.

During downgrade v3.5 -> v3.4, members with v3.6 should not be allowed to join.

serathius commented 3 years ago

Found another two problems, one is related to decision about upgrade being finished second to wal replay.

M1	M2	M3L	CV	DT	comment
3.6	3.6	3.6			init
3.6	3.6	3.6	3.6		Leader decides cluster version
3.6	3.6	3.6	3.6	3.5	Operator sends “downgrade enable 3.5“
3.5	3.6	3.6	3.6	3.5	Operator downgrades M1
3.5	3.6	3.6	3.5	3.5	Leader decides cluster version is “3.5”
3.5	3.6	3.6	3.5		Leader decides that downgrade has finished
3.5	3.5	3.6	3.5		Operator downgrades M2
3.5		3.6	3.5		M2 crashes when replaying wal

Logs from M2 when replaying wal:

{"level":"info","ts":"2021-06-11T14:57:38.439+0200","caller":"membership/cluster.go:523","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.0","to":"3.6"}
{"level":"fatal","ts":"2021-06-11T14:57:38.439+0200","caller":"membership/downgrade.go:70","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.5.0-downgrade.0","determined-cluster-version":"3.6","stacktrace":"..."}

go.etcd.io/etcd/server/v3/etcdserver/api/membership.mustDetectDowngrade
    etcd.io/etcd/server/etcdserver/api/membership/downgrade.go:70
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).SetVersion
    etcd.io/etcd/server/etcdserver/api/membership/cluster.go:540
go.etcd.io/etcd/server/v3/etcdserver.(*applierV2store).Put
    etcd.io/etcd/server/etcdserver/apply_v2.go:101
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyV2Request
    etcd.io/etcd/server/etcdserver/apply_v2.go:135
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntryNormal
    etcd.io/etcd/server/etcdserver/server.go:2184
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply
    etcd.io/etcd/server/etcdserver/server.go:2116
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries
    etcd.io/etcd/server/etcdserver/server.go:1357
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll
    etcd.io/etcd/server/etcdserver/server.go:1179
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8
    etcd.io/etcd/server/etcdserver/server.go:1111
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run
    etcd.io/etcd/pkg/schedule/schedule.go:157

First issue

Leaders decides that clusters has been downgraded when cluster version equal target version. As cluster version is defined as minimal server version this happens imminently when first downgraded member joins. First this seem too fast for me as technically operator didn't finish the downgrade, they just downgraded one node. On the other hand from cluster perspective, cluster version was lowered which means that cluster was brought to mixed state similar that would happen doing upgrade.

Proposal: Use server version to determine if cluster have been downgraded.

Reasons:

Downgrading one node is not enough for downgrade to be successful, if at any point this node would disconnect, leader would bump back the cluster version.
Having downgrade finish after one node would be confusing for operators and could lead to bad patterns as technically it's now possible to downgrade cluster again to even lower version.

This issue lead me to think about what is expected to happen when operator cancels the upgrade half way. Technically one node is enough to bring the cluster version down, which means that at this time canceling downgrade doesn't mean anything as downgrade has already happen. I think we should discuss what behavior is expected from downgrade process so we can rethink the original design.

Implementing the proposal makes whole downgrade process work. I have successfully rerun the downgrade twice (on separate clusters), still there was issue with wall that makes me feel that we are not done here.

Second issue

When member tries to rejoin the cluster it panics when replaying wal. Technically it should be possible for them to join, they have version matching the cluster version. Then what happens?

Downgrade was canceled and it was there to see it, so they saved this information to backed. When replaying the wal, member compares (cluster version from wal, it's local server version and latest downgrade info from backend). I'm still trying to figure out how it all works but those conclusions are based on testing case when upgrade succeeded after fixing first problem. In that case wal problem didn't happen, I assume the reason is that downgrade was not canceled, making the assertion pass. This still doesn't fix the problem as we cannot replay historical upgrades stored in wal with latest binary version.

Please let me know if this analysis and conclusions make sense. In the meantime I will try to learn more about how wal, state in memory and state stored in backend interacts.

wenjiaswe commented 3 years ago

First issue

This is definitely an issue, I think in the original design is that, during the process of downgrading, cluster version will be locked without changing to the lowest version, then we have a separate target version. The cluster version only changes when all nodes are changed to the target version. "cluster version is defined as minimal server version " This was there for the benefit of upgrading. Before, etcd only allow upgrading, and when a cluster is upgrading to a higher version, we want the cluster version to remain at the lowest version until all nodes are upgraded, which is not correct for downgrading scenario. So I think what we really want is for the cluster version to stay at the original version, not the lowest version, for both upgrading or downgrading scenario.

And yes, mustDetectDowngrade is there to prevent downgrade since etcd does not support downgrade before. It needs to be relaxed now that we support it.

For the determinatino about if cluster have been downgraded, I think your proposal of using server version is good. You may also want to make sure cluster version doens't change as the application who uses etcd may have some dependencies on the cluster version.

Second issue

I have a feeling that @YoyinZyc has seen that before, but I have not sure. @YoyinZyc do you remember this?

YoyinZyc commented 3 years ago

@serathius

Sorry I didn't follow this thread for a long time. My knowledge may be outdated.

As for your first issue, I remember my implementation will not let downgrade be cancelled when the cluster version is equal to target version. The leader will keep monitoring its followers and only if all the followers reach to target version the downgrade cancel request will be sent out and the downgrade status will be marked as done.

My big guess on why M2 is crushed based on the log you pasted here is the cluster version.

{"level":"fatal","ts":"2021-06-11T14:57:38.439+0200","caller":"membership/downgrade.go:70","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.5.0-downgrade.0","determined-cluster-version":"3.6","stacktrace":"..."}

Ideally, once you downgrade M1 to 3.5, the determined-cluster-version should be 3.5. Then M2 wants to join in as 3.5, the mustDetectDowngrade latch should not reject because the cluster version and server version should both be 3.5.

There may be some bug around restoring cluster version which cause this issue.

serathius commented 3 years ago

So I think what we really want is for the cluster version to stay at the original version, not the lowest version, for both upgrading or downgrading scenario.

For downgrade we could technically lock the cluster version for the upgrade process, but this would not work with cancel. If upgrade can be canceled we would automatically revert to original logic "pick lowest server version". With cancel, downgrading one node and canceling would be enough to consider cluster downgraded. I think we should settle on different definition of cluster version, a definition that has stable value no matter if we are upgrading, downgrading or recovering from canceled downgrade.

Proposal - cluster version based on quorum

I would propose to change definition of cluster version from minimum version of server, to be defined as a version of quorum of members. So two in three node clusters and three in five node clusters.

This has benefits like:

No need to lock cluster version during downgrade, as it will not change until majority of members are downgraded. This has also benefit of making the cancel operation stable.
Cluster version will not change when upgrading/downgrading to broken version as we will not be able to reach quorum.

Downside:

It's harder to prevent version 2 version difference between members. For example during upgrade v3.4 -> v3.5, it would be enough to upgrade three out of five members. As cluster version will bump to v3.5, we could initialize v3.5 -> v3.6 upgrade. As result we would have both v3.4 and v3.6 members in one cluster.

Proposal - Use a version target for both downgrade and upgrade

To eliminate downside from proposal 1 we can not only use quorum but also make upgrades and downgrades symmetrical, with only difference that upgrades are automatically initialized by leader when upgraded node is detected and downgrades are initialized by operator manually. In this case DowngradeTarget (DT), would be just changed to SecondaryVersion(SV). When set it would indicate that cluster is during version change and prevent members with version outside of [CV,SV] to join and operator to start downgrade in cluster with mixed versions.

Example states for v3.5

Normal operation: CV = v3.5, SV = nil -> only v3.6 members can join
Upgrade: CV=v3.5, SV=v3.6 -> prevents operator from initializing downgrade to v3.4
Downgrade: CV=v3.5, SV=v3.4 -> prevents members with version v3.6 from joining

One tricky thing is what to do when we reach majority and cluster version is updated. In that case we could swap versions CV and SV. We would need a additional bool to store direction of version change, but this way we can ensure that after reaching quorum, operator will need to finish the current downgrade/upgrade before starting a new one.

Example upgrade M1	M2	M3L	CV	SV	comment
3.4	3.4	3.4	3.4		init
3.5	3.4	3.4	3.4		Operator upgrades member M1
3.5	3.4	3.4	3.4	3.5	Leader detects upgrade and sets version target. This prevents operator from initializing downgrade when cluster is in mixed state.
3.5	3.5	3.4	3.4	3.5	Operator upgrades member M2
3.5	3.5	3.4	3.5	3.4	Based on majority leader upgrades cluster version by swapping it with secondary version. This way clusters we can prevent starting upgrade to v3.6 before upgrading M3 member.
3.5	3.5	3.5	3.5	3.4	Operator upgrades member M3
3.5	3.5	3.5	3.5		Leader notices that there are no remaining members with 3.4 version and clears up secondary version, opening cluster for further upgrade

This idea still needs some thought, but I think it will work as it introduces symmetrical mechanism for preventing large version drift from only upgrade to also work for downgrade. Decision if member can join is done by a member itself and it can make this decision only based on it's local state (cluster version, downgrade target/secondary version, binary version). To guarantee that member version difference in cluster is max <=1 that also work for downgrades we need to store at two values (min and max). No matter what names of versions we use (secondary version, target version etc), I think we need to use two values.

Please let me know if this goes in good direction or we should go back to original design. I would be happy to put it in a document where we can continue the discussions. No matter which approach we take I think we need this document to make the downgrade process understood by everyone.

YoyinZyc commented 3 years ago

@serathius Thank you for the proposals. I think proposal 2 sounds good but I have a question. If we introduce upgrade policy(I think it should be released in v3.6) , is the backend capable with previous version(v3.5)? If not, how do we upgrade from 3.5 to 3.6? The reason I have this concern is that when we add downgrade in 3.5, v3.5 cannot downgrade to v3.4 because backend incapability. I am not sure will you proposal have the similar problem?

serathius commented 3 years ago

In this discussion I focused on the control logic for coordinating zero-downtime downgrades and upgrades. This control logic should enforce invariant that upgrades should not be done in by more than one version. This works with assumption that backed is +/-1 version compatible and backend can be downgraded/updated by just running Etcd server on it.

Introducing backend compatibility guarantees that are clearly defined, tested and tooled is definitely required to make the downgrade/upgrade process reliable. I will propose something here too.

serathius commented 3 years ago

@YoyinZyc Thanks for bringing my attention to backend or etcd data backward compatibility. After giving thing a thought I found that problem is big enough that it deserves it's own dedicated design.

I have wrote down the design for etcd data versioning. It proposes to introduces idea of Storage version and proposes some restrictions to ensure that older etcd binaries can safety load data during downgrade without breaking Raft consensus requirements or causing unexpected behaviors. Please take a look https://docs.google.com/document/d/1yD0GDkxqWBPAax6jLZ97clwAz2Gp0Gux6xaTrtJ6wHE/edit?usp=sharing

cc @ptabor @lilic @YoyinZyc @jingyih @hexfusion Please let me know what you think

serathius commented 3 years ago

Created dedicated issue to track progress of implementation. https://github.com/etcd-io/etcd/issues/13168

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

etcd-io / etcd

Etcd cluster downgrade #11716

First issue

Second issue

Proposal - cluster version based on quorum

Proposal - Use a version target for both downgrade and upgrade