Open YoyinZyc opened 4 years ago
Thank you! I have reviewed the changes some time ago. I will re-review the changes.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Are we still working on this?
Are we still working on this?
Sorry I dropped it for a long time. Basically, the server side is almost done except the integration test which I'll add it soon later. The client and etcdctl part could be done in v3.5 or v3.6 because v3.6 is the first version that will support downgrade to v3.5.
cc @ptabor
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Seems like there are a couple of things missing from this issue, mainly the:
@YoyinZyc Could you confirm this is the case and is this something you are going to continue working on? Thank you!
@serathius might eventually continue work on this.
Nice, thanks for the update! I started experimenting a bit with this as we wanted to test it for 3.5 -> 3.4 downgrade. Will report my findings here.
I already patched the v3 client as I couldn't get it working otherwise if @serathius hasn't started yet, is it okay for me to open a PR for it?
I'm happy to finish work on cluster downgrade. @lilic thanks for the help, it would be great to collaborate. Feel free to send the PR just let's make sure to split the work to avoid conflicts.
Thanks, sounds great! Here is my PR https://github.com/etcd-io/etcd/pull/13083 to add Downgrade to the client, I had to do that as etcd doesn't have rpc reflection I couldn't use grpcurl
to initiate the downgrade.
As mentioned there, the downgrade failed for me when I tried it, happy to provide further steps I did if you are interested.
I tried to perform a downgrade from v3.5.0 -> v3.4.16, but ran into a few issues:
logs:
invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.16","determined-cluster-version":"3.5"
snapshot:
etcd-dump-db iterate-bucket snapshot-downgrade.db cluster
key="downgrade", value="{\"target-version\":\"3.4.0\",\"enabled\":true}"
key="clusterVersion", value="3.5.0"
Maybe I ran some steps incorrectly, I can also detail them here again if needed?
{"level":"warn","ts":"2021-06-09T13:44:56.234+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-bdf3dd55-977e-4fff-bddc-2117f67b81bd/xxxxx:2379","attempt":9,"error":"rpc error: code = Unavailable desc = etcdserver: not capable"}
I am wondering if the https://github.com/etcd-io/etcd/blob/main/CHANGELOG-3.5.md#storage-format-changes changes have anything to do with that, and we just don't support this kind of a downgrade from v3.5 -> v3.4?
Hey, thanks for catching this up. I will take a look.
I managed to reproduce same issues as @lilic on v3.5 -> v3.4 downgrade. Looks like downgrade policy changes https://github.com/etcd-io/etcd/pull/11919 were introduced in v3.5, but it shouldn't be an issue as cluster version should be lowered allowing older clusters to join.
Logic in v3.4 should allow to downgrade assuming that determines that cluster version is determined to be 3.4. From logs
invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.16","determined-cluster-version":"3.5"
Looks like this doesn't happen. Possibly because v3.4 runs from data created by v3.5 which saved to bolt that cluster version is v3.5. Will need to dig dipper here.
Whats is more worrying for me was the results of perspective v3.6 -> v3.5 downgrade. Downgrade went through, but the after replacing one cluster member, all other paniced. I recreated all members, cluster looked healthy, so I decided to repeat the procedure by upgrading back to v3.6. Unfortunatelly looks like the downgrade process was not marked as finished even though all members were in correct version.
I'm still learning the procedure and may not have done everything correctly, but looks like downgrade needs much more work and testing before releasing.
Ok, found original issue that was meant to fix v3.5 -> v3.4 downgrade. https://github.com/etcd-io/etcd/pull/12988 Will try to retry the steps based on it.
Analysed what exactly happends on v3.6 -> v3.5 downgrade:
M1, M2 - member 1 and 2 M3L - member 3 and a lead CV - cluster version DT - Downtgrade target version
Time each row it point in time. Time increases with lower rows. Cells store versions.
M1 | M2 | M3L | CV | DT | comment |
---|---|---|---|---|---|
3.6 | 3.6 | 3.6 | init | ||
3.6 | 3.6 | 3.6 | 3.6 | Leader decides cluster version | |
3.6 | 3.6 | 3.6 | 3.6 | 3.5 | Operator sends “downgrade enable 3.5“ |
3.5 | 3.6 | 3.6 | 3.6 | 3.5 | Operator downgrades M1 |
3.5 | 3.6 | 3.6 | 3.5 | 3.5 | Leader decides cluster version is “3.5” |
3.5 | 3.5 | 3.5 | When applying cluster version Leader and peer crashes as cluster with higher version is not allowed to join the version |
Looks like there is a conflict between logic for selecting cluster version and downgrade. No matter the downgrade cluster version will be imminently changed when a new member with lower cluster version joins.
Proposal: It Members with higher version should be able to join during downgrade.
Reasons:
great sleuthing @serathius! thank you :+1:
Thanks for the update!
Proposal: It Members with higher version should be able to join during downgrade.
+1
1 minor version higher are OK. Do we plan to guard 2 minor versions difference ?
I would say yes, that should be a new downgrade then. I am not sure it would work smoothly from 3.5 -> 3.3, and we would have too much to keep in mind when developing, to not introduce breaking changes.
I guess we could add something like proposed in #13022 to allow more than 1 minor version.
@ptabor Yes, current guarding does that, but even more. We are not allowing joining members with version other then target version, so in case of downgrade from v3.5 -> v3.4, we only allow v3.4. Proposal is to expand this to allow both (v3.4 and v3.5) to handle disconnects.
During downgrade v3.5 -> v3.4, members with v3.6 should not be allowed to join.
Found another two problems, one is related to decision about upgrade being finished second to wal replay.
M1 | M2 | M3L | CV | DT | comment |
---|---|---|---|---|---|
3.6 | 3.6 | 3.6 | init | ||
3.6 | 3.6 | 3.6 | 3.6 | Leader decides cluster version | |
3.6 | 3.6 | 3.6 | 3.6 | 3.5 | Operator sends “downgrade enable 3.5“ |
3.5 | 3.6 | 3.6 | 3.6 | 3.5 | Operator downgrades M1 |
3.5 | 3.6 | 3.6 | 3.5 | 3.5 | Leader decides cluster version is “3.5” |
3.5 | 3.6 | 3.6 | 3.5 | Leader decides that downgrade has finished | |
3.5 | 3.5 | 3.6 | 3.5 | Operator downgrades M2 | |
3.5 | 3.6 | 3.5 | M2 crashes when replaying wal |
Logs from M2 when replaying wal:
{"level":"info","ts":"2021-06-11T14:57:38.439+0200","caller":"membership/cluster.go:523","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.0","to":"3.6"}
{"level":"fatal","ts":"2021-06-11T14:57:38.439+0200","caller":"membership/downgrade.go:70","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.5.0-downgrade.0","determined-cluster-version":"3.6","stacktrace":"..."}
go.etcd.io/etcd/server/v3/etcdserver/api/membership.mustDetectDowngrade
etcd.io/etcd/server/etcdserver/api/membership/downgrade.go:70
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).SetVersion
etcd.io/etcd/server/etcdserver/api/membership/cluster.go:540
go.etcd.io/etcd/server/v3/etcdserver.(*applierV2store).Put
etcd.io/etcd/server/etcdserver/apply_v2.go:101
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyV2Request
etcd.io/etcd/server/etcdserver/apply_v2.go:135
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntryNormal
etcd.io/etcd/server/etcdserver/server.go:2184
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply
etcd.io/etcd/server/etcdserver/server.go:2116
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries
etcd.io/etcd/server/etcdserver/server.go:1357
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll
etcd.io/etcd/server/etcdserver/server.go:1179
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8
etcd.io/etcd/server/etcdserver/server.go:1111
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run
etcd.io/etcd/pkg/schedule/schedule.go:157
Leaders decides that clusters has been downgraded when cluster version equal target version. As cluster version is defined as minimal server version this happens imminently when first downgraded member joins. First this seem too fast for me as technically operator didn't finish the downgrade, they just downgraded one node. On the other hand from cluster perspective, cluster version was lowered which means that cluster was brought to mixed state similar that would happen doing upgrade.
Proposal: Use server version to determine if cluster have been downgraded.
Reasons:
This issue lead me to think about what is expected to happen when operator cancels the upgrade half way. Technically one node is enough to bring the cluster version down, which means that at this time canceling downgrade doesn't mean anything as downgrade has already happen. I think we should discuss what behavior is expected from downgrade process so we can rethink the original design.
Implementing the proposal makes whole downgrade process work. I have successfully rerun the downgrade twice (on separate clusters), still there was issue with wall that makes me feel that we are not done here.
When member tries to rejoin the cluster it panics when replaying wal. Technically it should be possible for them to join, they have version matching the cluster version. Then what happens?
Downgrade was canceled and it was there to see it, so they saved this information to backed. When replaying the wal, member compares (cluster version from wal, it's local server version and latest downgrade info from backend). I'm still trying to figure out how it all works but those conclusions are based on testing case when upgrade succeeded after fixing first problem. In that case wal problem didn't happen, I assume the reason is that downgrade was not canceled, making the assertion pass. This still doesn't fix the problem as we cannot replay historical upgrades stored in wal with latest binary version.
Please let me know if this analysis and conclusions make sense. In the meantime I will try to learn more about how wal, state in memory and state stored in backend interacts.
First issue
This is definitely an issue, I think in the original design is that, during the process of downgrading, cluster version will be locked without changing to the lowest version, then we have a separate target version. The cluster version only changes when all nodes are changed to the target version. "cluster version is defined as minimal server version " This was there for the benefit of upgrading. Before, etcd only allow upgrading, and when a cluster is upgrading to a higher version, we want the cluster version to remain at the lowest version until all nodes are upgraded, which is not correct for downgrading scenario. So I think what we really want is for the cluster version to stay at the original version, not the lowest version, for both upgrading or downgrading scenario.
And yes, mustDetectDowngrade is there to prevent downgrade since etcd does not support downgrade before. It needs to be relaxed now that we support it.
For the determinatino about if cluster have been downgraded, I think your proposal of using server version is good. You may also want to make sure cluster version doens't change as the application who uses etcd may have some dependencies on the cluster version.
Second issue
I have a feeling that @YoyinZyc has seen that before, but I have not sure. @YoyinZyc do you remember this?
@serathius
Sorry I didn't follow this thread for a long time. My knowledge may be outdated.
As for your first issue, I remember my implementation will not let downgrade be cancelled when the cluster version is equal to target version. The leader will keep monitoring its followers and only if all the followers reach to target version the downgrade cancel request will be sent out and the downgrade status will be marked as done.
My big guess on why M2 is crushed based on the log you pasted here is the cluster version.
{"level":"fatal","ts":"2021-06-11T14:57:38.439+0200","caller":"membership/downgrade.go:70","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.5.0-downgrade.0","determined-cluster-version":"3.6","stacktrace":"..."}
Ideally, once you downgrade M1 to 3.5, the determined-cluster-version
should be 3.5. Then M2 wants to join in as 3.5, the mustDetectDowngrade
latch should not reject because the cluster version and server version should both be 3.5.
There may be some bug around restoring cluster version which cause this issue.
So I think what we really want is for the cluster version to stay at the original version, not the lowest version, for both upgrading or downgrading scenario.
For downgrade we could technically lock the cluster version for the upgrade process, but this would not work with cancel. If upgrade can be canceled we would automatically revert to original logic "pick lowest server version". With cancel, downgrading one node and canceling would be enough to consider cluster downgraded. I think we should settle on different definition of cluster version, a definition that has stable value no matter if we are upgrading, downgrading or recovering from canceled downgrade.
I would propose to change definition of cluster version from minimum version of server, to be defined as a version of quorum of members. So two in three node clusters and three in five node clusters.
This has benefits like:
Downside:
To eliminate downside from proposal 1 we can not only use quorum but also make upgrades and downgrades symmetrical, with only difference that upgrades are automatically initialized by leader when upgraded node is detected and downgrades are initialized by operator manually. In this case DowngradeTarget (DT), would be just changed to SecondaryVersion(SV). When set it would indicate that cluster is during version change and prevent members with version outside of [CV,SV] to join and operator to start downgrade in cluster with mixed versions.
Example states for v3.5
One tricky thing is what to do when we reach majority and cluster version is updated. In that case we could swap versions CV and SV. We would need a additional bool to store direction of version change, but this way we can ensure that after reaching quorum, operator will need to finish the current downgrade/upgrade before starting a new one.
Example upgrade M1 | M2 | M3L | CV | SV | comment |
---|---|---|---|---|---|
3.4 | 3.4 | 3.4 | 3.4 | init | |
3.5 | 3.4 | 3.4 | 3.4 | Operator upgrades member M1 | |
3.5 | 3.4 | 3.4 | 3.4 | 3.5 | Leader detects upgrade and sets version target. This prevents operator from initializing downgrade when cluster is in mixed state. |
3.5 | 3.5 | 3.4 | 3.4 | 3.5 | Operator upgrades member M2 |
3.5 | 3.5 | 3.4 | 3.5 | 3.4 | Based on majority leader upgrades cluster version by swapping it with secondary version. This way clusters we can prevent starting upgrade to v3.6 before upgrading M3 member. |
3.5 | 3.5 | 3.5 | 3.5 | 3.4 | Operator upgrades member M3 |
3.5 | 3.5 | 3.5 | 3.5 | Leader notices that there are no remaining members with 3.4 version and clears up secondary version, opening cluster for further upgrade |
This idea still needs some thought, but I think it will work as it introduces symmetrical mechanism for preventing large version drift from only upgrade to also work for downgrade. Decision if member can join is done by a member itself and it can make this decision only based on it's local state (cluster version, downgrade target/secondary version, binary version). To guarantee that member version difference in cluster is max <=1 that also work for downgrades we need to store at two values (min and max). No matter what names of versions we use (secondary version, target version etc), I think we need to use two values.
Please let me know if this goes in good direction or we should go back to original design. I would be happy to put it in a document where we can continue the discussions. No matter which approach we take I think we need this document to make the downgrade process understood by everyone.
@serathius Thank you for the proposals. I think proposal 2 sounds good but I have a question. If we introduce upgrade policy(I think it should be released in v3.6) , is the backend capable with previous version(v3.5)? If not, how do we upgrade from 3.5 to 3.6? The reason I have this concern is that when we add downgrade in 3.5, v3.5 cannot downgrade to v3.4 because backend incapability. I am not sure will you proposal have the similar problem?
In this discussion I focused on the control logic for coordinating zero-downtime downgrades and upgrades. This control logic should enforce invariant that upgrades should not be done in by more than one version. This works with assumption that backed is +/-1 version compatible and backend can be downgraded/updated by just running Etcd server on it.
Introducing backend compatibility guarantees that are clearly defined, tested and tooled is definitely required to make the downgrade/upgrade process reliable. I will propose something here too.
@YoyinZyc Thanks for bringing my attention to backend or etcd data backward compatibility. After giving thing a thought I found that problem is big enough that it deserves it's own dedicated design.
I have wrote down the design for etcd data versioning. It proposes to introduces idea of Storage version and proposes some restrictions to ensure that older etcd binaries can safety load data during downgrade without breaking Raft consensus requirements or causing unexpected behaviors. Please take a look https://docs.google.com/document/d/1yD0GDkxqWBPAax6jLZ97clwAz2Gp0Gux6xaTrtJ6wHE/edit?usp=sharing
cc @ptabor @lilic @YoyinZyc @jingyih @hexfusion Please let me know what you think
Created dedicated issue to track progress of implementation. https://github.com/etcd-io/etcd/issues/13168
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Previous downgrade related PR(#11362, #11492) are both too big to review. Let's open this issue to track downgrade implementation progress this time. I'll break it down into small tasks and get them merged step by step. Design doc: https://docs.google.com/document/d/1mSihXRJz8ROhXf4r5WrBGc8aka-b8HKjo2VDllOc6ac/edit#heading=h.u32uhgogqjq2
TODO (EDIT by @serathius): Server side:
cc @jingyih @gyuho @wenjiaswe @jpbetz @lavalamp