etcd-io / etcd


Plans for v3.5.3 release #13894

Closed serathius closed 2 years ago

serathius commented 2 years ago

Same as the previous release (https://github.com/etcd-io/etcd/issues/13518): let's agree on what work should be included in the release and track it. The list below will evolve as the community decides to cut or include fixes.

Issues/PRs that should be included for v3.5.3:

Issues/PRs not required for v3.5.3:

Please feel free to suggest additional backports/issues worth investigating so we can discuss whether they should be included.

serathius commented 2 years ago

cc @ptabor @ahrtr @spzala

mrueg commented 2 years ago

@serathius not sure if you want to add https://github.com/etcd-io/etcd/pull/13895 as well.

ahrtr commented 2 years ago

I think we need 13838 as well, because it will make the reproduction of the data inconsistency issue easier.

serathius commented 2 years ago

Hey @ahrtr, it would be great to have an automatic way to reproduce this issue; however, it is not required for the immediate release. https://github.com/etcd-io/etcd/pull/13838 includes changes that might negatively impact the overall stability of the functional tests, making it hard to merge. I don't plan to make this code mergeable, to avoid investing too much time in functional testing, as we might want to change our overall approach to correctness testing after the data inconsistency issue.

The data inconsistency issue should be fairly well tested by https://github.com/etcd-io/etcd/pull/13838, but I don't see merging this change as blocking for v3.5.3.

riyazcom commented 2 years ago

@serathius is it possible to include #13460 "Support multiple values for allowed client and peer TLS identities" ?

serathius commented 2 years ago

@riyazcom Even though not supporting multiple client & peer TLS identities is a limitation of etcd, it is not a bug, and adding that support would basically be an additional feature. Based on that, I don't think we should backport it.
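For context only, here is a minimal hypothetical Go sketch (not etcd code; the `allowedCNs` set and the helper are invented for illustration) of what checking a client certificate's Common Name against multiple allowed identities could look like, as opposed to the single value accepted today by options such as `--peer-cert-allowed-cn`:

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// allowedCNs is a hypothetical allow-list; today the allowed-CN style options
// accept a single value, which is the limitation discussed in this thread.
var allowedCNs = map[string]struct{}{
	"client-a": {},
	"client-b": {},
}

// cnAllowed reports whether the certificate's Common Name is in the allow-list.
func cnAllowed(cert *x509.Certificate) bool {
	_, ok := allowedCNs[cert.Subject.CommonName]
	return ok
}

func main() {
	cert := &x509.Certificate{}
	cert.Subject.CommonName = "client-a"
	fmt.Println(cnAllowed(cert)) // true
}
```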

jayunit100 commented 2 years ago

What is the manual test plan for reproducing this? Is there a k8s-centric test we can run? I was doing a combination of running e2es while rebooting nodes, and didn't see this occur. I think it requires unrestricted API QoS, constrained etcd memory, and a few other tricky knobs.

serathius commented 2 years ago

@jayunit100 please read the original issue https://github.com/etcd-io/etcd/issues/13766#issuecomment-1078897588; it includes a description of the reproduction and a link to the PR with the code, https://github.com/etcd-io/etcd/pull/13838.

Let's continue the discussion there.

spzala commented 2 years ago

@serathius https://github.com/etcd-io/etcd/issues/13692 doesn't seem to be required for 3.5.3. I am adding triage comments there.

ahrtr commented 2 years ago

13666 is just a question rather than an issue. I already provided an answer and closed it. It definitely isn't required for 3.5.3.

ahrtr commented 2 years ago

Please also remove 13690 from the list per the discussion in the PR.

ahrtr commented 2 years ago

Just closed 13848 with comment 13848#issuecomment-1093449532.

yuyangbj commented 2 years ago

Hi etcd maintainers,

The community has alerted customers not to use etcd 3.5.[0-2] in production environments due to the data inconsistency issue. Is there any plan to speed up the 3.5.3 release, considering some products are already blocked by this issue?

serathius commented 2 years ago

@yuyangbj Thanks for your interest in the v3.5.3 release. Apart from just "fixing" the issue, we also need to make sure we do proper due diligence to test and qualify the release. We would really not want to make a mistake here and be forced to send another alert. Unfortunately this takes time, and with only 3-4 people working on the release it will take us a while.

The data inconsistency issue was non-trivial and it took us a long time to figure out a proper fix. Fortunately we finally managed to merge the fix last Friday, so we can start qualifying the release and decide to skip some of the minor issues proposed here.

The best thing I can recommend for products blocked on the release is to stop waiting passively and get involved in the release process. We need more people to test, document and qualify the release.

dims commented 2 years ago

@serathius Apologies!!

serathius commented 2 years ago

A new issue might impact the v3.5.3 release. Could someone please take a look: https://github.com/etcd-io/etcd/issues/13922

endocrimes commented 2 years ago

https://github.com/etcd-io/etcd/issues/13846 shouldn't be a critical release blocker - but here's a PR https://github.com/etcd-io/etcd/pull/13923

serathius commented 2 years ago

Thanks @endocrimes, I agree it's not a critical release blocker; however, for now let's assume it will go in, as qualification will take some time.

endocrimes commented 2 years ago

@serathius 👍 - what's necessary for qualification? (not sure if there's a test process that gets followed or if setting up some soak/scale tests out-of-tree would be helpful for example)

serathius commented 2 years ago

That's a great question. Apparently the v3.5 release didn't follow the same qualification as previous releases (https://github.com/etcd-io/etcd/issues/13839#issuecomment-1079823211). With maintainer rotation we lost knowledge about processes that were previously executed manually, and there is no mention of qualification testing in https://etcd.io/docs/v3.3/dev-internal/release/.

It would be great to get an idea of what kind of tests were done before. @xiang90 @jpbetz any recommendations about the testing?

serathius commented 2 years ago

In addition to the qualification mentioned before, it would be good to confirm that the two data inconsistency issues we found were fixed. Instructions for their reproduction:

dims commented 2 years ago

@serathius Have we in the past had some method to try k8s (presubmit?) testing with a branch of etcd? (Curious; if so, maybe we could try that?)

serathius commented 2 years ago

> @serathius Have we in the past had some method to try k8s (presubmit?) testing with a branch of etcd? (Curious; if so, maybe we could try that?)

We have been running v3.5.0 in K8s presubmits for a long time (we still are); however, they didn't find any of the issues we fixed for v3.5.3, so I don't expect them to be a good qualification method.

endocrimes commented 2 years ago

@serathius I'm also bumping the jepsen etcd tests for 3.5.x (starting by running them against 3.4.x and 3.5.2 as a baseline).

I don't think those tests run at a high enough scale to catch the corruption issues we hit here, but they could be a nice thing to get running semi-continuously somewhere.

yuyangbj commented 2 years ago

> @yuyangbj Thanks for your interest in the v3.5.3 release. Apart from just "fixing" the issue, we also need to make sure we do proper due diligence to test and qualify the release. We would really not want to make a mistake here and be forced to send another alert. Unfortunately this takes time, and with only 3-4 people working on the release it will take us a while.
>
> The data inconsistency issue was non-trivial and it took us a long time to figure out a proper fix. Fortunately we finally managed to merge the fix last Friday, so we can start qualifying the release and decide to skip some of the minor issues proposed here.
>
> The best thing I can recommend for products blocked on the release is to stop waiting passively and get involved in the release process. We need more people to test, document and qualify the release.

Thanks Marek @serathius for the prompt response. I understand everything needs complete testing. I do not want to push the release, so I am sorry if I gave that impression. Our downstream team is actively involved in reproducing the data inconsistency issue; we will post an update here once we have test results.

ahrtr commented 2 years ago

> A new issue might impact the v3.5.3 release. Could someone please take a look: #13922

Just resolved in pull/13930; it should be a legacy issue (or an old regression) rather than a new regression in v3.5.

serathius commented 2 years ago

With the last data inconsistency issue fixed, we can focus on qualification and cutting the scope for the release. Moving https://github.com/etcd-io/etcd/pull/13862 and https://github.com/etcd-io/etcd/issues/13846 out of the v3.5.3 scope, as they are not critical. The authors can continue to work on them; however, we should prioritize the release.

serathius commented 2 years ago

Current plan for qualification:

We would be happy to accept any other ideas to make sure the release is stable and no new data corruption issues are found.
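One hedged idea along these lines (a sketch only, not part of the agreed qualification plan; endpoints are assumptions): periodically compare the KV hash reported by each member through the clientv3 maintenance API and flag any divergence.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumed endpoints for a local 3-member test cluster.
	endpoints := []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Pin a single revision so every member hashes the same keyspace state.
	gresp, err := cli.Get(ctx, "health-check-key")
	if err != nil {
		log.Fatal(err)
	}
	rev := gresp.Header.Revision

	// Note: hashes are only directly comparable when the members share the
	// same compact revision; a production check would verify that as well.
	hashes := make(map[uint32][]string)
	for _, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, rev)
		if err != nil {
			log.Fatalf("HashKV %s: %v", ep, err)
		}
		hashes[resp.Hash] = append(hashes[resp.Hash], ep)
	}

	if len(hashes) > 1 {
		log.Fatalf("members disagree on KV hash at revision %d: %v", rev, hashes)
	}
	fmt.Printf("all members report the same KV hash at revision %d\n", rev)
}
```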

serathius commented 2 years ago

@shalinmangar in https://github.com/etcd-io/etcd/issues/13839#issuecomment-1085949492 you mentioned that you have experience with jepsen. It would be great if you can help @endocrimes as we cannot reproduce the data inconsistencies from v3.5.2 there.

serathius commented 2 years ago

Qualification work is almost finished; I would like to prepare to start the release process.

First, let's collect acks from the people involved: @ptabor @spzala @ahrtr @endocrimes. Please let me know if there is anything else that should be done before the release.

endocrimes commented 2 years ago

So current state of jepsen testing:

Future plan:


Update: 18:14

Watch and Lock tests failed on my first run - investigating whether it was a jepsen issue or etcd issue.

ptabor commented 2 years ago

That list looks complete to me. Thank you @serathius and @ahrtr.

I think that these items are below the bar (but calling them out explicitly):

@serathius: Out of curiosity, have you been running the repro test for several hours with patched 3.5.2, or just the O(10) min test?

endocrimes commented 2 years ago

Jepsen Watch test failures seem to be a jepsen issue, but I haven't run into those issues in any of the previous version runs, so I'm... curious.

Moving on to lock tests for now.

endocrimes commented 2 years ago

Locks (leases) seem to be consistently flaky - The lease ends up closing early, and thus the test fails. Gonna investigate further after dinner.
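For readers following along, here is a minimal sketch (endpoint and TTL are assumptions; this is not the Jepsen test itself) of what observing an unexpectedly early lease expiry with clientv3 can look like: grant a lease, keep it alive, and log when the keep-alive channel closes even though the context is still live.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumption: local test cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Grant a 10s lease and keep it alive; while keep-alives flow,
	// the lease should never expire on the server.
	grant, err := cli.Grant(ctx, 10)
	if err != nil {
		log.Fatal(err)
	}
	ka, err := cli.KeepAlive(ctx, grant.ID)
	if err != nil {
		log.Fatal(err)
	}

	start := time.Now()
	for resp := range ka {
		log.Printf("keep-alive ok, remaining TTL=%ds", resp.TTL)
	}
	// The channel closing while the context is still live means keep-alives
	// stopped (e.g. the lease expired or was revoked); when this happens
	// unexpectedly early it matches the flakiness described above.
	log.Printf("keep-alive channel closed after %v", time.Since(start))
}
```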

endocrimes commented 2 years ago

Gonna look into this with fresh eyes tomorrow - getting a headache - but it feels like potentially a clojure/client issue

dims commented 2 years ago

Sooo close!!

ahrtr commented 2 years ago

I think it's worth mentioning @serathius's PR pull/13927: running the load test against a mounted RAM filesystem might be an easier way to reproduce the data inconsistency issue. I will give it a try (see also the sketch after this comment).

> Locks (leases) seem to be consistently flaky - The lease ends up closing early, and thus the test fails. Gonna investigate further after dinner.

Please note that I recently fixed a lease-related issue (see pull/13932) which isn't included in 3.5.2. Do you still see the flakiness on release-3.5 (3.5.3)? If yes, please file an issue and let's discuss it there.

> submit & backport the data-inconsistency repro test

@serathius already submitted a PR for this, and I will review it more closely today and tomorrow.

> more defence in depth (e.g. verification that consistency index does not decrease)

Yes, it's a good point. The consistent_index should never decrease based on the current design/implementation, but the term might. I am not totally convinced that PR pull/13903 fixes issue issues/13514, because raft should reject a message with an old term.
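As a rough illustration of the RAM-filesystem idea mentioned at the top of this comment (a sketch only; the tmpfs path, endpoint, and key/value sizes are assumptions, and the actual reproduction lives in pull/13927):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// Assumed setup, done outside this program: mount a tmpfs (e.g. at
// /mnt/etcd-ram) and start etcd with --data-dir=/mnt/etcd-ram so that fsyncs
// are nearly free and the apply loop runs much faster than on a real disk.
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumption: local test cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	value := string(make([]byte, 1024)) // 1 KiB values; the size is arbitrary
	for i := 0; ; i++ {
		key := fmt.Sprintf("/load/%d", rand.Intn(10000)) // overwrite a bounded key space
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		if _, err := cli.Put(ctx, key, value); err != nil {
			// Errors are expected while members are being restarted or killed.
			log.Printf("put %s: %v", key, err)
		}
		cancel()
		if i%1000 == 0 {
			log.Printf("%d puts issued", i)
		}
	}
}
```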

spzala commented 2 years ago

> Qualification work is almost finished; I would like to prepare to start the release process.
>
> First, let's collect acks from the people involved: @ptabor @spzala @ahrtr @endocrimes. Please let me know if there is anything else that should be done before the release.

@serathius awesome, thanks so much! It looks good to me. If we are running more tests, that's great, but otherwise, from the positive feedback on the data inconsistency fix and the tests you and @ahrtr ran, +1 for release work. We can iterate over the release with any new findings, post-release feedback and test improvements under discussion.

serathius commented 2 years ago

> @serathius: Out of curiosity, have you been running the repro test for several hours with patched 3.5.2, or just the O(10) min test?

I ran:

serathius commented 2 years ago

> Yes, it's a good point. The consistent_index should never decrease based on the current design/implementation, but the term might. I am not totally convinced that PR pull/13903 fixes issue issues/13514, because raft should reject a message with an old term.

You are right that raft should never allow a message with an old term; however, https://github.com/etcd-io/etcd/issues/13514 covers a not-officially-supported downgrade flow that went around the downgrade check, and it showed us that the term in the db cannot be trusted in v3.5. The goal of https://github.com/etcd-io/etcd/pull/13903 was not to fix downgrades, but to remove the onlyGrow check, which turned out to have been incorrectly introduced during a refactor.
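To make the defence-in-depth point concrete, here is a hypothetical in-memory guard (illustrative only, not the actual etcd code; all names are invented) that refuses to let the applied consistent index move backwards while treating the term more cautiously:

```go
package main

import (
	"fmt"
	"sync"
)

// indexGuard is a hypothetical defence-in-depth check: per the current design
// the consistent index must never decrease, so any backwards move is treated
// as a bug or corruption and rejected loudly.
type indexGuard struct {
	mu              sync.Mutex
	consistentIndex uint64
	term            uint64
}

func (g *indexGuard) Update(index, term uint64) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if index < g.consistentIndex {
		return fmt.Errorf("consistent index moved backwards: %d -> %d", g.consistentIndex, index)
	}
	// The term is recorded but, as discussed above, it cannot be blindly
	// trusted across unsupported downgrade paths, so it is not enforced
	// the same way here.
	g.consistentIndex = index
	g.term = term
	return nil
}

func main() {
	g := &indexGuard{}
	fmt.Println(g.Update(10, 3)) // <nil>
	fmt.Println(g.Update(9, 3))  // consistent index moved backwards: 10 -> 9
}
```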

ahrtr commented 2 years ago

We might need to get this issue 13937 sorted out before releasing 3.5.3.

serathius commented 2 years ago

@ahrtr Is this a new problem in v3.5? Based on the original issue it seems to be present in v3.3. For v3.5.3 our goal should be addressing new issues that were introduced in v3.5, to ensure there are no regressions when upgrading from v3.4 to v3.5.

The fact that there is a long-standing issue is worrying; however, it should not block the v3.5.3 release.

shalinmangar commented 2 years ago

> @shalinmangar in #13839 (comment) you mentioned that you have experience with jepsen. It would be great if you can help @endocrimes as we cannot reproduce the data inconsistencies from v3.5.2 there.

Happy to help. I see @endocrimes has already made great progress. Is there an issue for Jepsen testing, or should we communicate over the etcd-dev mailing list?

endocrimes commented 2 years ago

@shalinmangar https://github.com/etcd-io/etcd/issues/13939

serathius commented 2 years ago

I talked with @endocrimes; the lease flakes should not impact the v3.5.3 release. With that, I will start the release process.

dims commented 2 years ago

w00t! sounds good @serathius

serathius commented 2 years ago

OK, after 4 hours, with breaks for restoring sanity and eating, I managed to successfully run the release script.

ahrtr commented 2 years ago

> @ahrtr Is this a new problem in v3.5? Based on the original issue it seems to be present in v3.3. For v3.5.3 our goal should be addressing new issues that were introduced in v3.5, to ensure there are no regressions when upgrading from v3.4 to v3.5.
>
> The fact that there is a long-standing issue is worrying; however, it should not block the v3.5.3 release.

My immediate feeling is that it's a legacy issue rather than a regression. I will take a closer look today. I don't think it's a blocker for 3.5.3 either.

ahrtr commented 2 years ago

> @ahrtr Is this a new problem in v3.5? Based on the original issue it seems to be present in v3.3. For v3.5.3 our goal should be addressing new issues that were introduced in v3.5, to ensure there are no regressions when upgrading from v3.4 to v3.5.
>
> The fact that there is a long-standing issue is worrying; however, it should not block the v3.5.3 release.

This turns out to be a regression, but the good news is it should not be a big problem. Please see the explanation in pull/13942.