etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.69k stars 9.75k forks source link

Plans for v3.4.20 release #14232

Closed ahrtr closed 2 years ago

ahrtr commented 2 years ago

We intentionally skipped back-porting some bug fixes in 3.4.19 in order to minimize the impact. Since 3.4.19 is already released and validated to be working well, so we can consider to release 3.4.20 to backport some PRs.

Pipelines/test

There are still some flaky test failures from time to time. As previous release, any fix to stabilize the pipeline is welcome.

Issues/PRs that should be included in 3.4.20

Issues/PRs not considered in 3.4.20

Please feel free to chime in if you think any PR/issues need to be investigated or included in 3.4.20. Please also feel free to let us know if anyone has any concerns or comment. cc @ptabor @serathius @spzala @hexfusion @mitake @lavacat @endocrimes @dims @liggitt @neolit123

chaochn47 commented 2 years ago

Nice to backport: https://github.com/etcd-io/etcd/pull/12335

ahrtr commented 2 years ago

Nice to backport: #12335

Thanks for the feedback. I just had a quick review on the PR, and it's a minor change and should be safe to backport. Please feel free to deliver a PR for this.

vivekpatani commented 2 years ago

@ahrtr https://github.com/etcd-io/etcd/pull/14241

tangcong commented 2 years ago

@ahrtr #14169

ahrtr commented 2 years ago

@ahrtr #14169

Thanks for the feedback. I agree the PR can improve the performance/throughput, but it also added a new flag --max-concurrent-streams which usually we shouldn't backport. We need to get consensus on this one. Please kindly provide your opinion (thumb up or down?). thx.

Please read https://github.com/etcd-io/etcd/pull/14169#discussion_r917154243 if you want to quickly understand the context of the PR.

Personally, I incline to backporting it to 3.4.

tangcong commented 2 years ago

thanks for the feedback. it is actually a bug, pr #8238 only solves a part of the problem (etcd does not enable tls), we have encountered a lot of user feedback (users using APISIX, accessing etcd cluster users through HTTP 1.1x protocol). @ahrtr

ahrtr commented 2 years ago

I added 14169 into the list, please let me know if anyone have any concerns. @serathius @spzala @ptabor

ahrtr commented 2 years ago

I pinged the original author of PR 12846, but got no response so far. Is anyone interested in backporting it to 3.4? thx

vivekpatani commented 2 years ago

@ahrtr https://github.com/etcd-io/etcd/pull/14246 - have backported the requested PR. I maybe lacking understanding, but from the looks of it, this function isn't called anywhere (even in the original PR), am I doing this correctly? Thanks. @ptabor @spzala @serathius

ahrtr commented 2 years ago

@ahrtr #14246 - have backported the requested PR. I maybe lacking understanding, but from the looks of it, this function isn't called anywhere (even in the original PR), am I doing this correctly? Thanks. @ptabor @spzala @serathius

Thanks @vivekpatani for the PR. Are you talking about RemoveMatchFile? It's supposed to be only called on etcd startup.

vivekpatani commented 2 years ago

@ahrtr I think I had some misunderstanding, the PR is now fixed/mergeable, please feel free to review when you have a minute, thank you!

chaochn47 commented 2 years ago

https://github.com/etcd-io/etcd/pull/13435/? @ahrtr This PR is safe to backport in my opinion.

ahrtr commented 2 years ago

13435? @ahrtr This PR is safe to backport in my opinion.

Agreed. Please feel free to deliver a PR, thx.

mborsz commented 2 years ago

Can we also consider backporting PR https://github.com/etcd-io/etcd/pull/12871? It seems be be quite safe to backport change and it will both reduce memory consumption and reduce work in latency sensitive applierV3backend.Apply.

ahrtr commented 2 years ago

Can we also consider backporting PR #12871? It seems be be quite safe to backport change and it will both reduce memory consumption and reduce work in latency sensitive applierV3backend.Apply.

It should have already been backported in 12888

serathius commented 2 years ago

@ahrtr https://github.com/etcd-io/etcd/pull/12762 is fix for https://github.com/etcd-io/etcd/issues/12680, in the end the solution was done in one, but three PRs:

As all of them change how pending requests are handled during leader election I would prefer to merge all of them to avoid this logic to diverge between branches. Partial backport means that we could end up not really fixing it.

ramses commented 2 years ago

~Hi @ahrtr ,~

~I have the backport for #13435 ready. However, I currently do not have permission to push a branch to etcd-io/etcd.git.~

endocrimes commented 2 years ago

@ramses you'll need to make a "fork" on GitHub, then push to that, and open the PR across repos. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

ramses commented 2 years ago

Thanks @endocrimes , realized that immediately after :-)

The PR for #13435: https://github.com/etcd-io/etcd/pull/14254

ahrtr commented 2 years ago

@ahrtr #12762 is fix for #12680, in the end the solution was done in one, but three PRs:

* [raft: postpone MsgReadIndex until first commit in the term #12762](https://github.com/etcd-io/etcd/pull/12762)

* [etcdserver: resend ReadIndex request on empty apply request #12795](https://github.com/etcd-io/etcd/pull/12795)

* [Read index retry #12780](https://github.com/etcd-io/etcd/pull/12780)

As all of them change how pending requests are handled during leader election I would prefer to merge all of them to avoid this logic to diverge between branches. Partial backport means that we could end up not really fixing it.

Thanks @serathius for the reminder. I will take care of all of them myself later.

ahrtr commented 2 years ago

I saw the following errors multiple times in 3.4 pipeline, but never get time to have a deep dive,

Any help is welcome! Thanks!

ahrtr commented 2 years ago

Although there are still some issues which need further investigation, we have already finished backporting all existing PRs.

@endocrimes Do you have bandwidth to run Jepsen test on release-3.4?

endocrimes commented 2 years ago

@ahrtr sure šŸ‘

ahrtr commented 2 years ago

I may release 3.4.20 sometime next week when all remaining tasks(see below) are done,

  1. The failures discovered in 3.4 pipeline are fixed. Only https://github.com/etcd-io/etcd/issues/14256 (@ramses @SimFG) is a blocker to me.
  2. All tests, including Jepsen test (@endocrimes ) and manual test in my local environment (@ahrtr ), are done. We need to think about how to enhance the release process in future, especially how to qualify our test process/utilities.

@serathius @spzala @ptabor @hexfusion Please kindly let me know if you have any comments or concerns.

spzala commented 2 years ago

@ahrtr thank you! No concerns, and sounds great.

ahrtr commented 2 years ago

@ahrtr sure šŸ‘

@endocrimes any update on the Jepsen test on release-3.4?

ahrtr commented 2 years ago

https://github.com/etcd-io/etcd/issues/14256 isn't a blocker any more, please see my comment https://github.com/etcd-io/etcd/issues/14256#issuecomment-1202083883

endocrimes commented 2 years ago

@ahrtr running them now :)

vivekpatani commented 2 years ago

Hi @ahrtr @lavacat @serathius found one last one that may need to be back ported? https://github.com/etcd-io/etcd/pull/13824 Please let me know if that is the case. Thanks.

ahrtr commented 2 years ago

Hi @ahrtr @lavacat @serathius found one last one that may need to be back ported? #13824 Please let me know if that is the case. Thanks.

Thanks @vivekpatani .

etcd 3.5.4 (and main) is using DialContext, while etcd 3.4.19 (and release-3.4 latest code) is using Dial. So release-3.4 doesn't have the issue https://github.com/etcd-io/etcd/issues/13810.

I agree DialContext is better than Dial because Dial has already been deprecated. But Let's just keep it as it's, because we should only cherry pick bug fixes into 3.4.

ahrtr commented 2 years ago

Talked to @endocrimes , no any new issue found during the Jepsen test. Will release 3.4.20 soon.

ahrtr commented 2 years ago

3.4.20 is just released, https://github.com/etcd-io/etcd/releases/tag/v3.4.20 .

Thanks everyone in the CC list. It's really the result of great team work!

Thanks again !!!

vivekpatani commented 2 years ago

Thanks for doing this.

etcd 3.5.4 (and main) is using DialContext, while etcd 3.4.19...

etcd-3.5 latest release is not using DialContext or am I missing something?

@ahrtr

ahrtr commented 2 years ago

@vivekpatani The reason is the PR https://github.com/etcd-io/etcd/pull/13824 has been cherry picked to release-3.5 yet. Please feel free to deliver a PR for that. thx

In release-3.5, DialContext is used in the transport, but Dial is used in the ValidateSecureEndpoints.

In release-3.4, both places are using Dial.