etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Investigate why MemberReplace failpoint flakes on release-3.4 #18929

Open serathius opened 1 day ago

serathius commented 1 day ago

Bug report criteria

What happened?

In the last robustness meeting we identified 3 flakes for the MemberReplace failpoint.

All of them happened on release-3.4, in the TestRobustnessExploratory/KubernetesHighTraffic/ClusterOfSize3/MemberReplace test.

What did you expect to happen?

The issue not to be specific to release-3.4.

How can we reproduce it (as minimally and precisely as possible)?

There is no way to select failpoints by test name, but you can modify allFailpoints in test/robustness/failpoint/failpoint.go so that it contains only MemberReplace.

Then run it with GO_TEST_FLAGS='-v --run TestRobustnessExploratory/KubernetesHighTraffic/ClusterOfSize3 --count 100 --failfast --timeout 1h' make test-robustness-release-3.4

Anything else we need to know?

No response

Etcd version (please run commands below)

release-3.4 branch

Etcd configuration (command line flags or environment variables)

# paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints= endpoint status -w table
# paste output here
```

Relevant log output

No response

joshuazh-x commented 1 day ago

I can take a look at this.

joshuazh-x commented 1 day ago

Without PR #11639, MemberList returns the local membership configuration without a linearizability guarantee, so the removed member may still show up in the member list response. The issue is fixed in 3.5 and above, so it should be specific to 3.4.

Release 3.4: https://github.com/etcd-io/etcd/blob/435ac802b83105be69faa18a931b13a183f1deb1/etcdserver/api/v3rpc/member.go#L90-L93

Release 3.5: https://github.com/etcd-io/etcd/blob/601a8847397b3972fec3a6b9caa17a6cde29ad59/server/etcdserver/api/v3rpc/member.go#L90-L98
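To make the stale-read scenario concrete, here is a toy model in Go. It is not etcd code: `cluster`, `node`, `applyPending` and `memberList` are hypothetical names, and the "read barrier" is simulated by applying pending entries synchronously. It only illustrates why a member list served from local state can still contain a member whose removal has already been committed.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for the replicated log: it records
// the names of members whose removal has been committed.
type cluster struct {
	committed []string
}

// node is a toy model of one member. Its local membership view only
// reflects the committed entries it has applied so far.
type node struct {
	c       *cluster
	applied int      // number of committed entries applied locally
	members []string // local membership view
}

// applyPending applies every committed removal this node has not applied yet.
func (n *node) applyPending() {
	for ; n.applied < len(n.c.committed); n.applied++ {
		removed := n.c.committed[n.applied]
		kept := make([]string, 0, len(n.members))
		for _, m := range n.members {
			if m != removed {
				kept = append(kept, m)
			}
		}
		n.members = kept
	}
}

// memberList returns the local view. With linearizable=true the node first
// catches up with the committed log (a simulated read barrier); with
// linearizable=false it answers from whatever it has applied so far.
func (n *node) memberList(linearizable bool) []string {
	if linearizable {
		n.applyPending()
	}
	return append([]string(nil), n.members...)
}

func main() {
	c := &cluster{}
	n := &node{c: c, members: []string{"m1", "m2", "m3"}}

	// The removal of m3 is committed, but this node has not applied it yet.
	c.committed = append(c.committed, "m3")

	fmt.Println(n.memberList(false)) // [m1 m2 m3]
	fmt.Println(n.memberList(true))  // [m1 m2]
}
```

In this model the non-linearizable call corresponds to the behavior described for 3.4, and the linearizable call to the behavior described for 3.5 and above.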

ahrtr commented 1 hour ago

> Without PR #11639, MemberList returns local membership configuration without linearizable guarantee. The removed member may show up in the member list response.

Thanks for the analysis. One workaround for 3.4 is to issue a linearizable read request in between (before the MemberList call).
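A rough sketch of what that workaround could look like on the client side is below. The `barrierClient` interface is a hypothetical, simplified stand-in for the etcd client (the real `clientv3` methods return response structs and take options), and `fakeClient` exists only to make the example runnable. The sketch also assumes that both requests reach the same member, since the read only forces that member to catch up.

```go
package main

import (
	"context"
	"fmt"
)

// barrierClient is a hypothetical, simplified subset of an etcd client.
type barrierClient interface {
	// Get is assumed to be a linearizable read (the clientv3 default
	// unless a serializable option is passed).
	Get(ctx context.Context, key string) error
	MemberList(ctx context.Context) ([]string, error)
}

// memberListAfterRead issues a linearizable read first, so the serving
// member has caught up before it answers MemberList from local state.
func memberListAfterRead(ctx context.Context, c barrierClient) ([]string, error) {
	if err := c.Get(ctx, "barrier-key"); err != nil { // key name is arbitrary
		return nil, fmt.Errorf("linearizable read: %w", err)
	}
	return c.MemberList(ctx)
}

// fakeClient simulates a member whose local view is stale until it has
// served a linearizable read.
type fakeClient struct {
	stale, fresh []string
	caughtUp     bool
}

func (f *fakeClient) Get(ctx context.Context, key string) error {
	f.caughtUp = true
	return nil
}

func (f *fakeClient) MemberList(ctx context.Context) ([]string, error) {
	if f.caughtUp {
		return f.fresh, nil
	}
	return f.stale, nil
}

// runDemo returns the member list without and with the preceding read.
func runDemo() (before, after []string) {
	ctx := context.Background()
	f := &fakeClient{
		stale: []string{"m1", "m2", "m3"},
		fresh: []string{"m1", "m2"},
	}
	before, _ = f.MemberList(ctx)
	after, _ = memberListAfterRead(ctx, f)
	return before, after
}

func main() {
	before, after := runDemo()
	fmt.Println(before) // [m1 m2 m3]
	fmt.Println(after)  // [m1 m2]
}
```

Whether this is enough for the robustness test depends on how the test client picks endpoints, which this sketch does not model.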