etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Election.Observe may report the wrong leader #18163

Open jcferretti opened 2 weeks ago

jcferretti commented 2 weeks ago

Bug report criteria

What happened?

The implementation of observe does a get on the election key (prefix) https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L178 and if that get returns zero kvs, it does a watch: https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L190

The watch may return multiple PUT events, in which case the first one is returned (note the break statement in the if body matching a PUT event type): https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L201

It seems to me that returning the first PUT event can be wrong if there are multiple PUTs in the returned wr.Events, unless I am misunderstanding this (which is possible, since there is a fair amount of complexity here). So please bear with me, and help me understand if this is not actually broken.

I think that instead of returning the first one, the PUT event with the lowest create revision should be returned; otherwise we risk returning the wrong leader. Returning the event with the lowest create revision would be consistent with what is done when the initial get is not empty: in that case, the first KV in the get range is returned https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L210

since the get was done with options for ascending order of creation: https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L178 https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/op.go#L462
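The proposed selection could be sketched as follows. This is a minimal, self-contained sketch with simplified stand-in types (the real code would operate on mvccpb.Event values inside observe()), not the actual etcd implementation:

```go
package main

import "fmt"

// Simplified stand-ins for the mvccpb types; the real code would use
// go.etcd.io/etcd/api/v3/mvccpb. This is a sketch of the proposed fix,
// not the actual etcd implementation.
type EventType int

const (
	PUT EventType = iota
	DELETE
)

type KeyValue struct {
	Key            string
	CreateRevision int64
}

type Event struct {
	Type EventType
	Kv   *KeyValue
}

// lowestCreateRevPut picks the PUT event whose key has the lowest
// CreateRevision, mirroring the ascending-creation order used by the
// initial Get. It returns nil when the batch contains no PUT.
func lowestCreateRevPut(events []*Event) *KeyValue {
	var kv *KeyValue
	for _, ev := range events {
		if ev.Type != PUT {
			continue
		}
		if kv == nil || ev.Kv.CreateRevision < kv.CreateRevision {
			kv = ev.Kv
		}
	}
	return kv
}

func main() {
	// Two candidates delivered in one watch response: the oldest one
	// (lowest create revision) is the leader, regardless of slice order.
	events := []*Event{
		{Type: PUT, Kv: &KeyValue{Key: "hello/candidate2", CreateRevision: 3}},
		{Type: PUT, Kv: &KeyValue{Key: "hello/candidate1", CreateRevision: 2}},
	}
	fmt.Println(lowestCreateRevPut(events).Key) // hello/candidate1
}
```

The key names here are hypothetical; the point is only that scanning the whole batch by CreateRevision matches the WithFirstCreate ordering of the non-empty-get path.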

The only way I could see that ignoring every PUT but the first is not an issue is if there were a guarantee that there will never be more than one PUT event in the returned wr.Events at that point in the code. I don't believe that is guaranteed: watches are eventually consistent, and there can be a network issue between the first get and the subsequent watch. If at the time of the first get there is no election leader (I don't mean the etcd cluster leader; I mean an elected candidate in the election API), and then a transient network error occurs, a retry of the watch may succeed at a time when several candidates are already registered for the election, which I believe would make the watch return multiple PUTs.

What did you expect to happen?

N/A

How can we reproduce it (as minimally and precisely as possible)?

N/A

Anything else we need to know?

N/A

Etcd version (please run commands below)

3.5.12

Etcd configuration (command line flags or environment variables)

N/A

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

No response

ahrtr commented 1 week ago

I recommend reading:

jcferretti commented 1 week ago

Can you please confirm that you read this part of my issue write-up:

The only way I could see that ignoring every PUT but the first is not an issue is if there were a guarantee that there will never be more than one PUT event in the returned wr.Events at that point in the code. I don't believe that is guaranteed: watches are eventually consistent, and there can be a network issue between the first get and the subsequent watch. If at the time of the first get there is no election leader

and confirm whether there is a guarantee that the code as it exists today can't see more than one PUT. Since watches are eventually consistent, and there can be a network issue between the first get returning empty and the watch on the election prefix, I still think this is a problem.

jcferretti commented 1 week ago

Let me try to be very explicit about the potential problem I see. Perhaps this makes it easier to point out what part of the sequence of steps below is wrong.

  1. The initial GET with the campaign prefix returns empty at revision X
  2. The watch for the campaign prefix at revision (X+1) is started.
  3. A network issue prevents etcd client from talking to etcd server for a window of time T
  4. During T, campaign candidate 1 shows up and registers its PUT at revision X+1
  5. During T, campaign candidate 2 shows up after candidate 1 and registers its PUT at revision X+2
  6. The network issue between the etcd client and the etcd server is resolved; the combination of the underlying gRPC infrastructure and etcd client-side code reconnects and retries the watch, asking for updates from (X+1) onward. The watch returns all events at or after X+1, which includes up to X+2 (and potentially anything after, but that's irrelevant), i.e. the PUTs for candidates 1 and 2.

After reading https://etcd.io/docs/v3.5/learning/api_guarantees/#watch-apis carefully, I don't see a guarantee that the events for (X+1) and (X+2) can't be collapsed.

To guarantee that (X+1) and (X+2) show up in different updates to the watch, the etcd server would have to be able to replay the full history in the exact sequence it happened at the server. You seem to imply that a watch guarantees exactly that: full history replay, one revision at a time (one revision per watch notification). Is that the guarantee? It is not explicit in the list of guarantees at https://etcd.io/docs/v3.5/learning/api_guarantees/#watch-apis

jcferretti commented 1 week ago

Also, to provide some context as to why I care: my team and I have tried to use the concurrency API from Java. We ran into issues that we reported earlier:

https://github.com/etcd-io/etcd/issues/17623

which, as we then realized, were already known; you responded to our issue report, thanks, and provided a link to an earlier related report:

https://github.com/etcd-io/etcd/issues/17502

You mentioned in your comment on the first report above that the concurrency API is known to be buggy and that we shouldn't use it, so we are re-implementing our Java code to do campaign/observe/proclaim/resign client side; our plan is to use it ourselves first and then donate the code to jetcd. As part of understanding and translating the Go code, we believe we found this issue.

If it is not an issue and there is a guarantee that the watcher will never return more than one event in this case, it would be good to understand why. We have been trying to use the master election features in etcd from Java for a while and have run into several issues; we have invested effort here well beyond our initial expectations and plans. We are grateful for the etcd infrastructure, and we have made it a critical component of our own system, outsourcing our system's fault tolerance to it (which makes enormous sense, since we don't want to reinvent the fault-tolerance wheel here). But these master election APIs appear to be among the less commonly used parts of etcd, and we are a bit uncertain to what level we can trust them; the history of bug reports here seems to validate some of those concerns.

ahrtr commented 1 week ago
  • The initial GET with the campaign prefix returns empty at revision X
  • The watch for the campaign prefix at revision (X+1) is started.
  • A network issue prevents etcd client from talking to etcd server for a window of time T
  • During T, campaign candidate 1 shows up and registers its PUT at revision X+1
  • During T, campaign candidate 2 shows up after candidate 1 and registers its PUT at revision X+2

Eventually I got your point. It can easily be reproduced with a change something like the one below:

$ git diff
diff --git a/client/v3/concurrency/election.go b/client/v3/concurrency/election.go
index ac1303dd8..9193a6e1c 100644
--- a/client/v3/concurrency/election.go
+++ b/client/v3/concurrency/election.go
@@ -18,6 +18,7 @@ import (
        "context"
        "errors"
        "fmt"
+       "time"

        pb "go.etcd.io/etcd/api/v3/etcdserverpb"
        "go.etcd.io/etcd/api/v3/mvccpb"
@@ -180,6 +181,9 @@ func (e *Election) observe(ctx context.Context, ch chan<- v3.GetResponse) {
                        return
                }

+               fmt.Println("Sleeping 20 seconds")
+               time.Sleep(20 * time.Second)
+
                var kv *mvccpb.KeyValue
                var hdr *pb.ResponseHeader

@@ -195,6 +199,11 @@ func (e *Election) observe(ctx context.Context, ch chan<- v3.GetResponse) {
                                        return
                                }
                                // only accept puts; a delete will make observe() spin
+                               fmt.Printf("###### len(wr.Events) = %d\n", len(wr.Events))
+                               for i, ev := range wr.Events {
+                                       fmt.Printf("###### %d, %+v\n", i, ev)
+                               }
+
                                for _, ev := range wr.Events {
                                        if ev.Type == mvccpb.PUT {
                                                hdr, kv = &wr.Header, ev.Kv

Rough test steps:

Around 20 seconds later, I got output like the below in the first terminal:

Sleeping 20 seconds
###### len(wr.Events) = 5
###### 0, &{Type:PUT Kv:key:"hello/694d9026db986209" create_revision:2 mod_revision:2 version:1 value:"v1" lease:7587879443756507657  PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 1, &{Type:PUT Kv:key:"hello/694d9026db98620f" create_revision:3 mod_revision:3 version:1 value:"v2" lease:7587879443756507663  PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 2, &{Type:DELETE Kv:key:"hello/694d9026db986209" mod_revision:4  PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 3, &{Type:DELETE Kv:key:"hello/694d9026db98620f" mod_revision:5  PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 4, &{Type:PUT Kv:key:"hello/694d9026db986218" create_revision:6 mod_revision:6 version:1 value:"v3" lease:7587879443756507672  PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
hello/694d9026db986209
v1
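The batch above illustrates the hazard: the current loop breaks on the first PUT (`v1`), even though later events in the same batch delete that key, so the reported leader is already gone. A minimal, self-contained sketch replaying that batch (with simplified stand-in types as an assumption; the real code operates on mvccpb.Event inside observe()):

```go
package main

import "fmt"

// Simplified stand-ins for the mvccpb types (an assumption for this
// sketch; not the actual etcd implementation).
type EventType int

const (
	PUT EventType = iota
	DELETE
)

type Event struct {
	Type           EventType
	Key            string
	CreateRevision int64
}

// firstPut mimics the current observe() loop, which breaks on the
// first PUT event it finds in the batch.
func firstPut(events []Event) (string, bool) {
	for _, ev := range events {
		if ev.Type == PUT {
			return ev.Key, true
		}
	}
	return "", false
}

// liveKeys replays the whole batch and keeps only the keys that
// survive it, i.e. keys not deleted by a later event in the same batch.
func liveKeys(events []Event) map[string]bool {
	live := map[string]bool{}
	for _, ev := range events {
		switch ev.Type {
		case PUT:
			live[ev.Key] = true
		case DELETE:
			delete(live, ev.Key)
		}
	}
	return live
}

func main() {
	// The batch from the log output above.
	batch := []Event{
		{PUT, "hello/694d9026db986209", 2},
		{PUT, "hello/694d9026db98620f", 3},
		{DELETE, "hello/694d9026db986209", 0},
		{DELETE, "hello/694d9026db98620f", 0},
		{PUT, "hello/694d9026db986218", 6},
	}
	picked, _ := firstPut(batch)
	fmt.Println("observe() reports:", picked)           // hello/694d9026db986209
	fmt.Println("still live:", liveKeys(batch)[picked]) // false
}
```

Note that in this batch the first PUT also happens to have the lowest create revision, so any fix likely needs to account for deletes within the batch as well, not just pick the lowest-create-revision PUT.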

Unfortunately I do not see any document that clearly describes the expected behavior. I think we should get that clarified first.

jcferretti commented 1 week ago

Thanks for your investment of time here, I appreciate it.

I believe a higher-level way to describe my worry is simply: "watches would have to be linearizable for the existing code to work." And it seems to me they can't be, for performance reasons (which makes sense). The following may support that case, in that they imply watch is only sequential/serializable, not linearizable:

https://github.com/etcd-io/etcd/issues/18141 https://github.com/etcd-io/etcd/pull/17918#discussion_r1593799944