Open jcferretti opened 2 weeks ago
Can you please confirm you did read this part of my issue writeup:
The only way I could see ignoring any other PUTs but the first not being an issue is if there was a guarantee that there will never be more than one PUT event in the returned wr.Events at that point in the code. I don't believe that is guaranteed, since watches are eventual consistent and there can be a network issue between the first get and this subsequent watch; if at the time of the first get there is no election leader
and that there is a guarantee that the code as it exists today can't see more than one PUT. Since watchers are eventual consistent, and there can be a network issue between the first get returning empty, and the watch to the election prefix, I still think this is a problem.
Let me try to be very explicit about the potential problem I see. Perhaps this makes it easier to point out what part of the sequence of steps below is wrong.
After reading https://etcd.io/docs/v3.5/learning/api_guarantees/#watch-apis carefully, I don't see a guarantee that the events for (X+1) and (X+2) can't be collapsed.
To me, to be able to guarantee that (X+1) and (X+2) will show up in different updates to the watch, it would require etcd server to be able to replay full history in the exact same sequence that happened at the server. You seem to imply that a watch guarantees that, full history replay one revision at a time (one-revision-per-watch-notification at a time). Is that the guarantee? Because that's not explicit from the list of guarantees in https://etcd.io/docs/v3.5/learning/api_guarantees/#watch-apis
Also, to provide some context as for why I care: my team and myself have tried to use the concurrency API from java. We ran into issues that we reported earlier:
https://github.com/etcd-io/etcd/issues/17623
which at the point we realized were known; you answered to our issue report, thanks, and provided a link to an earlier related report:
https://github.com/etcd-io/etcd/issues/17502
you mentioned in your comment to the first report above that the concurrency API is known to be buggy and that we shouldn't use it; so we are in the process of re-implementing our java code to implement campaign/observe/proclaim/resign client side; our plan is to use that ourselves first and then donate the code to jetcd. As part of understanding/translating the go code, we (believe) we found this issue.
If it is not an issue and there is a guarantee the watcher will never return more than one event in this case, it would be good to understand why. We have been trying to use the master election features in etcd from java for a while and we have run into several issues; we have invested effort on this way beyond our initial expectation/plans. We are grateful for the etcd infrastructure and we have made it a critical component of our own system to outsource the fault tolerance of our system to it (which makes enormous sense, since we don't want to reinvent the fault tolerance wheels here). But these master election APIs look to be somewhat less commonly used parts of etcd and we are a bit uncertain to what level we can trust them; the history of bug reports here seems to validate some of those concerns.
- The initial GET with the campaign prefix returns empty at revision X
- The watch for the campaign prefix at revision (X+1) is started.
- A network issue prevents etcd client from talking to etcd server for a window of time T
- During T, campaign candidate 1 shows up and registers its PUT at revision X+1
- During T, campaign candidate 2 shows up after candidate 1 and registers its put at revision X+2
Eventually I got your point. It can be easily reproduced with some change something like below,
$ git diff
diff --git a/client/v3/concurrency/election.go b/client/v3/concurrency/election.go
index ac1303dd8..9193a6e1c 100644
--- a/client/v3/concurrency/election.go
+++ b/client/v3/concurrency/election.go
@@ -18,6 +18,7 @@ import (
"context"
"errors"
"fmt"
+ "time"
pb "go.etcd.io/etcd/api/v3/etcdserverpb"
"go.etcd.io/etcd/api/v3/mvccpb"
@@ -180,6 +181,9 @@ func (e *Election) observe(ctx context.Context, ch chan<- v3.GetResponse) {
return
}
+ fmt.Println("Sleeping 20 seconds")
+ time.Sleep(20 * time.Second)
+
var kv *mvccpb.KeyValue
var hdr *pb.ResponseHeader
@@ -195,6 +199,11 @@ func (e *Election) observe(ctx context.Context, ch chan<- v3.GetResponse) {
return
}
// only accept puts; a delete will make observe() spin
+ fmt.Printf("###### len(wr.Events) = %d\n", len(wr.Events))
+ for i, ev := range wr.Events {
+ fmt.Printf("###### %d, %+v\n", i, ev)
+ }
+
for _, ev := range wr.Events {
if ev.Type == mvccpb.PUT {
hdr, kv = &wr.Header, ev.Kv
Rough test steps:
./etcdctl elect hello -l
in one terminal;./etcdctl elect hello v1
in another terminal, Ctrl + C;./etcdctl elect hello v2
, Ctrl + C;./etcdctl elect hello v3
Around 20 seconds later, got output like below in the first terminal,
Sleeping 20 seconds
###### len(wr.Events) = 5
###### 0, &{Type:PUT Kv:key:"hello/694d9026db986209" create_revision:2 mod_revision:2 version:1 value:"v1" lease:7587879443756507657 PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 1, &{Type:PUT Kv:key:"hello/694d9026db98620f" create_revision:3 mod_revision:3 version:1 value:"v2" lease:7587879443756507663 PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 2, &{Type:DELETE Kv:key:"hello/694d9026db986209" mod_revision:4 PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 3, &{Type:DELETE Kv:key:"hello/694d9026db98620f" mod_revision:5 PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
###### 4, &{Type:PUT Kv:key:"hello/694d9026db986218" create_revision:6 mod_revision:6 version:1 value:"v3" lease:7587879443756507672 PrevKv:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
hello/694d9026db986209
v1
Unfortunately I do not see any document to clearly describe the expected behavior. I think we should get that clarified firstly.
Thanks for your investment of time here, I appreciate it.
I believe a higher level way to describe my worry is just to say "watches would have to be linearizable for the existing code to work". And it seems to me they can't be for performance reasons (which makes sense). This may support the case, in that it implies watch is only sequential/serializable, not linearizable:
https://github.com/etcd-io/etcd/issues/18141 https://github.com/etcd-io/etcd/pull/17918#discussion_r1593799944
Bug report criteria
What happened?
The implementation of observe does a get on the election key (prefix) https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L178 and if that get returns zero kvs, it does a watch: https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L190
The watch may return multiple PUT events, in which case the first one is returned (note the break statement in the if body matching a PUT event type): https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L201
It seems to me returning the first PUT event can be wrong if there are multiple PUTs in the returned
wr.Events
, unless I am misunderstanding this (which is possible, since there is a fair amount of complexity here). So please bear with me, and help me understand if this is not broken.I think instead of returning the first one, the PUT event with the lowest create revision should be returned, otherwise we risk returning the wrong leader. Returning the event with the lowest create revision would be consistent with what is done if the initial get is not empty. When the initial get is not empty, the first KV in the get range is returned https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L210
since the get was done with options for ascending order of creation https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/concurrency/election.go#L178 https://github.com/etcd-io/etcd/blob/5c3190903fa1115b083e22fe55f86e3a78d82f36/client/v3/op.go#L462
The only way I could see ignoring any other PUTs but the first not being an issue is if there was a guarantee that there will never be more than one PUT event in the returned
wr.Events
at that point in the code. I don't believe that is guaranteed, since watches are eventual consistent and there can be a network issue between the first get and this subsequent watch; if at the time of the first get there is no election leader (I don't mean etcd cluster leader, I mean elected candidate in the election API), and then a transient network error occurs, a retry of the watch may succeed at a time when there are several candidates already registered for the election, which I believe would make the watch return multiple PUTs.What did you expect to happen?
N/A
How can we reproduce it (as minimally and precisely as possible)?
N/A
Anything else we need to know?
N/A
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
N/A
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
N/A
Relevant log output
No response