etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io

Watch does not fail on complete loss of connectivity #8980

Closed devnev closed 4 years ago

devnev commented 6 years ago

Steps to reproduce:

Expected outcome: After some timeout, the watch will fail. This would allow the caller to propagate an error further up the stack, telling clients to retry elsewhere where there might be a healthy connection.

Observed outcome: The watch blocks forever.

gyuho commented 6 years ago

telling clients to retry elsewhere where there might be a healthy connection.

This is already working (since v3.2.10+, master branch--pre v3.3.0).

You are running a single node, and the watch client is expected to be stuck with the single node, retrying--even with keepalive settings (e.g. ETCDCTL_API=3 etcdctl --debug --keepalive-time=5s --keepalive-timeout=5s watch --prefix "").

As of today (pre v3.3.0, gRPC v1.7.4), the retry logic is all manual: just try again on transient failure. This was partly because gRPC keepalive does not expose connection state (this now seems resolved; related: https://github.com/grpc/grpc-go/issues/1444). In the future, we may improve this for the single-node case.
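
To make "manual" concrete, here is a minimal sketch (my illustration, not the clientv3 internals; the import path assumes the current go.etcd.io module layout, and watchLoop/nextRev are made-up names) of an application-level retry loop that re-opens the watch after a failure, resuming from the last revision it saw:

package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchLoop is a hypothetical helper: whenever the watch channel closes or
// reports an error, open a new watch starting just after the last revision
// we observed. A real implementation would also handle compaction by
// re-reading the current state instead of resuming from nextRev.
func watchLoop(ctx context.Context, cli *clientv3.Client, prefix string) {
	var nextRev int64
	for ctx.Err() == nil {
		opts := []clientv3.OpOption{clientv3.WithPrefix()}
		if nextRev > 0 {
			opts = append(opts, clientv3.WithRev(nextRev))
		}
		for resp := range cli.Watch(ctx, prefix, opts...) {
			if err := resp.Err(); err != nil {
				break // e.g. compacted revision or cancelled watch
			}
			for _, ev := range resp.Events {
				_ = ev // handle the event
			}
			nextRev = resp.Header.Revision + 1
		}
		time.Sleep(time.Second) // back off before re-opening the watch
	}
}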

devnev commented 6 years ago

Sorry, but it's not clear to me what you're saying. Is it intended behaviour that the broken watch just sits there doing nothing?

I downloaded v3.2.11, which exhibits the same problematic behaviour with my command and does not support the keepalive flags in your command.

I do not understand what you mean by the retry logic is all manual: just try again on transient failure. Manual in what sense? What is a transient failure?

To be clear, the behaviour I expect to see at some point is: given the correct connection timeout options, a watch that gets disconnected will fail and return to the caller.

xiang90 commented 6 years ago

A temporary connection issue should not fail the watcher unless you explicitly specify a timeout (but the timeout will also fail long-running watchers, so probably nobody would want to do that) or a keepalive mechanism to kill it.
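
For illustration, a minimal sketch (mine, assuming the Go clientv3 API and current import path) of the explicit-timeout option, which shows why it is rarely what you want: the deadline tears down a healthy long-running watch just as readily as a broken one.

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchWithDeadline attaches an explicit timeout to the watch context.
// It returns when events stop flowing or the deadline expires, including
// on a perfectly healthy connection.
func watchWithDeadline(cli *clientv3.Client, prefix string, d time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	for resp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
		if err := resp.Err(); err != nil {
			log.Printf("watch error: %v", err)
			return
		}
		for _, ev := range resp.Events {
			log.Printf("%s %q", ev.Type, ev.Kv.Key)
		}
	}
	// the channel closes once ctx is done, whether or not etcd was reachable
	log.Printf("watch ended: %v", ctx.Err())
}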

xiang90 commented 6 years ago

After some timeout, the watch will fail. This would allow the caller to propagate an error further up the stack, telling clients to retry elsewhere where there might be a healthy connection.

You should provide the etcd client with all available endpoints, and the client is expected to re-connect by itself. If you want to control reconnection yourself, you basically need to build your own balancer, which is a non-trivial amount of effort to get right.

devnev commented 6 years ago

or a keepalive mechanism to kill it.

This is exactly what I'm looking for, and can't find. Trying your etcdctl watch command with all the timeouts, the command does not terminate when I SIGSTOP etcd.

edit: this is with etcd/etcdctl built from master

devnev commented 6 years ago

You should provide the etcd client with all available endpoints, and the client is expected to re-connect by itself. If you want to control reconnection yourself, you basically need to build your own balancer, which is a non-trivial amount of effort to get right.

I do not want to control reconnection. I want my own service's requests to fail when the service is completely unable to reach etcd, as the service is unable to operate correctly.

devnev commented 6 years ago

Hi - any updates? As far as I can tell this is still unresolved, despite the rather premature closing of the issue.

gyuho commented 6 years ago

@devnev Sorry! I was unclear. And we need better docs around this.

Is it intended behaviour that the broken watch just sits there doing nothing?

It's retrying the single endpoint, since your command only runs a single-node cluster. Again, if you want balancer failover, provide other endpoints via the --endpoints flag.

command does not terminate when i sigstop etcd.

This is expected. The watch command won't exit on connection loss or server shutdown. Keepalive in watch was added to detect connection failures and trigger endpoint switches, not to error out. We will document this clearly in the v3.3 release.

the service is completely unable to reach etcd

On the client side, there's no way to tell whether a failure is transient or permanent. That's why the client retries among the available endpoints, in case the server comes back.

When the etcd server goes down, the gRPC client receives rpc error: code = Unavailable desc = transport is closing. Its error code is Unavailable, indicating a transient failure from gRPC's point of view. When the client receives an obvious server error (e.g. etcdserver: mvcc: required revision has been compacted), the watch client terminates. We might expose connection state to the client (tracked in https://github.com/coreos/etcd/issues/8776), so that the client can decide whether to retry, depending on the connection state.
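
For illustration, a sketch (my own, assuming the current rpctypes import path) of how a caller can tell apart the terminal errors that do surface on the watch channel, such as compaction, from transient connection problems, which never do:

package main

import (
	"context"
	"errors"
	"log"

	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// handleWatch separates server-reported terminal errors (here: compaction)
// from transient failures, which the client keeps retrying internally.
func handleWatch(ctx context.Context, cli *clientv3.Client, key string) {
	for resp := range cli.Watch(ctx, key) {
		if err := resp.Err(); err != nil {
			if errors.Is(err, rpctypes.ErrCompacted) {
				// terminal: re-read current state and start a new watch
				log.Printf("watch compacted at rev %d: %v", resp.CompactRevision, err)
				return
			}
			log.Printf("watch cancelled: %v", err)
			return
		}
		// transient connection problems never show up here; the client
		// silently reconnects and keeps the channel open
		for _, ev := range resp.Events {
			_ = ev // handle the event
		}
	}
}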

Again, watch is not really meant for detecting disconnects. If you want to detect connection loss, I would try concurrency.Session with a TTL. Example usage in the lock command:

https://github.com/coreos/etcd/blob/7e0fc6136eb4c5dc3c8c39ff48eb29e87b7158ab/etcdctl/ctlv3/command/lock_command.go#L56
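
For anyone looking for the concrete shape of that workaround, here is a minimal sketch (my own, not from the etcd docs; import paths and the etcd1:2379 endpoint are assumptions) that pairs a watch with a concurrency.Session so that prolonged loss of connectivity surfaces as the session's lease keepalive lapsing:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd1:2379"},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session keeps a lease alive; if the client cannot reach etcd
	// for roughly the TTL, the keepalive lapses and Done() is closed.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(5))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	rch := cli.Watch(ctx, "foo", clientv3.WithPrefix())
	for {
		select {
		case resp, ok := <-rch:
			if !ok {
				log.Fatal("watch channel closed")
			}
			if err := resp.Err(); err != nil {
				log.Fatalf("watch error: %v", err)
			}
			for _, ev := range resp.Events {
				log.Printf("%s %q = %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
			}
		case <-sess.Done():
			// lease keepalive failed: treat it as loss of connectivity
			log.Fatal("session expired; lost connection to etcd")
		}
	}
}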

devnev commented 6 years ago

Thank you for the clarification and the workaround.

As a final note, even with multiple etcd members, there can be total connection loss to the etcd cluster (in our case this was caused by an as-yet-undiagnosed routing issue affecting only specific Kubernetes pods). One of the benefits I expected from watches was receiving updates quickly, making the system more reactive; if a watch gets completely disconnected, the system loses this property. I am glad to hear there's a workaround using sessions, but I will now have to wrap every single one of my watches in a session to preserve the reactivity of the system.

gyuho commented 6 years ago

As a final note, even with multiple etcd members, there can be total connection loss to the etcd cluster (in our case this was caused by an as-yet-undiagnosed routing issue affecting only specific Kubernetes pods).

Agreed. We will address this in the next release cycle (hopefully v3.4), and document the current watch limitations in the v3.3 release.

xiang90 commented 6 years ago

@devnev @gyuho

A retry option should be provided to users for watch. If the watcher cannot reconnect for, say, 5 seconds, the watcher can return an error.
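
Until such an option exists, one user-space approximation (a hypothetical sketch; watchOrFail and maxSilence are made-up names, and it relies on Watcher.RequestProgress, available since v3.4) is to demand periodic progress notifications and give up when none arrive:

package main

import (
	"context"
	"errors"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchOrFail returns an error if the watch produces neither events nor
// progress notifications for maxSilence, using RequestProgress to provoke
// a response from a live connection.
func watchOrFail(ctx context.Context, cli *clientv3.Client, prefix string, maxSilence time.Duration) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	rch := cli.Watch(ctx, prefix, clientv3.WithPrefix())
	ticker := time.NewTicker(maxSilence / 2)
	defer ticker.Stop()

	lastSeen := time.Now()
	for {
		select {
		case resp, ok := <-rch:
			if !ok {
				return errors.New("watch channel closed")
			}
			if err := resp.Err(); err != nil {
				return err
			}
			lastSeen = time.Now()
			for _, ev := range resp.Events {
				_ = ev // handle the event
			}
		case <-ticker.C:
			if time.Since(lastSeen) > maxSilence {
				return errors.New("no watch progress; assuming etcd is unreachable")
			}
			// ask the server to confirm liveness with a progress notify
			_ = cli.RequestProgress(ctx)
		}
	}
}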

embano1 commented 4 years ago

Thank you for this useful discussion. I am seeing the same behavior when simulating partitions (e.g. cutting the [virtual] network connection or using iptables). Even though my client had all cluster endpoints and tight timeout options configured, Watch() (even with WithRequireLeader) would hang forever. The underlying gRPC connection remains in TRANSIENT_FAILURE, and the corresponding watch channel is not closed even though lost leader == true.

My observation only holds when the client itself is partitioned from the majority, i.e. it cannot switch (balance) to another endpoint. This could happen when the client runs on the same etcd instance/node or in the same failure domain/zone. This is in line with what is discussed here, and it remains an open issue according to the Watch() docs:

// TODO(v3.4): configure watch retry policy, limit maximum retry number
// (see https://github.com/coreos/etcd/issues/8980)

I see the same behavior with the etcdctl CLI. For reference, this is my client setup with the corresponding timeouts:

// imports assume the v3.5+ module layout ("go.etcd.io/etcd/client/v3");
// older clients import the same package from "go.etcd.io/etcd/clientv3"
import (
    "context"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
    "google.golang.org/grpc"
)

cli, err := clientv3.New(clientv3.Config{
    Endpoints:            []string{"etcd1:2379", "etcd2:2379", "etcd3:2379"},
    DialTimeout:          2 * time.Second,
    DialKeepAliveTime:    5 * time.Second,
    DialKeepAliveTimeout: 2 * time.Second,
    DialOptions:          []grpc.DialOption{grpc.WithBlock()},
})
if err != nil {
    // handle dial failure
}

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

leaderCtx := clientv3.WithRequireLeader(ctx)
rch := cli.Watch(leaderCtx, "", clientv3.WithPrefix())
[...]

It would be great to update the docs for WithRequireLeader(), as they could be misread. A retry policy for Watch(), as discussed here and mentioned in the TODO comment, would also be very welcome.

For more details please also see https://github.com/kubernetes/kubernetes/issues/89488

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

chaochn47 commented 1 year ago

Somehow I ended up excavating this old issue; the symptom described by @embano1 is not reproducible, at least on the main branch. I tested this manually end-to-end by injecting iptables rules across the hosts of a 3-node etcd cluster, with the client running both remotely and locally (co-located with the server node the watch connects to).

Moreover, etcd has TestWatchWithRequireLeader and TestBalancerUnderNetworkPartitionWatchFollower to ensure the watch is cancelled with the etcdserver: no leader cancel reason, so the client knows to reconnect to an endpoint with quorum after around 3 seconds.

Implementation: https://github.com/etcd-io/etcd/blob/9e1abbab6e4ba2886238f49b0d48fc19a546c7cf/server/etcdserver/api/v3rpc/interceptor.go#L336-L346
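
As a sketch of what the client side of that can look like (my illustration, matching the cancel reason string from above rather than a sentinel error; requireLeaderWatch is a made-up name): re-open the watch when it is cancelled with the no-leader reason, letting the balancer pick an endpoint that still has quorum.

package main

import (
	"context"
	"log"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// requireLeaderWatch re-opens a WithRequireLeader watch after the server
// cancels it with the "etcdserver: no leader" reason.
func requireLeaderWatch(ctx context.Context, cli *clientv3.Client, prefix string) {
	for ctx.Err() == nil {
		wctx := clientv3.WithRequireLeader(ctx)
		for resp := range cli.Watch(wctx, prefix, clientv3.WithPrefix()) {
			if err := resp.Err(); err != nil {
				if strings.Contains(err.Error(), "no leader") {
					log.Printf("watch cancelled without leader, re-opening: %v", err)
					break
				}
				log.Printf("watch failed: %v", err)
				return
			}
			for _, ev := range resp.Events {
				_ = ev // handle the event
			}
		}
		time.Sleep(time.Second) // give the balancer time to switch endpoints
	}
}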

However, the original issue is not resolved. If a single node is abruptly stopped, the client watch will hang forever.

A retry option should be provided to users for watch. If the watcher cannot reconnect for, say, 5 seconds, the watcher can return an error.

This looks promising, and it would be optional for users to enable.