etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.83k stars 9.77k forks source link

EtcdError is missing GRPCStatus that causes grpc.status.FromError to always return codes.Unknown #15539

Open lavacat opened 1 year ago

lavacat commented 1 year ago

What happened?

I've discovered this issue when debugging watch not retrying in openWatchClient

isHaltErr was returning true on ErrLeaderChanged

This is only applicable when auth is enabled. w.remote.Watch(w.ctx, w.callOpts...) -> streamClientInterceptor-> getToken -> Auth.Authenticate -> toErr toErr will convert ErrGRPCLeaderChanged to ErrLeaderChanged, isHaltErr(ErrLeaderChanged) returns true

When auth is disabled, Auth.Authenticate isn't called and toErr conversion doesn't happen.

What did you expect to happen?

expected watch to retry

How can we reproduce it (as minimally and precisely as possible)?

See unit and integration test in the PR.

Note, test is using ErrGRPCNoLeader, but in production we've observed ErrLeaderChanged.

Anything else we need to know?

Another side effect of this issue is that prometheus interceptor will only increment Unknown grpc code in metrics on error.

I suspect this might also affect behavior of retry_interceptor and client/v3/leasing that retries on transient error.

Etcd version (please run commands below)

etcd 3.4, 3.5 and main

```console $ etcd --version etcd Version: 3.6.0-alpha.0 Git SHA: f43dccb2f Go Version: go1.19.2 Go OS/Arch: darwin/amd64 $ etcdctl version etcdctl version: 3.6.0-alpha.0 API version: 3.6 ```

Etcd configuration (command line flags or environment variables)

# paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console $ etcdctl member list -w table # paste output here $ etcdctl --endpoints= endpoint status -w table # paste output here ```

Relevant log output

No response

lavacat commented 1 year ago

this is related to auth implementation cc @mitake @ahrtr

serathius commented 1 year ago

cc @ptabor

mitake commented 1 year ago

Thanks @lavacat , probably I'll be able to check sometime Wednesday, sorry for keeping you waiting.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.