etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.84k stars 9.77k forks source link

gRPC code Unavailable instead Canceled #10289

Closed arslanbekov closed 3 years ago

arslanbekov commented 5 years ago

After this PR https://github.com/etcd-io/etcd/pull/9178 we get:

It's in etcd running with parameter --log-package-levels=debug:

...
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL")
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 78; CANCEL")
...

in my prometheus i see:

sum by(grpc_type, grpc_service, grpc_code) (grpc_server_handled_total{grpc_code="Unavailable", job="etcd", grpc_type="bidi_stream"}) = 4172

But it seems to me, this not error, and should not fall under Unavailable Code

Unavailable indicates the service is currently unavailable. This is a most likely a transient condition and may be corrected by retrying with a backoff. See litmus test above for deciding between FailedPrecondition, Aborted, and Unavailable.

It looks like Canceled Code

Canceled indicates the operation was canceled (typically by the caller).

Now everyone who uses prometheus operator + alertmanager, get this alert, because CANCEL falls under Unavailable

9725, #9576, #9166 https://github.com/openshift/origin/issues/20311, https://github.com/coreos/prometheus-operator/issues/2046, and in google groups

xiang90 commented 5 years ago

@Arslanbekov From #9178, it seems that gRPC returned code is unavailable, but the description is canceled. Can you open an issue in grcp-go to ask?

arslanbekov commented 5 years ago

@xiang90

but the description is canceled

are you talking about it? https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113

I will create issue in grpc-go, thanks

xiang90 commented 5 years ago

yes. thank you.

menghanl commented 5 years ago

The error string is generated from x/net/http2. The error description comes from method Error().

The error conversion should be handled by mapRecvMsgError, which also calls Error().

Can you add a print to mapRecvMsgError to print the error type and error details? Something like fmt.Printf("%T %v", err, err). Thanks.

hexfusion commented 5 years ago

@Arslanbekov just touching base on this did you see @menghanl (grpc-go maintainer) comment above?

arslanbekov commented 5 years ago

Yes, @hexfusion thanks for notifying, there is no free time to do it now. Maybe someone will help?

hexfusion commented 5 years ago

Yes, @hexfusion thanks for notifying, there is no free time to do it now. Maybe someone will help?

@Arslanbekov I will try to take a look soon.

menghanl commented 5 years ago

@hexfusion Did you get time to try this? Thanks!

hexfusion commented 5 years ago

@menghanl thanks for the reminder, I will work on it before the end of the week.

hexfusion commented 5 years ago

@spzala could you please take a look?

spzala commented 5 years ago

@hexfusion I sure will, thanks!

spzala commented 5 years ago

@hexfusion I sure will, thanks!

Quick update - I had to spend some time creating a cluster with tls but I have it created now, so should be providing more update soon. Thanks!

spzala commented 5 years ago

So not sure if I am doing something incorrect or in a different way but I could not reproduce the issue exactly but I do see rpc error: code = Unavailable (i.e. https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L108 instead of L113 as commented in the issue above) This is how I am testing: Run this on server: ./bin/etcd --data-dir=data.etcd --name ${ETCD_NAME} --advertise-client-urls ${ADVERTISE_CLIENT_URLS} --listen-client-urls ${LISTEN_CLIENT_URLS} --initial-cluster ${INITIAL_CLUSTER} --initial-cluster-state ${INITIAL_CLUSTER_STATE} --initial-cluster-token ${INITIAL_CLUSTER_TOKEN} --cert-file $CERT_FILE --trusted-ca-file $TRUSTED_CA_FILE --key-file $KEY_FILE --debug Run below commands on client (a separate VM) on two diff cli: ./bin/etcdctl --endpoints https://<ip>:2379 watch foo --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem ./bin/etcdctl --endpoints https://<ip>:2379 put foo bar --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem And then cancel the watch command (ctrl+c) which gives me following error most of the time, 2019-01-30 03:35:31.618639 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected”)

Also, @menghanl thanks but I couldn't see the print output in the etcd log after I add the line to mapRecvMsgError. Is it logged somewhere other than etcd server output? I also tried adding GRPC_GO_LOG_VERBOSITY_LEVEL=2 and other flags in env. Since I couldn't reproduce issue as explained above, I also tried couple of other Println in different package functions just to play around but didn't see anything in the log. Thanks!

menghanl commented 5 years ago

@spzala Re logging: Did you call grpclog.Info (or other grpclog functions)? If so:

If you called fmt.Print, it's logged to stdout, and gRPC doesn't do anything special to that.


The function (mapRecvMsgError) is just where I guess the problem is. If nothing is printed, it's totally possible that the bug happens in another codepath. But we can only confirm that when we reproduce the bug.

In the reproducing command, what should be the values of the configs (like ${ADVERTISE_CLIENT_URLS})? I will try to run it myself, but I'm not very familiar with etcd. Thanks!

spzala commented 5 years ago

@menghanl thanks much for the great detail!! and yes etcds uses SetLoggerV2 so that might be the reason. I will run again modifying it not to override default grpclog. Also for testing I used both, fmt.Printf and fmt.Println so as you guessed the problem might be something else but I will try again with grpclogger.

About the etcd server flags and their values - for the ones that I used here, I am copying below but you can find those and more flags at https://github.com/etcd-io/etcd/blob/master/etcdmain/help.go. With tls enabled, I have used https instead of http e.g. ADVERTISE_CLIENT_URLS=${HOST}:2379 where HOST is https://<fqdn>

--advertise-client-urls 'http://localhost:2379'
List of this member's client URLs to advertise to the public.
--listen-client-urls 'http://localhost:2379'
List of URLs to listen on for client traffic.
--initial-cluster 'default=http://localhost:2380'
Initial cluster configuration for bootstrapping.
--initial-cluster-state 'new'
Initial cluster state ('new' or 'existing').
--initial-cluster-token 'etcd-cluster'
Initial cluster token for the etcd cluster during bootstrap.
Specifying this can protect you from unintended cross-cluster interaction when running multiple clusters.
--cert-file ''
Path to the client server TLS cert file.
--key-file ''
Path to the client server TLS key file.

For the etcd client - https://github.com/etcd-io/etcd/tree/master/etcdctl

--cacert=/tmp/ca.pem            
verify certificates of TLS-enabled secure servers using this CA bundle
--cert=/tmp/cert.pem                
identify secure client using this TLS certificate file
-key=/tmp/key.pem
identify secure client using this TLS key file
jingyih commented 5 years ago

cc @jpbetz

hexfusion commented 5 years ago

Thanks for digging @spzala I am going to dig on this further.

spzala commented 5 years ago

Thanks @hexfusion !!

hexfusion commented 5 years ago

Update: @menghanl was right about the source of the issue the actual inbound error to mapRecvMsgError is.

rpc error: code = Unknown desc = client disconnected

I will connect the rest of the dots and come up with some resolution hopefully soon.

hexfusion commented 5 years ago

I have thought about this for a while and although it does feel like there could be a middle ground between Unknown and Unavailable error codes in this situation. For example by allowing Unknown to conditionally remain as the error code similar to [1] vs forcing Unavailable. If we look at the gRPC spec [1] the client generated Unavailable seems sane vs server. So perhaps this is not a matter of using the wrong error code but a result of gRPC having 2 server transports [3]? @menghanl curious your thoughts?

/cc @brancz

Some data transmitted (e.g., request metadata written to TCP connection) before connection breaks UNAVAILABLE Client
Server shutting down ...................................................................................... UNAVAILABLE Server

[1]https://github.com/grpc/grpc-go/blob/66cd5249103cfe4638864f08462537504b07ff1d/internal/transport/handler_server.go#L426 [2] https://github.com/grpc/grpc/blob/master/doc/statuscodes.md [3] https://github.com/grpc/grpc-go/issues/1819#issuecomment-360648206 (won't fix)

menghanl commented 5 years ago

@hexfusion Thanks for the findings!

actual inbound error to mapRecvMsgError is. rpc error: code = Unknown desc = client disconnected

Do you know where this error is from? This looks like an error returned from a service handler with fmt.Errorf, instead of status.Errorf with an error code. Also, the error reported originally is rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL, which has different description.


I think the issue raised here is not about which error code to return. It more about the confusion caused by error code Unavailable while the error description says CANCEL.

What we want to figure out is why (how) rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL is generated.

hexfusion commented 5 years ago

@menghanl thanks for the notes, your questions are valid, I will prioritize some time to get you those answers.

hectorj2f commented 5 years ago

@hexfusion Is there any progress about this issue ? Certain etcd3 alert rules are being triggered because of this issue https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32 ?

darwink commented 5 years ago

@hexfusion Is there any progress about this issue ? Certain etcd3 alert rules are being triggered because of this issue https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32 ?

Any update or plan to fix it?

siwyd commented 5 years ago

FYI, for those using the etcd mixin, I worked around this by tweaking the etcd selector:

        local etcd = import "etcd-mixin/mixin.libsonnet";

        etcd {
          _config+:: {
            etcd_selector: 'job="etcd",grpc_service!="etcdserverpb.Watch"'
          },
        }
hexfusion commented 5 years ago

I am picking this back up now, hopefully for the last time to resolve.

hexfusion commented 4 years ago

I have found the source of Unavailable instead of Canceled.

https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/watch.go#L198-L201

When the watch is canceled we pass error

https://github.com/etcd-io/etcd/blob/63dd73c1869f1784f907b922f61571176a2802e8/etcdserver/api/v3rpc/rpctypes/error.go#L66

Which is not valid resulting in codes.Unavailable passed to status.New() leading to metrics reporting Unavailable. In my local testing adding a new error using codes.Canceled allows the metrics to properly report.

I will raise a PR in a bit to resolve.

imunhatep commented 4 years ago

Any news on merging fixes?!

immanuelfodor commented 3 years ago

This could solve some ongoing issues with Prometheus monitoring, will this PR be ever merged? 😃