Closed arslanbekov closed 3 years ago
@Arslanbekov From #9178, it seems that gRPC returned code is unavailable, but the description is canceled. Can you open an issue in grcp-go to ask?
@xiang90
but the description is canceled
are you talking about it? https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113
I will create issue in grpc-go
, thanks
yes. thank you.
The error string is generated from x/net/http2. The error description comes from method Error()
.
The error conversion should be handled by mapRecvMsgError
, which also calls Error()
.
Can you add a print to mapRecvMsgError
to print the error type and error details?
Something like fmt.Printf("%T %v", err, err)
. Thanks.
@Arslanbekov just touching base on this did you see @menghanl (grpc-go maintainer) comment above?
Yes, @hexfusion thanks for notifying, there is no free time to do it now. Maybe someone will help?
Yes, @hexfusion thanks for notifying, there is no free time to do it now. Maybe someone will help?
@Arslanbekov I will try to take a look soon.
@hexfusion Did you get time to try this? Thanks!
@menghanl thanks for the reminder, I will work on it before the end of the week.
@spzala could you please take a look?
@hexfusion I sure will, thanks!
@hexfusion I sure will, thanks!
Quick update - I had to spend some time creating a cluster with tls but I have it created now, so should be providing more update soon. Thanks!
So not sure if I am doing something incorrect or in a different way but I could not reproduce the issue exactly but I do see rpc error: code = Unavailable
(i.e. https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L108 instead of L113 as commented in the issue above)
This is how I am testing:
Run this on server:
./bin/etcd --data-dir=data.etcd --name ${ETCD_NAME} --advertise-client-urls ${ADVERTISE_CLIENT_URLS} --listen-client-urls ${LISTEN_CLIENT_URLS} --initial-cluster ${INITIAL_CLUSTER} --initial-cluster-state ${INITIAL_CLUSTER_STATE} --initial-cluster-token ${INITIAL_CLUSTER_TOKEN} --cert-file $CERT_FILE --trusted-ca-file $TRUSTED_CA_FILE --key-file $KEY_FILE --debug
Run below commands on client (a separate VM) on two diff cli:
./bin/etcdctl --endpoints https://<ip>:2379 watch foo --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem
./bin/etcdctl --endpoints https://<ip>:2379 put foo bar --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem
And then cancel the watch command (ctrl+c) which gives me following error most of the time,
2019-01-30 03:35:31.618639 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected”)
Also, @menghanl thanks but I couldn't see the print output in the etcd log after I add the line to mapRecvMsgError
. Is it logged somewhere other than etcd server output? I also tried adding GRPC_GO_LOG_VERBOSITY_LEVEL=2
and other flags in env. Since I couldn't reproduce issue as explained above, I also tried couple of other Println in different package functions just to play around but didn't see anything in the log. Thanks!
@spzala
Re logging:
Did you call grpclog.Info
(or other grpclog
functions)? If so:
GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info
should print everything.SetLogger
or SetLoggerV2
), then you need to configure the custom logger to log everything.If you called fmt.Print
, it's logged to stdout
, and gRPC doesn't do anything special to that.
The function (mapRecvMsgError
) is just where I guess the problem is. If nothing is printed, it's totally possible that the bug happens in another codepath. But we can only confirm that when we reproduce the bug.
In the reproducing command, what should be the values of the configs (like ${ADVERTISE_CLIENT_URLS}
)? I will try to run it myself, but I'm not very familiar with etcd. Thanks!
@menghanl thanks much for the great detail!! and yes etcds uses SetLoggerV2
so that might be the reason. I will run again modifying it not to override default grpclog. Also for testing I used both, fmt.Printf and fmt.Println so as you guessed the problem might be something else but I will try again with grpclogger.
About the etcd server flags and their values - for the ones that I used here, I am copying below but you can find those and more flags at https://github.com/etcd-io/etcd/blob/master/etcdmain/help.go. With tls enabled, I have used https instead of http e.g. ADVERTISE_CLIENT_URLS=${HOST}:2379 where HOST is https://<fqdn>
--advertise-client-urls 'http://localhost:2379'
List of this member's client URLs to advertise to the public.
--listen-client-urls 'http://localhost:2379'
List of URLs to listen on for client traffic.
--initial-cluster 'default=http://localhost:2380'
Initial cluster configuration for bootstrapping.
--initial-cluster-state 'new'
Initial cluster state ('new' or 'existing').
--initial-cluster-token 'etcd-cluster'
Initial cluster token for the etcd cluster during bootstrap.
Specifying this can protect you from unintended cross-cluster interaction when running multiple clusters.
--cert-file ''
Path to the client server TLS cert file.
--key-file ''
Path to the client server TLS key file.
For the etcd client - https://github.com/etcd-io/etcd/tree/master/etcdctl
--cacert=/tmp/ca.pem
verify certificates of TLS-enabled secure servers using this CA bundle
--cert=/tmp/cert.pem
identify secure client using this TLS certificate file
-key=/tmp/key.pem
identify secure client using this TLS key file
cc @jpbetz
Thanks for digging @spzala I am going to dig on this further.
Thanks @hexfusion !!
Update: @menghanl was right about the source of the issue the actual inbound error to mapRecvMsgError is.
rpc error: code = Unknown desc = client disconnected
I will connect the rest of the dots and come up with some resolution hopefully soon.
I have thought about this for a while and although it does feel like there could be a middle ground between Unknown
and Unavailable
error codes in this situation. For example by allowing Unknown
to conditionally remain as the error code similar to [1] vs forcing Unavailable
. If we look at the gRPC spec [1] the client generated Unavailable
seems sane vs server. So perhaps this is not a matter of using the wrong error code but a result of gRPC having 2 server transports [3]? @menghanl curious your thoughts?
/cc @brancz
Some data transmitted (e.g., request metadata written to TCP connection) before connection breaks | UNAVAILABLE | Client |
---|
Server shutting down ...................................................................................... | UNAVAILABLE | Server |
---|
[1]https://github.com/grpc/grpc-go/blob/66cd5249103cfe4638864f08462537504b07ff1d/internal/transport/handler_server.go#L426 [2] https://github.com/grpc/grpc/blob/master/doc/statuscodes.md [3] https://github.com/grpc/grpc-go/issues/1819#issuecomment-360648206 (won't fix)
@hexfusion Thanks for the findings!
actual inbound error to mapRecvMsgError is.
rpc error: code = Unknown desc = client disconnected
Do you know where this error is from? This looks like an error returned from a service handler with fmt.Errorf
, instead of status.Errorf
with an error code.
Also, the error reported originally is rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL
, which has different description.
I think the issue raised here is not about which error code to return.
It more about the confusion caused by error code Unavailable
while the error description says CANCEL
.
What we want to figure out is why (how) rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL
is generated.
@menghanl thanks for the notes, your questions are valid, I will prioritize some time to get you those answers.
@hexfusion Is there any progress about this issue ? Certain etcd3 alert rules are being triggered because of this issue https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32 ?
@hexfusion Is there any progress about this issue ? Certain etcd3 alert rules are being triggered because of this issue https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32 ?
Any update or plan to fix it?
FYI, for those using the etcd mixin, I worked around this by tweaking the etcd selector:
local etcd = import "etcd-mixin/mixin.libsonnet";
etcd {
_config+:: {
etcd_selector: 'job="etcd",grpc_service!="etcdserverpb.Watch"'
},
}
I am picking this back up now, hopefully for the last time to resolve.
I have found the source of Unavailable
instead of Canceled
.
https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/watch.go#L198-L201
When the watch is canceled we pass error
Which is not valid resulting in codes.Unavailable
passed to status.New() leading to metrics reporting Unavailable
. In my local testing adding a new error using codes.Canceled
allows the metrics to properly report.
I will raise a PR in a bit to resolve.
Any news on merging fixes?!
This could solve some ongoing issues with Prometheus monitoring, will this PR be ever merged? 😃
After this PR https://github.com/etcd-io/etcd/pull/9178 we get:
It's in etcd running with parameter
--log-package-levels=debug
:in my prometheus i see:
But it seems to me, this not error, and should not fall under Unavailable Code
It looks like Canceled Code
Now everyone who uses prometheus operator + alertmanager, get this alert, because CANCEL falls under Unavailable
9725, #9576, #9166 https://github.com/openshift/origin/issues/20311, https://github.com/coreos/prometheus-operator/issues/2046, and in google groups