fullstorydev / grpcurl

Like cURL, but for gRPC: Command-line tool for interacting with gRPC servers
MIT License

grpcurl fails with "context deadline exceeded" after 10s if using plaintext when server expects TLS #387

Open ucarion opened 1 year ago

ucarion commented 1 year ago

Bottom line up front, here's how you reproduce this issue:

$ grpcurl -version
grpcurl 1.8.6

$ time grpcurl -plaintext grpcb.in:9001 list
Failed to dial target host "grpcb.in:9001": context deadline exceeded
grpcurl -plaintext grpcb.in:9001 list  0.02s user 0.03s system 0% cpu 10.082 total

For context, grpcb.in:9001 expects TLS, so -plaintext is the mistake here. But the subject of this GitHub issue is that grpcurl hangs for 10 seconds and does not produce an informative error. I suspect the issue may be that the use of grpc.WithBlock() prevents an error from bubbling up, but I assume there's a good reason that dial option is used for some other purpose.

jhump commented 1 year ago

> I suspect the issue may be that the use of grpc.WithBlock() prevents an error from bubbling up,

I doubt that. That's actually how you get any error to bubble up. Otherwise, you never get any sort of feedback from Dial, because it does the actual TCP connection setup asynchronously and only returns an error if there is some other configuration problem with the options.

The issue here is where it fails. In grpcurl.BlockingDial, we try to control both dialing and a potential TLS handshake so that we can intercept any errors (which the underlying gRPC Go runtime library hides from the application), in order to give a decent error to the user.

The issue here is actually that the connections are set up just fine: all a plaintext connection cares about is getting the TCP connection. The other direction (using TLS in the client against a server that does not expect it) fails more cleanly: the TLS handshake fails, so the connection cannot be established, and that error does bubble up from dialing.

So the actual error is happening inside the gRPC runtime when it tries to send the HTTP/2 preface to the server. In this case, the server is expecting a TLS handshake, but doesn't receive one, so it immediately closes the connection. We're providing a grpc.FailOnNonTempDialError(true) dial option, in the hope that something like this would be bubbled up from the dial call. But apparently the server suddenly closing the connection (without any known reason) is interpreted as a temporary error. So the runtime keeps retrying, creating a new connection over and over, never getting a healthy one that can be used for sending an RPC.
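To make the failure mode above concrete, here is a minimal sketch using only the Go standard library. It is not grpcurl or grpc-go code: a local httptest TLS server stands in for a TLS-only gRPC server, and the helper names (newQuietTLSServer, plaintextAgainstTLS) are made up for illustration. It shows that the TCP dial succeeds and that the client only learns something is wrong when a later read fails because the server hung up.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newQuietTLSServer starts a local TLS-only HTTP server (a stand-in for a
// TLS-only gRPC server) and discards its handshake-error log output.
func newQuietTLSServer() *httptest.Server {
	srv := httptest.NewUnstartedServer(http.NotFoundHandler())
	srv.Config.ErrorLog = log.New(io.Discard, "", 0)
	srv.StartTLS()
	return srv
}

// plaintextAgainstTLS opens a plain TCP connection, sends the HTTP/2 client
// preface without any TLS handshake, and reports what the client observes.
func plaintextAgainstTLS(addr string) (dialErr, readErr error) {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return err, nil
	}
	defer conn.Close()

	// Bound the write and read so the sketch cannot hang.
	conn.SetDeadline(time.Now().Add(2 * time.Second))
	if _, err := conn.Write([]byte("PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n")); err != nil {
		return nil, err
	}
	// The server expected a TLS ClientHello, so it closes the connection
	// instead of answering; the read fails with EOF or a reset.
	buf := make([]byte, 64)
	_, readErr = conn.Read(buf)
	return nil, readErr
}

func main() {
	srv := newQuietTLSServer()
	defer srv.Close()

	dialErr, readErr := plaintextAgainstTLS(srv.Listener.Addr().String())
	fmt.Println("dial error:", dialErr) // the TCP connection itself is fine
	fmt.Println("read error:", readErr) // the hang-up only shows up here
}
```

The exact read error (EOF versus a connection reset) depends on the platform and timing, which is part of why the hang-up carries so little information for the client.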

A fix is possible, but it isn't simple. The custom dialer in grpcurl.BlockingDial will need to wrap the returned net.Conn so it has more visibility into connection closures. It could then (for example) fail fast if it sees repeated inexplicable hang-ups from the server, all before the grpc.Dial call completes, and it would have some sort of error to report, likely just "connection closed by peer".
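A rough sketch of the wrapping idea described above, using only the standard library. This is not the actual grpcurl implementation; observedConn, describe and readAfterPeerHangup are hypothetical names, and net.Pipe stands in for a real server that hangs up right after the connection is established.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"sync"
)

// observedConn is a hypothetical wrapper around net.Conn that remembers the
// first error returned by Read or Write, so code outside the transport can
// later ask why the connection died.
type observedConn struct {
	net.Conn
	mu       sync.Mutex
	firstErr error
}

// record stores err if it is the first non-nil error seen on this connection.
func (c *observedConn) record(err error) {
	if err == nil {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.firstErr == nil {
		c.firstErr = err
	}
}

func (c *observedConn) Read(p []byte) (int, error) {
	n, err := c.Conn.Read(p)
	c.record(err)
	return n, err
}

func (c *observedConn) Write(p []byte) (int, error) {
	n, err := c.Conn.Write(p)
	c.record(err)
	return n, err
}

// FirstErr returns the first I/O error observed, or nil if there was none.
func (c *observedConn) FirstErr() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.firstErr
}

// describe turns the recorded error into a message a user could act on.
func describe(err error) string {
	switch {
	case err == nil:
		return "no connection error observed"
	case errors.Is(err, io.EOF):
		return "connection closed by peer"
	default:
		return err.Error()
	}
}

// readAfterPeerHangup simulates a server that closes the connection as soon
// as it is established and returns what the wrapper recorded.
func readAfterPeerHangup() error {
	client, server := net.Pipe()
	server.Close()
	defer client.Close()

	oc := &observedConn{Conn: client}
	oc.Read(make([]byte, 16))
	return oc.FirstErr()
}

func main() {
	fmt.Println(describe(readAfterPeerHangup())) // prints "connection closed by peer"
}
```

A real fix would also need to count hang-ups across the transport's reconnect attempts and decide when to give up; that policy is left out here.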

ucarion commented 1 year ago

The presence of a custom dialer does make things different here. In the past, I've just used the default dialer and matched against the returned error message, but I presume the custom dialer must remain as-is for other reasons.

jhump commented 1 year ago

> I've just used the default dialer and matched against the returned error message

The custom dialer is actually only here to provide decent error messages. The "context deadline exceeded" error is what is coming from the grpc.Dial call, so "matched against the returned error message" wouldn't really help here. The custom dialers are only in place to intercept underlying network errors, so that we can use them to provide better error messages. The specific issue here is that the dialer is not instrumented to intercept all network errors -- we're missing out on whatever error is occurring after the connection is established, due to the server immediately closing the connection.
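As an illustration of "intercepting underlying network errors", here is a small standard-library sketch of a dial function that records the error it sees before returning it. The function shape func(context.Context, string) (net.Conn, error) matches what grpc.WithContextDialer accepts, but the code below does not import grpc-go, and errCapture, capturingDialer, closedLocalAddr and tryDial are names invented for this example rather than anything in grpcurl.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

// errCapture remembers the most recent error seen by the dial function.
type errCapture struct {
	mu  sync.Mutex
	err error
}

func (e *errCapture) set(err error) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.err = err
}

// Last returns the most recently captured dial error, or nil.
func (e *errCapture) Last() error {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.err
}

// capturingDialer returns a dial function that records any network error in
// capture before handing it back to the caller.
func capturingDialer(capture *errCapture) func(context.Context, string) (net.Conn, error) {
	return func(ctx context.Context, addr string) (net.Conn, error) {
		conn, err := (&net.Dialer{}).DialContext(ctx, "tcp", addr)
		if err != nil {
			capture.set(err)
			return nil, err
		}
		return conn, nil
	}
}

// closedLocalAddr returns a loopback address that, barring a race with
// another process, nothing is listening on.
func closedLocalAddr() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	addr := ln.Addr().String()
	ln.Close()
	return addr, nil
}

// tryDial dials addr once and returns both the captured and returned errors.
func tryDial(addr string) (captured, returned error) {
	capture := &errCapture{}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	conn, err := capturingDialer(capture)(ctx, addr)
	if conn != nil {
		conn.Close()
	}
	return capture.Last(), err
}

func main() {
	addr, err := closedLocalAddr()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	captured, _ := tryDial(addr)
	// A caller can report the captured network error instead of a generic
	// "context deadline exceeded".
	fmt.Println("captured dial error:", captured)
}
```

This only covers errors raised while the connection is being established. The gap described in this thread is the error that occurs after a successful dial, which this kind of dialer never sees.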

ucarion commented 1 year ago

Yeah, sorry, I misspoke -- in the past I've matched against the RPC call error, rather than the dial error, for this situation. Whether an error is from dialing versus calling an RPC has always been confusing to me, and I suspect it's not even something stable across grpc-go versions.

anitgandhi commented 7 months ago

We often ran into this problem with grpc-go clients, and the newer WithReturnConnectionError dial option is a nice alternative to WithBlock and FailOnNonTempDialError, because it bubbles up the underlying connection error. Combined with some other recent improvements to the grpc-go client (I believe in v1.54.x), TLS handshake errors also show up now.

kumarniraj01 commented 5 months ago

When attempting to use grpcurl to access a service deployed on an EC2 instance through a load balancer and target, using the following command: grpcurl -plaintext test.dev.xyz:9090 list, I encounter an error. The error message states: "Failed to dial target host 'test.dev.xyz:9090': context deadline exceeded."

Can anyone help me resolve this?
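One cause consistent with the rest of this thread is that the endpoint expects TLS while the client uses -plaintext. Whether that applies to this particular load balancer setup is not known. The sketch below is one way to check what a port speaks, using only the Go standard library. speaksTLS, suggestFlag and the two stand-in server helpers are hypothetical names, and the local httptest servers only exist so the example runs without a real endpoint.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// speaksTLS reports whether addr completes a TLS handshake. Certificate
// verification is skipped on purpose: this only probes the protocol and must
// not be used to establish trust.
func speaksTLS(addr string) (bool, error) {
	dialer := &net.Dialer{Timeout: 3 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return false, err
	}
	conn.Close()
	return true, nil
}

// suggestFlag turns the probe result into a hint about the -plaintext flag.
func suggestFlag(addr string) string {
	ok, err := speaksTLS(addr)
	if ok {
		return "TLS handshake succeeded: try without -plaintext"
	}
	return fmt.Sprintf("no TLS handshake (%v): -plaintext may be right, or the port is unreachable", err)
}

// newTLSStandIn and newPlainStandIn start local servers that stand in for a
// TLS endpoint and a plaintext endpoint respectively.
func newTLSStandIn() *httptest.Server {
	return httptest.NewTLSServer(http.NotFoundHandler())
}

func newPlainStandIn() *httptest.Server {
	return httptest.NewServer(http.NotFoundHandler())
}

func main() {
	tlsSrv := newTLSStandIn()
	defer tlsSrv.Close()
	plainSrv := newPlainStandIn()
	defer plainSrv.Close()

	for _, addr := range []string{
		tlsSrv.Listener.Addr().String(),
		plainSrv.Listener.Addr().String(),
	} {
		fmt.Printf("%s -> %s\n", addr, suggestFlag(addr))
	}
}
```

Against a real host, the probe would be called with the same host:port passed to grpcurl. A failed probe does not prove the port is plaintext, since an unreachable or firewalled port fails the same way.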

hayyaun commented 5 months ago

> We often ran into this problem with grpc-go clients, and the newer WithReturnConnectionError dial option is a nice alternative to WithBlock and FailOnNonTempDialError, because it bubbles up the underlying connection error. Combined with some other recent improvements to the grpc-go client (I believe in v1.54.x), TLS handshake errors also show up now.

This answer saved my day, thank you. In my case the error was caused by an expired certificate, and I couldn't even retrieve it correctly: WithBlock simply stops on any error.