Closed — ruimarinho closed this issue 5 months ago
Our initial observation from the metrics is that Orchard is not recovering from TCP RST packets sent by the NLB.
We are also investigating switching from grpc-go, which does its own networking, to https://connectrpc.com/, which is based on Go's standard HTTP client and provides gRPC compatibility.
Hello Rui 👋
I've tried reproducing your issue by running an Orchard Controller on an EC2 instance behind an NLB, and I think the issue is not with the gRPC stream, because the stream works just fine even with the hard-coded 350-second idle timeout imposed by the NLB.
There's a great article, gRPC Keepalives and Transport Closing Errors on AWS Network Load Balancers, which discusses communication problems between an NLB and a gRPC server/client pair with improper keep-alive settings, and how they can be fixed. However, this doesn't seem to be our case, as we're actually (1) using streams and (2) have keep-alive configured on the server too (which should not result in GO_AWAY messages).
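For reference, the keep-alive tuning that article describes maps to grpc-go options roughly as follows. This is a sketch with illustrative values, not Orchard's actual configuration: the key points are that the client pings well before the NLB's 350-second idle timeout, and the server's enforcement policy permits those pings (otherwise the server answers them with GOAWAY and closes the transport).

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialOptions configures client-side keep-alive: ping every 30 s even
// when no RPC is active, so the NLB never sees an idle flow.
func dialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping interval (well under 350 s)
			Timeout:             10 * time.Second, // give up if the ping ACK never arrives
			PermitWithoutStream: true,             // ping even with no active streams
		}),
	}
}

// serverOptions allows such frequent client pings and sends the
// server's own keep-alive pings as well.
func serverOptions() []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             15 * time.Second, // don't GOAWAY clients that ping every 30 s
			PermitWithoutStream: true,
		}),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second,
			Timeout: 10 * time.Second,
		}),
	}
}
```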
Looking at the last log lines that you've posted before the worker got stuck:
```
{"level":"info","ts":1706057773.903039,"msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1706057774.88195,"msg":"running gRPC stream with the controller"}
```
The next lines should be `got worker from the API`, `updated worker in the API`, and so on.
The only reason I see for these lines not being emitted is that Orchard's HTTP client is not making any progress, which should be addressed by https://github.com/cirruslabs/orchard/pull/153.
Please check out the new 0.15.1 release; it should help with your issue.
Closing for now; please let us know if you ever encounter this again.
Hi,
Recently, our workers started becoming stuck after the keepalive ping fails due to an ACK timeout. This coincides with a migration on our side from an AWS Classic Load Balancer to a Network Load Balancer. There's something about that change that Orchard is more sensitive to.
I'm wondering if this behavior can be improved to automatically restart the gRPC stream in case it fails to actually communicate? I'm assuming there is an issue preventing the retry logic from working, since a simple Orchard restart immediately resolves the problem and there is no apparent loss of network connectivity at any stage.
That last message remains in the log until a restart is performed, but the Orchard controller does not receive any pings from the worker from that point in time onwards.
A typical successful log message group would show:
Thank you!