cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices

Better recovery from unhealthy controller connection? #152

Closed ruimarinho closed 5 months ago

ruimarinho commented 6 months ago

Hi,

Recently, our workers have started getting stuck after the keepalive ping fails due to an ACK timeout. This coincides with a migration on our side from an AWS Classic Load Balancer to a Network Load Balancer. There's something about that change that orchard is more sensitive to.

I'm wondering if this behavior can be improved so that the gRPC stream is automatically restarted when it fails to actually communicate. I'm assuming there is an issue preventing the retry logic from working, since a simple orchard restart immediately resolves the problem and there is no apparent loss of network connectivity at any stage.
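
For illustration, something along these lines is what I have in mind on the worker side — a minimal sketch assuming grpc-go's keepalive package and a placeholder runWatchStream, not Orchard's actual client code:

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client-side keepalive: ping every 30s and drop the connection if no
	// ACK arrives within 10s. Values are illustrative, not Orchard's defaults.
	conn, err := grpc.Dial("orchard-controller:443",
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second,
			Timeout:             10 * time.Second,
			PermitWithoutStream: true,
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Re-open the stream whenever it ends (e.g. with codes.Unavailable
	// after a keepalive ACK timeout) instead of staying wedged.
	for {
		err := runWatchStream(context.Background(), conn)
		log.Printf("watch stream ended: %v, reconnecting in 5s...", err)
		time.Sleep(5 * time.Second)
	}
}

// runWatchStream is a stand-in for opening the worker's watch stream against
// the controller and blocking until it returns an error.
func runWatchStream(ctx context.Context, conn *grpc.ClientConn) error {
	_ = ctx
	_ = conn
	return nil // placeholder: the real worker would read events here
}
```

Here are the worker logs from when it got stuck: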

{"level":"info","ts":1706057743.510983,"msg":"syncing 0 local VMs against 0 remote VMs..."}
{"level":"debug","ts":1706057748.3017972,"msg":"got worker from the API"}
{"level":"debug","ts":1706057748.3993149,"msg":"updated worker in the API"}
{"level":"info","ts":1706057748.5070028,"msg":"syncing 0 local VMs against 0 remote VMs..."}
{"level":"warn","ts":1706057773.640198,"msg":"failed to watch RPC: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}
{"level":"info","ts":1706057773.9021702,"msg":"connecting to orchard-controller:443 over gRPC"}
{"level":"info","ts":1706057773.903039,"msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1706057774.88195,"msg":"running gRPC stream with the controller"}

That last message remains the final entry in the log until a restart is performed, and the orchard controller does not receive any pings from the worker from that point in time onward (inclusive).

A typical successful log message group would show:

{"level":"info","ts":1706057774.88195,"msg":"running gRPC stream with the controller"}
{"level":"info","ts":1706099569.125219,"msg":"registered worker ci-1"}
{"level":"info","ts":1706099569.1264691,"msg":"connecting to orchard-controller:443 over gRPC"}
{"level":"info","ts":1706099569.127172,"msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1706099569.243507,"msg":"syncing on-disk VMs..."}
{"level":"debug","ts":1706099569.2437172,"msg":"running 'tart list --format json'"}

Thank you!

ruimarinho commented 6 months ago

An initial observation from looking at the metrics is that orchard is not recovering from TCP RST packets sent by the NLB.

fkorotkov commented 6 months ago

We are also investigating switching from grpc-go, which has its own networking stack, to https://connectrpc.com/, which is based on Go's standard HTTP client and provides gRPC compatibility.
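
Roughly, a connect-go client is built directly on top of a standard *http.Client; a minimal sketch (the controllerv1connect package and service names are hypothetical, not Orchard's actual generated code):

```go
package main

import (
	"net/http"

	"connectrpc.com/connect"

	// Hypothetical generated package; Orchard's real proto packages differ.
	"example.com/orchard/gen/controller/v1/controllerv1connect"
)

func newControllerClient() controllerv1connect.ControllerServiceClient {
	// connect-go clients take a plain *http.Client, so the standard
	// net/http stack (HTTP/2 via ALPN, proxies, timeouts) handles transport.
	return controllerv1connect.NewControllerServiceClient(
		http.DefaultClient,
		"https://orchard-controller:443",
		connect.WithGRPC(), // keep speaking the gRPC wire protocol
	)
}
```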

edigaryev commented 5 months ago

Hello Rui 👋

I've tried reproducing your issue by running an Orchard Controller on an EC2 instance behind an NLB, and I think the issue is not with the gRPC stream, because it works just fine in the presence of the hard-coded 350-second idle timeout imposed by the NLB.

There's a great article, gRPC Keepalives and Transport Closing Errors on AWS Network Load Balancers, which discusses the communication problems between an NLB and a gRPC server/client pair with improper keep-alive settings, and how they can be fixed. However, this doesn't seem to be our case, as we're actually (1) using streams and (2) have keep-alive configured on the server too (which should not result in GOAWAY messages).
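
For reference, server-side keep-alive in grpc-go looks roughly like this — the intervals below are illustrative values chosen to stay well under the NLB's 350-second idle timeout, not Orchard's actual settings:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		// Ping idle connections well before the NLB's 350-second idle
		// timeout so the load balancer never silently drops them.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    2 * time.Minute,  // ping after this much idle time
			Timeout: 20 * time.Second, // wait this long for the ACK
		}),
		// Tolerate client pings at a similar rate instead of answering
		// them with GOAWAY ("too_many_pings").
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             time.Minute,
			PermitWithoutStream: true,
		}),
	)
}
```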

Looking at the last log lines you've posted before the worker got stuck:

{"level":"info","ts":1706057773.903039,"msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1706057774.88195,"msg":"running gRPC stream with the controller"}

The next lines should be got worker from the API, updated worker in the API, and so on.

The only reason I can see for these lines not being emitted is that Orchard's HTTP client is not making any progress, which should be addressed by https://github.com/cirruslabs/orchard/pull/153.
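
For context, the generic way to keep such a client from blocking forever is to bound every request — a sketch of the pattern (updateWorker and the URL are placeholders, and this is not necessarily what that PR implements):

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// A generic sketch of bounding API calls so a wedged TCP connection surfaces
// as an error instead of hanging the worker's update loop forever.
var apiClient = &http.Client{
	Timeout: 60 * time.Second, // hard cap on the whole request
}

// updateWorker is a hypothetical helper, not Orchard's actual code.
func updateWorker(ctx context.Context, url string) error {
	// Per-request deadline so callers can also cancel early.
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, nil)
	if err != nil {
		return err
	}
	resp, err := apiClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}
```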

edigaryev commented 5 months ago

Please check out the new 0.15.1 release; it should help with your issue.

Closing for now; please let us know if you ever encounter this again.