jwulf closed this issue 4 years ago.
Running the 0.21.3 branch with the heartbeat code.
Connected fine. Ran at DEBUG log level for several hours, then switched to INFO to surface errors.
Load on the broker is next to nothing. Workers are polling at 30-second intervals, with ~12 workers connected.
[2019 Oct-29 09:49:22AM] ERROR:
context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:71"
id: "gRPC Channel"
message: "GRPC ERROR: 1 CANCELLED: Received http2 header with status: 503"
pollMode: "Long Poll"
taskType: "gRPC Channel"
[2019 Oct-29 09:49:22AM] ERROR:
context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:242"
id: "gRPC Channel"
message: "GRPC Channel State: READY"
pollMode: "Long Poll"
taskType: "gRPC Channel"
When the cluster reschedule finished, the client came back up, and serviced tasks.
10:06:34: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×3)
10:10:36: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×1)
10:29:15: Worker still working.
10:43:43: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×1)
10:43:48: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×1)
10:52:53: Worker still working.
11:06:32: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×1)
11:10:19: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×1)
11:10:33: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×2)
11:29:01: Worker still working.
11:38:33: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×2)
11:47:57: Worker still working.
12:06:37: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×2)
12:10:12: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×2)
The 12:06 and 12:10 events were predicted in advance, so there is a repeating pattern.
12:17:00: Restarted the workers to determine whether the pattern is absolute-timed (dependent on server wall-clock time) or relative-timed (dependent on client elapsed time).
12:43:41: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received (×5)
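As a sanity check on that pattern, the gaps between the GOAWAY bursts above can be computed from the logged timestamps (a throwaway snippet, not part of the client code):

```typescript
// Compute the gaps, in whole minutes, between consecutive GOAWAY bursts
// from the timeline above, to make the recurrence visible.
const bursts = [
  "10:06:34", "10:10:36", "10:43:43", "10:43:48",
  "11:06:32", "11:10:19", "11:10:33", "11:38:33",
  "12:06:37", "12:10:12",
];

// Convert "HH:MM:SS" to seconds since midnight.
const toSeconds = (t: string): number => {
  const [h, m, s] = t.split(":").map(Number);
  return h * 3600 + m * 60 + s;
};

// Minutes between each burst and the previous one.
const gaps = bursts.slice(1).map(
  (t, i) => Math.round((toSeconds(t) - toSeconds(bursts[i])) / 60)
);

// gaps in minutes: [4, 33, 0, 23, 4, 0, 28, 28, 4]
// i.e. bursts cluster around the same minutes past each hour.
console.log(gaps);
```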
Running the client against a local broker in Docker, against a remote broker in the same data centre (AWS US East), or against a K8s cluster via port-forwarding (AWS US East to GKE AU South East) produces none of these errors.
Therefore the cause is either the proxy or the broker configuration on Camunda Cloud.
It's still disconnecting and requiring a restart. Try this: if the channel has been down for a set amount of time, destroy and recreate the channel.
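One way to sketch that destroy-and-recreate idea (all names here are hypothetical, not the zeebe-node API): track how long the channel has been non-READY, and past a threshold tear it down and build a fresh one.

```typescript
// Hypothetical channel watchdog sketch. None of these types or methods
// come from zeebe-node; this only shows the shape of the idea.

type ChannelState = "READY" | "CONNECTING" | "TRANSIENT_FAILURE";

interface Channel {
  getState(): ChannelState;
  destroy(): void;
}

class ChannelWatchdog {
  private downSince: number | null = null;

  constructor(
    private channel: Channel,
    private recreate: () => Channel,
    private maxDowntimeMs: number,
  ) {}

  // Call this on a timer (e.g. every few seconds).
  check(now: number = Date.now()): void {
    if (this.channel.getState() === "READY") {
      this.downSince = null; // healthy: reset the clock
      return;
    }
    if (this.downSince === null) {
      this.downSince = now; // just went down: start the clock
      return;
    }
    if (now - this.downSince >= this.maxDowntimeMs) {
      // Down too long: destroy the stale channel and recreate it.
      this.channel.destroy();
      this.channel = this.recreate();
      this.downSince = null;
    }
  }
}
```

The key design point is that a transient non-READY state does nothing; only a channel that stays down for the whole window gets recreated.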
It looks like the worker channels are durable, but the client channel used to send commands becomes stale.
At the moment you can't reliably inspect the state of the client channel because the worker channel state bubbles up through it. Will change that behaviour in #109.
Any news on this?
Are you seeing this issue in production? I would be surprised if you see it with Lambdas. It seems to affect long-lived connections.
The Camunda Cloud team are now using this in production for their own systems, and are looking into the source of these issues.
Looked into this today with @colrad. We believe nginx receives the keepalive but doesn't pass it through to the backend. Because the nginx <-> backend connection then has no data on it, it is killed after 60s by the grpc_read_timeout. We've increased that timeout to 601s (1s longer than the default long poll of 10 minutes) in https://github.com/camunda-cloud/zeebe-controller-k8s/pull/185 - this should go live this week.
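For context, the relevant nginx directive looks roughly like this (a sketch only; the upstream name is made up, and the actual Camunda Cloud proxy config lives in the PR linked above):

```nginx
# Sketch, not the actual Camunda Cloud config.
# With no data flowing on the nginx <-> backend connection, nginx closes it
# after grpc_read_timeout (default 60s). 601s is 1s longer than the client's
# 10-minute long poll, so an idle long poll can complete before the cut-off.
location / {
    grpc_pass grpc://zeebe_gateway;
    grpc_read_timeout 601s;
}
```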
Does https://trac.nginx.org/nginx/ticket/1555 sound familiar? We are experimenting with tuning http2_max_requests, but this comes with some caveats (it is basically a cap on a memory leak).
Any chance you could put a debug counter on the requests?
Fixed in 0.23.0.
@jwulf thanks for the fix! What was the root cause?
Since Camunda Cloud went to Zeebe 0.21.1, this happens every day:
The client reports it is connected, but does not retrieve any jobs.
Could this be due to pod rescheduling?