camunda-community-hub / zeebe-client-node-js

Node.js client library for Zeebe Microservices Orchestration Engine
https://camunda-community-hub.github.io/zeebe-client-node-js/
Apache License 2.0

Disconnects from cloud #99

Closed: jwulf closed this issue 4 years ago

jwulf commented 4 years ago

Since Camunda Cloud went to Zeebe 0.21.1, this happens every day:

{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/zb/ZBWorker.js:122","id":"retrieve-spells_103","level":50,"message":"Stalled on gRPC error","pollMode":"Long Poll","taskType":"retrieve-spells","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.672Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:71","id":"gRPC Channel","level":50,"message":"GRPC ERROR: 1 CANCELLED: Received http2 header with status: 502","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.672Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:242","id":"gRPC Channel","level":50,"message":"GRPC Channel State: READY","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.673Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:77","id":"gRPC Channel","level":30,"message":"gRPC Channel reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.673Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:258","id":"gRPC Channel","level":30,"message":"gRPC Channel State: READY","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:259","id":"gRPC Channel","level":30,"message":"gRPC Retry count: 2","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:262","id":"gRPC Channel","level":30,"message":"gRPC reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:77","id":"gRPC Channel","level":30,"message":"gRPC Channel reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.856Z"}

The client reports it is connected, but does not retrieve any jobs.

Could this be due to pod rescheduling?
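
One way to make that "connected but silent" state visible from application code is to track when the worker last activated a job and warn after a long silence. A minimal sketch, assuming the 0.21-era createWorker(id, taskType, handler, options) signature; the worker id, handler body, and thresholds are hypothetical, and the retrieve-spells task type is taken from the log above:

```typescript
import { ZBClient } from 'zeebe-node'

// Camunda Cloud connection details (OAuth etc.) are omitted here.
const zbc = new ZBClient('localhost:26500')

let lastJobSeen = Date.now()

zbc.createWorker(
  'retrieve-spells-worker', // worker id (hypothetical)
  'retrieve-spells',        // task type taken from the log output above
  (job, complete) => {
    lastJobSeen = Date.now() // record activity whenever a job is activated
    // ...actual job handling goes here (hypothetical)...
    complete.success()
  },
  { longPoll: 30000 } // 30-second long poll, in milliseconds (assumed unit)
)

// Warn if the worker still claims to be connected but has activated no jobs
// for 10 minutes. This is only a heuristic: it cannot distinguish "no work
// available" from a silently stalled channel.
setInterval(() => {
  const silentMs = Date.now() - lastJobSeen
  if (silentMs > 10 * 60 * 1000) {
    console.warn(`No jobs activated for ${Math.round(silentMs / 1000)}s - channel may be stale`)
  }
}, 60000)
```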

jwulf commented 4 years ago

Running the 0.21.3 branch with the heartbeat code.

Connected fine. Ran at DEBUG log level for several hours, then switched to INFO to surface errors.

Load on the broker is next to nothing. Workers are polling at 30-second intervals, with ~12 workers connected.

[2019 Oct-29 09:49:22AM] ERROR:
    context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:71"
    id: "gRPC Channel"
    message: "GRPC ERROR: 1 CANCELLED: Received http2 header with status: 503"
    pollMode: "Long Poll"
    taskType: "gRPC Channel"
[2019 Oct-29 09:49:22AM] ERROR:
    context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:242"
    id: "gRPC Channel"
    message: "GRPC Channel State: READY"
    pollMode: "Long Poll"
    taskType: "gRPC Channel"

When the cluster reschedule finished, the client came back up and serviced tasks.

The 12:06 and 12:10 events were predicted in advance, so there is a pattern to these disconnects.

jwulf commented 4 years ago

Running the client against a local broker in Docker, against a remote broker in the same data centre (AWS US East), or against a K8s cluster via port-forwarding (AWS US East to GKE AU South East), I see none of these errors.

Therefore it is either the proxy or the broker config on Camunda Cloud.

jwulf commented 4 years ago

It's still disconnecting and requiring a reboot. Try this: if the channel is down for a set amount of time, destroy and recreate the channel.
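
A minimal sketch of that idea, using zbc.topology() as a cheap liveness probe and zbc.close() for teardown; the 5-minute threshold, the recreateClient helper, and the worker re-registration step are hypothetical and would need to match the application's own setup:

```typescript
import { ZBClient } from 'zeebe-node'

const DOWN_THRESHOLD_MS = 5 * 60 * 1000 // recreate after 5 minutes down (arbitrary)
const PROBE_INTERVAL_MS = 30000

let zbc = new ZBClient('localhost:26500')
let lastHealthy = Date.now()

// Hypothetical helper: tear down the stale client and build a fresh one,
// re-registering whatever workers the application needs.
async function recreateClient() {
  await zbc.close().catch(() => undefined) // best-effort teardown of the old channel
  zbc = new ZBClient('localhost:26500')
  // ...re-create workers on the new client here...
}

setInterval(async () => {
  try {
    await zbc.topology() // round-trip to the gateway as a liveness probe
    lastHealthy = Date.now()
  } catch (err) {
    if (Date.now() - lastHealthy > DOWN_THRESHOLD_MS) {
      console.warn('Channel down past threshold - destroying and recreating the client')
      await recreateClient()
      lastHealthy = Date.now()
    }
  }
}, PROBE_INTERVAL_MS)
```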

jwulf commented 4 years ago

It looks like the worker channels are durable, but the client channel used to send commands becomes stale.

At the moment you can't reliably inspect the state of the client channel because the worker channel state bubbles up through it. Will change that behaviour in #109.
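
Until #109 lands, one workaround consistent with that observation is to keep command sending on its own ZBClient instance, separate from the instance hosting workers, so the command channel can be watched (and recreated) independently. A rough sketch; the process id, variables, and handler body are hypothetical:

```typescript
import { ZBClient } from 'zeebe-node'

// One client instance dedicated to workers and their long-poll streams...
const workerClient = new ZBClient('localhost:26500')
workerClient.createWorker(null, 'retrieve-spells', (job, complete) => {
  // ...job handling (hypothetical)...
  complete.success()
})

// ...and a separate instance for commands, so a stale command channel can be
// probed (e.g. with topology()) and torn down without touching the workers.
const commandClient = new ZBClient('localhost:26500')

async function startSpellWorkflow() {
  // 0.2x-era API name; the process id and variables are made up for illustration
  return commandClient.createWorkflowInstance('spell-workflow', { spell: 'fireball' })
}
```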

MainAero commented 4 years ago

Any news on this?

jwulf commented 4 years ago

Are you seeing this issue in production? I would be surprised if you see it with Lambdas. It seems to affect long-lived connections.

jwulf commented 4 years ago

The Camunda Cloud team are now using this in production for their own systems, and are looking into the source of these issues.

afirth commented 4 years ago

Looked into this today with @colrad. We believe nginx receives the keepalive but doesn't pass it to the backend. Because the nginx <-> backend connection has no data on it, it's killed after 60s by the grpc_read_timeout. We've increased that timeout to 601s (1s longer than the default long poll of 10 minutes) in https://github.com/camunda-cloud/zeebe-controller-k8s/pull/185 - this should go live this week
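
On the client side, the same reasoning suggests keeping the worker's long poll comfortably shorter than whatever read timeout the proxy enforces, so every stream carries a request/response cycle before the proxy's idle timer fires. A tiny sketch; the 50-second value is an assumption chosen to sit under a 60s grpc_read_timeout, and milliseconds are assumed for the 0.2x-era longPoll option:

```typescript
import { ZBClient } from 'zeebe-node'

// Keep the long poll shorter than the proxy's grpc_read_timeout (60s in the
// default config discussed above), so each poll completes and is re-issued
// before the proxy considers the stream idle. Workers created from this
// client are assumed to inherit the setting.
const zbc = new ZBClient('localhost:26500', { longPoll: 50000 })
```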

afirth commented 4 years ago

Does https://trac.nginx.org/nginx/ticket/1555 sound familiar? We are experimenting with tuning http2_max_requests, but this comes with some caveats (it is basically a cap on a memory leak).

afirth commented 4 years ago

Any chance you could put a debug counter on the requests?
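
Something like the following could serve as a crude counter on the client side; a rough sketch only, and note that it counts activated jobs in the handler rather than raw gRPC polls, so it only approximates the request volume nginx sees:

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500')

// Crude per-process debug counters, logged periodically so request volume can
// be correlated with the disconnect timestamps in the logs above.
const counters = { jobsActivated: 0, jobsCompleted: 0 }

zbc.createWorker(null, 'retrieve-spells', (job, complete) => {
  counters.jobsActivated++
  // ...actual work (hypothetical)...
  counters.jobsCompleted++
  complete.success()
})

setInterval(() => console.info('debug counters', counters), 60000)
```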

jwulf commented 4 years ago

Fixed in 0.23.0.

afirth commented 4 years ago

@jwulf thanks for the fix! What was the root cause?