Closed npepinpe closed 2 months ago
If https://github.com/camunda/camunda/issues/19188 is accepted by the ZPA team, please adjust the docs accordingly to reference the new default value as well, and suggest a grpc_read_timeout (or equivalent in other ingresses) of twice that.
My only question here is where this kind of documentation should live: in the self-managed operation guides (but where?), or with the job worker documentation. For this, please consult someone from the DevEx team for their opinions during kickoff.
🧡 Thank you! When ready, please connect with @conceptualshark, as our emerging DRI of Self-Managed docs.
@conceptualshark - I'm working off a draft right now to put this in self-managed as a new page under zeebe-deployment/zeebe-gateway. I would also link the respective client pages to this page, of course.
I'm not sure if this is the best place though. We don't have a page about ingress config in general or reverse proxy usage either, where it could potentially also go.
👆 @conceptualshark ICYMI
Description
We have now had two incidents caused by long-lived job streams being closed unexpectedly with 504 errors. This is most prominent with nginx if users configure the parameter grpc_read_timeout, as idle workers may receive no "response" for a long time. It seems the nginx project has no plans to forward the HTTP/2 pings, or to treat them as keeping the connection alive.
So the recommendation for now is the following:
- Set grpc_read_timeout, e.g. to 12h.
- Set streamTimeout to be a little less than the nginx timeout, e.g. 10h. It's a good idea anyway to occasionally load balance your long-lived streams.
- [...] 45s, but it's configurable per client), and if upstream is gone then the ping will not be forwarded, the client will receive a 502, and the connection will be closed by nginx.
While this is nginx-specific, it can be useful for other reverse proxies (e.g. Traefik) which may have similar issues.
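As a rough illustration of the relationship between the two timeouts recommended above, here is a minimal Python sketch. The helper names check_stream_timeout and to_nginx_time are hypothetical and not part of any Camunda or nginx tooling; the sketch only checks that a chosen streamTimeout is shorter than the proxy's grpc_read_timeout and prints the matching nginx directive in seconds.

```python
from datetime import timedelta


def check_stream_timeout(proxy_read_timeout: timedelta, stream_timeout: timedelta) -> bool:
    """Return True if the client recycles its stream before the proxy read timeout can fire.

    Hypothetical helper: it encodes the rule "streamTimeout a little less than
    the nginx timeout" and nothing more.
    """
    return timedelta(0) < stream_timeout < proxy_read_timeout


def to_nginx_time(value: timedelta) -> str:
    """Format a duration as an nginx time value in whole seconds, e.g. '43200s'."""
    return f"{int(value.total_seconds())}s"


# Example values taken from the recommendation: 12h on nginx, 10h on the client.
nginx_timeout = timedelta(hours=12)
stream_timeout = timedelta(hours=10)

print(f"grpc_read_timeout {to_nginx_time(nginx_timeout)};")
print("stream timeout is safe:", check_stream_timeout(nginx_timeout, stream_timeout))
```

With a proxy timeout of 10 minutes and 1 second, as in the Context below, the same check fails for a 10h stream timeout. This matches the observed behavior of idle streams being cut off with a 504.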
Context
grpc_read_timeout is 10 minutes and 1 second. Idle workers are closed immediately after this with a 504.