Document recommended configuration for worker job streaming with reverse proxies

npepinpe commented 3 months ago

Description

We have now had two incidents caused by long-lived job streams being closed unexpectedly with 504 errors. This is most prominent with nginx when users configure the grpc_read_timeout parameter, since the stream of an idle worker may receive no "response" for a long time. It seems the nginx project has no plans to forward HTTP/2 pings, or to treat them as activity that keeps the connection alive.

So the recommendation for now is the following:

- On nginx, set a high grpc_read_timeout, e.g. 12h.
- On the client side, set the job worker streamTimeout to a little less than the nginx timeout, e.g. 10h. It's a good idea anyway to occasionally load balance your long-lived streams.
- Upstream failures will likely be detected as 502s. The client sends keep-alives (the default interval is 45s, but it's configurable per client); if the upstream is gone, the ping will not be forwarded, so the client will receive a 502 and nginx will close the connection.
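
A minimal sketch of the nginx side of this recommendation, assuming nginx proxies gRPC directly to the Zeebe gateway. The upstream host name `zeebe-gateway` is hypothetical, and the port assumes the gateway's default of 26500; adjust both to your deployment.

```nginx
location / {
    # Hypothetical upstream address; replace with your gateway host and port.
    grpc_pass grpc://zeebe-gateway:26500;

    # Let idle job streams stay open for up to 12 hours before nginx
    # closes them with a 504.
    grpc_read_timeout 12h;
}
```

With this in place, the worker's streamTimeout would be set somewhat lower on the client side (e.g. 10h), so that the client recreates the stream before nginx times it out.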

While this is nginx-specific, it can be useful for other reverse proxies (e.g. Traefik), which may have similar issues.

My only question here is where this kind of documentation should live: in the self-managed operation guides (but where?), or with the job worker documentation. For this, please consult someone from the DevEx team for their opinion during kickoff.

Context

This affects us in SaaS, where the default grpc_read_timeout is 10 minutes and 1 second. Streams of idle workers are closed with a 504 as soon as this timeout elapses.
This was related to two support issues:
- https://jira.camunda.com/browse/SUPPORT-22069
- https://jira.camunda.com/browse/SUPPORT-22216

npepinpe commented 3 months ago

If https://github.com/camunda/camunda/issues/19188 is accepted by the ZPA team, please adjust the docs accordingly to reference the new default value as well, and suggest a grpc_read_timeout (or the equivalent in other ingresses) of twice that.

akeller commented 3 months ago

> My only question here is where this kind of documentation should live: in the self-managed operation guides (but where?), or with the job worker documentation. For this, please consult someone from the DevEx team for their opinion during kickoff.

🧡 Thank you! When ready, please connect with @conceptualshark, as our emerging DRI of Self-Managed docs.

npepinpe commented 3 months ago

@conceptualshark - I'm working off a draft right now to put this in self-managed as a new page under zeebe-deployment/zeebe-gateway. I would also link the respective client pages to this page of course.

I'm not sure if this is the best place though. We don't have a page about ingress config in general or reverse proxy usage either, where it could potentially also go.

akeller commented 2 months ago

> @conceptualshark - I'm working off a draft right now to put this in self-managed as a new page under zeebe-deployment/zeebe-gateway. I would also link the respective client pages to this page of course.
>
> I'm not sure if this is the best place though. We don't have a page about ingress config in general or reverse proxy usage either, where it could potentially also go.

👆 @conceptualshark ICYMI

camunda / camunda-docs

Document recommended configuration for worker job streaming with reverse proxies #3903
