cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices

Suggestion: (optionally) separate HTTP and gRPC service ports #102

Open ruimarinho opened 1 year ago

ruimarinho commented 1 year ago

Hi @edigaryev,

This is a follow-up to https://github.com/cirruslabs/orchard/pull/90 and https://github.com/cirruslabs/orchard/pull/93, which made it possible to use Orchard behind an nginx ingress in Kubernetes.

Just to recap, the Orchard client started failing to forward ports once it was put behind an nginx ingress. Initially, I thought it could be related to the lack of a certificate when using certificate-less bootstrap tokens, but @edigaryev was spot on that the missing piece of the puzzle was enabling the GRPCS backend for the /Controller endpoint on the nginx ingress, which proxies traffic to the Orchard controller¹.

This solution only worked intermittently, resulting in error messages from nginx such as:

2023/06/27 12:30:38 [error] 2282#2282: *83097928 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.10.10.100, server: orchard.example.internal, request: "POST /Controller/Watch HTTP/2.0", upstream: "grpcs://10.10.10.100:443", host: "orchard.example.internal:443"

or

2023/06/28 11:51:03 [error] 1253#1253: *1194409 no connection data found for keepalive http2 connection while sending request to upstream, client: 10.10.10.100, server: orchard.example.internal, request: "POST /Controller/Watch HTTP/2.0", upstream: "grpcs://10.10.10.100:443", host: "orchard.example.internal:443"

After spending a bit more time investigating the root cause, it became apparent that many other users were facing a similar issue: nginx struggling to multiplex HTTP/1.1 (and perhaps HTTP/2.0 too) and gRPC streams.

One of the suggestions I've seen so far is to limit the number of keepalive connections to upstreams to 1, effectively requiring a new connection to the backend for every client that connects to nginx. Unfortunately, the nginx ingress on Kubernetes uses a single upstream {} block managed by a Lua script, so this setting has the downside of affecting all upstream connections. For high-volume systems, this isn't a long-term solution.

The configuration needs to be applied via the nginx ingress controller ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller
data:
  upstream-keepalive-requests: "1"

This will add keepalive_requests 1; to the global upstream {} block.
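
With that in place, the relevant part of the generated nginx.conf ends up looking roughly like this (a simplified sketch; the real ingress-nginx template selects endpoints through a Lua balancer, so the server line is a placeholder):

upstream upstream_balancer {
  server 0.0.0.1;   # placeholder; real endpoints are picked by the Lua balancer
  balancer_by_lua_block {
    balancer.balance()
  }
  keepalive 320;
  keepalive_requests 1;   # injected by upstream-keepalive-requests: "1"
  keepalive_timeout 60s;
}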

Once nginx had applied this change, I suddenly started receiving a different error when using port forwarding; this time, the gRPC stream would be terminated by RST_STREAM, suggesting the connection was being aborted.

{"level": "warn", "ts": 1687962870.3219981, "msg": "failed to watch RPC: rpc error: code - Internal desc - stream terminated by RST_STREAM with error code: INTERNAL_ERROR"}

It turns out that for bidirectional gRPC streams, nginx's default timeout values are too low (60 seconds), which meant the upstream gRPC stream would be closed after that period. To fix this, I had to increase a few settings:

nginx.ingress.kubernetes.io/server-snippet: |
  grpc_read_timeout 3600s;
  grpc_send_timeout 3600s;
  client_body_timeout 3600s;
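
For context, here's a minimal sketch of an Ingress carrying those annotations (the resource name, host, service name and port are hypothetical; per the PRs above, the /Controller path is proxied with a GRPCS backend, while the REST paths can live in a separate Ingress with an HTTPS backend):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orchard-controller-grpc
  annotations:
    # proxy the gRPC endpoint to the controller over TLS
    nginx.ingress.kubernetes.io/backend-protocol: "GRPCS"
    # raise the default 60s timeouts for long-lived gRPC streams
    nginx.ingress.kubernetes.io/server-snippet: |
      grpc_read_timeout 3600s;
      grpc_send_timeout 3600s;
      client_body_timeout 3600s;
spec:
  ingressClassName: nginx
  rules:
    - host: orchard.example.internal
      http:
        paths:
          - path: /Controller
            pathType: Prefix
            backend:
              service:
                name: orchard-controller
                port:
                  number: 6120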

I'm still getting occasional 502s for which I don't have a good explanation yet. One option would be to retry on a 502, but I'm unsure whether grpc_next_upstream works with a single backend (I'm only running a single orchard-controller instance):

nginx.ingress.kubernetes.io/configuration-snippet: |
  grpc_next_upstream error timeout http_502 non_idempotent;
  grpc_next_upstream_tries 3;

Overall, I believe orchard-controller would benefit from an optional segregation of its REST and gRPC APIs onto separate ports, to isolate the two kinds of traffic and avoid potential downsides from unexpected proxy behaviour.
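
To illustrate what that segregation could look like on the controller side, here's a hypothetical sketch (not Orchard's actual code; the port numbers and wiring are made up) where the REST API and the gRPC server each get a dedicated listener:

package main

import (
	"log"
	"net"
	"net/http"

	"github.com/gin-gonic/gin"
	"google.golang.org/grpc"
)

func main() {
	restAddr := ":6120" // REST (gin) listener
	grpcAddr := ":6121" // hypothetical dedicated gRPC listener

	// Serve the REST API on its own port.
	router := gin.Default()
	router.GET("/healthz", func(c *gin.Context) { c.Status(http.StatusOK) })
	go func() {
		log.Fatal(http.ListenAndServe(restAddr, router))
	}()

	// Serve gRPC on a separate port, so the ingress can proxy it
	// directly without any content-type sniffing.
	lis, err := net.Listen("tcp", grpcAddr)
	if err != nil {
		log.Fatal(err)
	}
	grpcServer := grpc.NewServer()
	// controller service registration elided
	log.Fatal(grpcServer.Serve(lis))
}

Each port could then be exposed through its own Ingress, with GRPCS for the gRPC port and plain HTTPS for the REST port.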

¹ The Orchard controller exposes a gin-gonic REST API and a gRPC server on the same port, and then uses the Content-Type header to direct traffic to one or the other.
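
That kind of single-port dispatch is a common pattern in Go servers; a minimal sketch of the general technique (not Orchard's exact code) looks like this:

package server

import (
	"net/http"
	"strings"

	"github.com/gin-gonic/gin"
	"google.golang.org/grpc"
)

// mixedHandler routes HTTP/2 requests with a gRPC content-type to the
// gRPC server and everything else to the REST router. Note that this
// only works when the handler itself is served over HTTP/2 (TLS or h2c).
func mixedHandler(grpcServer *grpc.Server, rest *gin.Engine) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
			grpcServer.ServeHTTP(w, r)
			return
		}
		rest.ServeHTTP(w, r)
	})
}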

edigaryev commented 1 year ago

Hi @ruimarinho 👋

Would it be possible for you to provide a minimal reproducible Kubernetes configuration (YAMLs) that triggers the 502s?

I really think that this would be a much better way of solving your issue, because it'll reveal the root cause in the nginx configuration (or even in the code).

With regards to adding an option for gRPC to listen on a separate port, I don't entirely understand how it would help in your case: if upstream {} is global for all upstream connections, it wouldn't matter whether the Orchard Controller exposes two ports as upstream connections, because the global upstream {} settings would still apply to both of them. Or am I missing something here?