envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

envoy not load balancing grpc connections effectively #35083

Closed apatwal-suki closed 2 weeks ago

apatwal-suki commented 1 month ago

Title: envoy not load balancing grpc connections effectively

Description:

What issue is being seen? In Kubernetes, for a GPU-based application (ms-dummy-asr-v2-ambient) with multiple pods fronted by Envoy (envoy-ms-dummy-asr-v2-ambient), I see that some requests fail and never reach any of the available pods.

Repro steps:

Envoy config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-ms-dummy-asr-v2-ambient-conf
data:
  envoy.yaml: |
    node:
      id: "nvidia"
      cluster: "dummy-dev"
    static_resources:
      listeners:
      - address:
          socket_address:
            address: 0.0.0.0
            port_value: 443
        connection_balance_config:
          exact_balance: {}
        filter_chains:
        - filters:
          - name: envoy.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              access_log:
              - name: envoy.file_access_log
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                  path: "/dev/stdout"
              codec_type: AUTO
              stat_prefix: ingress_https
              route_config:
                name: local_route
                virtual_hosts:
                - name: https
                  domains:
                  - "*"
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      cluster: ms-dummy-asr-v2-ambient
                      timeout: 3600s
                    response_headers_to_add:
                      - header:
                          key: "x-content-type-options"
                          value: "nosniff"
                        append: true
                      - header:
                          key: "x-frame-options"
                          value: "deny"
                        append: true
                      - header:
                          key: "x-xss-protection"
                          value: "1; mode=block"
                        append: true
                      - header:
                          key: "strict-transport-security"
                          value: "max-age=63072000; includeSubDomains; preload;"
                        append: true
                      - header:
                          key: "content-security-policy"
                          value: "default-src 'none'; frame-ancestors 'none';"
                        append: true
              http_filters:
              - name: envoy.health_check
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                  pass_through_mode: false
                  headers:
                  - name: ":path"
                    exact_match: "/healthz"
                  - name: "x-envoy-livenessprobe"
                    exact_match: "healthz"
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain:
                      filename: "/etc/ssl/envoy/tls.crt"
                    private_key:
                      filename: "/etc/ssl/envoy/tls.key"
                alpn_protocols: ["h2", "http/1.1"]
      clusters:
      - name: ms-dummy-asr-v2-ambient
        connect_timeout: 0.5s
        type: STRICT_DNS
        dns_lookup_family: V4_ONLY
        lb_policy: ROUND_ROBIN
        common_http_protocol_options: {"idle_timeout": 10s}
        http2_protocol_options: {}
        load_assignment:
          cluster_name: ms-dummy-asr-v2-ambient
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: ms-dummy-asr-v2-ambient.microservices.svc.cluster.local
                    port_value: 10001
    admin:
      access_log_path: "/dev/stdout"
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8090

Envoy deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-ms-dummy-asr-v2-ambient
  namespace: microservices
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: envoy-ms-dummy-asr-v2-ambient
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: envoy-ms-dummy-asr-v2-ambient
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - envoy-ms-dummy-asr-v2-ambient
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: ENVOY_UID
          value: "0"
        image: envoyproxy/envoy:tools-v1.29.7
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            httpHeaders:
            - name: x-envoy-livenessprobe
              value: healthz
            path: /healthz
            port: 443
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: envoy-ms-dummy-asr-v2-ambient
        ports:
        - containerPort: 443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            httpHeaders:
            - name: x-envoy-livenessprobe
              value: healthz
            path: /healthz
            port: 443
            scheme: HTTPS
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: "7"
            memory: 2Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - any
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/envoy
          name: config
        - mountPath: /etc/ssl/envoy
          name: certs
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: certs
        secret:
          defaultMode: 420
          secretName: letsencrypt-dummy-dev-tls-digicert
      - configMap:
          defaultMode: 420
          name: envoy-ms-dummy-asr-v2-ambient-conf
        name: config

With the above config, during a load test, even with 20 pods of ms-dummy-asr-v2-ambient (which should support 40 concurrent requests, since our application has a hard limit of 2 connections per pod), a large and variable number of requests fail with connection errors. Increasing the CPU or the number of Envoy pods does reduce the failures, but not in any deterministic pattern: for instance, with 1 Envoy pod around 26 requests succeed, while with 5 Envoy pods around 37 succeed.

From this observation, it seems related to load balancing among the worker threads. As you can see from the config above, I do have the exact_balance config in place. What am I missing here?

htuch commented 1 month ago

@yanavlasov

htuch commented 1 month ago

I don't think exact balance helps here, as it applies on the listener side, to client connections. Each client connection has its own independent request streams on the worker it lands on, spread across multiple connections to the backend. The more pods you have, the more independent instances of Envoy you have. I'm not sure you can easily guarantee these hard limits on a per-backend basis this way - I think you might have better luck with a more distributed approach, e.g. retries + outlier detection, custom load balancing that is aware of backend capacity directly, etc.
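For readers following along, a minimal sketch of the retries-plus-outlier-detection approach mentioned above, using the v3 route `retry_policy` and cluster `outlier_detection` fields. The threshold values are illustrative, not recommendations from this thread:

```yaml
# Illustrative sketch only: retry failed/refused requests at the route level,
# and eject persistently failing hosts at the cluster level.
route_config:
  virtual_hosts:
  - name: https
    domains: ["*"]
    routes:
    - match: { prefix: "/" }
      route:
        cluster: ms-dummy-asr-v2-ambient
        retry_policy:
          # Retry on connect failures, refused streams, and gRPC UNAVAILABLE.
          retry_on: "connect-failure,refused-stream,unavailable"
          num_retries: 3
clusters:
- name: ms-dummy-asr-v2-ambient
  outlier_detection:
    consecutive_5xx: 3        # eject a host after 3 consecutive 5xx responses
    base_ejection_time: 30s   # how long an ejected host stays out of rotation
    max_ejection_percent: 50  # never eject more than half the hosts
```

The idea is that a request landing on a pod that is already at its 2-connection limit gets retried against another host, and hosts that keep refusing get temporarily ejected from the load-balancing set.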

bburli commented 1 month ago

@htuch Thank you for the reply. I work with @apatwal-suki (OP).

If we use gRPC health checks and mark an upstream pod as "unhealthy" (by hosting a gRPC health check endpoint and reporting the pod unhealthy once it has 2 connections), does that force Envoy to route traffic to a different pod on the next incoming connection?
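For reference, active gRPC health checking is configured per cluster; note that Envoy's `grpc_health_check` follows the standard grpc.health.v1 protocol, so the upstream signals unhealthiness by answering `Health/Check` with `NOT_SERVING` rather than with a non-200 HTTP status. A minimal sketch (interval and threshold values are illustrative only):

```yaml
# Illustrative sketch: active gRPC health checking on the upstream cluster.
clusters:
- name: ms-dummy-asr-v2-ambient
  health_checks:
  - timeout: 1s
    interval: 5s            # how often each host is probed
    unhealthy_threshold: 2  # consecutive failures before marking unhealthy
    healthy_threshold: 2    # consecutive successes before marking healthy
    grpc_health_check:
      service_name: ""      # optional grpc.health.v1 service name to probe
```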

htuch commented 1 month ago

It will send to a different host (unless in panic mode), but I don't believe we do inline checks with every request, so the health check can be out of sync and you might see failures during periods of inconsistency.

Out of curiosity, if you have exactly 1 Envoy pod and 1 worker thread, with a round-robin policy, are you seeing the same behavior?

You might also want to share your /clusters and /stats dumps from the admin service to see what is going on.

bburli commented 1 month ago

> It will send to a different host (unless in panic mode), but I don't believe we do inline checks with every request, so the health check will be out of sync potentially and you might see failures here during periods of inconsistency.
>
> Out of curiosity, if you have exactly 1 Envoy pod and 1 worker thread, with a round-robin policy, are you seeing the same behavior?
>
> You might also want to share your /clusters and /stats dumps from the admin service to see what is going on.

We'll have to try that. We tried with an Envoy pod that had 8 cores, so 8 worker threads were spawned, and we saw the same behaviour: requests started failing beyond a certain count. The exact configuration we tried was 40 requests; I think 35-37 of them succeeded and the others failed (as if Envoy rejects them - they never land on the upstream service).

cc @apatwal-suki to add more here.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

bburli commented 4 days ago

FWIW, this turned out to be a configuration mistake: we hadn't made the upstream Service "headless" while using GKE. Once we did, the Kubernetes network proxy (kube-proxy) stopped interfering and Envoy's load balancing worked properly. After that, selecting https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/load_balancers#weighted-least-request did the trick for us.
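For anyone hitting the same thing: a headless Service (`clusterIP: None`) makes cluster DNS return one A record per ready pod, so Envoy's STRICT_DNS cluster sees every endpoint directly instead of a single ClusterIP that kube-proxy load-balances behind Envoy's back. A minimal sketch, with the Service name matching the cluster config above (the pod label is an assumption):

```yaml
# Illustrative sketch: headless Service so STRICT_DNS resolves per-pod IPs.
apiVersion: v1
kind: Service
metadata:
  name: ms-dummy-asr-v2-ambient
  namespace: microservices
spec:
  clusterIP: None        # headless: DNS returns one A record per ready pod
  selector:
    app: ms-dummy-asr-v2-ambient   # assumed pod label
  ports:
  - port: 10001
    targetPort: 10001
```

With per-pod endpoints visible, switching the cluster's `lb_policy` from `ROUND_ROBIN` to `LEAST_REQUEST` gives the weighted-least-request behavior linked above.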

Just in case it helps anyone else.

cc @htuch @apatwal-suki