Closed by diranged 1 year ago
Can anyone in the Envoy team take a look at this issue?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Not stale. Can we get any eyes on this?
This issue sounds like a duplicate of https://github.com/envoyproxy/envoy/issues/29008. @diranged, could you check whether disabling locality weighted load balancing helps in your case?
@nezdolik We don't have locality weighted load balancing turned on in this set of tests. Our Istio environment has the `localityLbSetting` turned on, but we don't have outlier detection enabled for these tests, which should mean that setting is ignored. Furthermore, in our case we aren't using any slow start config. Do you still think the issue is the same, or related?
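For concreteness, this is roughly the shape of the per-service override we could add to rule locality weighting out entirely while testing. It's a sketch only: the name, namespace, and host below are placeholders, not our actual config.

```yaml
# Sketch only: placeholder names/host, not our production config.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-locality-off
  namespace: test
spec:
  host: httpbin.test.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
      # Explicitly disable locality weighted load balancing for this host;
      # this overrides any mesh-wide localityLbSetting for this destination.
      localityLbSetting:
        enabled: false
```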
Apologies, I was skimming through all open issues with the keywords "traffic imbalance + slow start" to post an update about a potential root cause; I should have read the report more carefully.
No worries - I definitely still think there's an interesting bug here to track down.
@diranged we also don't have outlier settings turned on, but the locality still affects the load balancing, so it might be related.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Note: I am moving https://github.com/istio/istio/issues/45013 into the Envoy project per the recommendation from the Istio team. This is a copy of the original bug I opened with Istio a while back.
Bug Description
We've been working to migrate a new workload into our Kubernetes cluster, which runs an Istio mesh. The mesh has been operating fine for most workloads at good scale (500k+ QPS across the mesh). The team migrating their workload over, though, noticed strange behavior, and we've been able to replicate it with a synthetic test.
The behavior they see is that during backend pod changes (a scale up, a rollout, or a scale down), the backing pods suddenly develop a significant traffic imbalance. One pod takes "most" of the traffic for 10-30s while the other pods' traffic rates drop significantly.
Graphing out the requests using Loki (following the `istio-proxy` access logs), we can see the behavior. In this example we are adding a new pod (the green line), but we see a massive spike on the red line and a drop in traffic to the blue line.
We spent weeks troubleshooting this before we set out to replicate it synthetically. We have a number of other high-volume workloads that do not exhibit this behavior, which led us to spend that time troubleshooting the application rather than the mesh.
The Test Setup
Service: running `httpbin` across ~3-5 pods
Client(s):
- 1000 pods running `curl` once every second
- 10 pods running `h2load` as fast as they can (generating ~4000-5000 QPS)

(Rough manifests for these client pods are sketched after Test 1 below.)

Test 1: Fast Requests, Scale down from 4 -> 3 backends
The early tests we were running were as fast as possible... hitting the `httpbin` endpoint `/status/200`, which returns immediately. We ran many scale-up and scale-down tests. Every time, the pattern looked good (we ran dozens of tests; they all looked roughly like this):
3 -> 4 pods: Here we can see the new pod come up, and the traffic transition is pretty graceful. There's a bit of a change in the total traffic volume because we were intentionally overloading the backend pods for the test.
4 -> 3 pods: This again shows a pretty clean transition.
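As referenced in the setup above, here is a rough sketch of the curl client Deployment. The image, names, namespace, and URL are illustrative placeholders rather than the exact manifests we ran; the point is just that each client loops a simple request through its istio-proxy sidecar.

```yaml
# Rough sketch of the curl clients; names, image, and URL are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: curl-client
  namespace: test
spec:
  replicas: 1000                          # the "1000 pods running curl every second"
  selector:
    matchLabels:
      app: curl-client
  template:
    metadata:
      labels:
        app: curl-client
        sidecar.istio.io/inject: "true"   # requests must flow through the istio-proxy sidecar
    spec:
      containers:
        - name: curl
          image: curlimages/curl:8.5.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              # One request per second; swap the path (/status/200, /delay/.15, /delay/1)
              # depending on which test is being run.
              while true; do
                curl -s -o /dev/null http://httpbin.test.svc.cluster.local:8000/status/200
                sleep 1
              done
```

The h2load clients are the same idea with ~10 replicas, running `h2load` in a loop against the same URL instead of curl.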
Test 2: Slower Requests
The key for us to reproduce the problem was to use slower requests: the backend application in this case responds in ~150ms on average. So instead of hitting `/status/200`, we are now going to hit `/delay/.15` and run the same tests. It's worth noting that because the requests take longer, the total request volume drops significantly. That's fine; it's more reflective of our actual traffic volume.
3 -> 4 pods: We would expect a similar traffic distribution... but instead we see a significant spike on one backend, with drops on the other backends. Of course, "significant" depends on the application... our backend application actually behaves worse than this graph depicts, but httpbin is a nice fast async service... and the backend service in question is a Python service that is much slower.
5 -> 4 -> 3: When scaling down, we can see a pretty exaggerated effect as well...
5 -> 5: During a pod replacement (one of our backends is replaced), you can see a really dramatic load imbalance:
Test 3: Really Slow (1s) Requests
Just for fun, I ran the same tests against `/delay/1` to test one-second-long requests.
3 -> 4 -> 3: In this window you can see us scale from 3 to 4 and back to 3 pods... The behavior isn't quite the same as in the tests above, but again you can see pretty significant imbalances.
Additional Note: First one wins?
While digging into this, we discovered that almost all of the time, the "first" pod in the client Envoy config list is the one that gets the traffic spike. This implies to us that something is happening in the Envoy client when the list of backing pods changes: the clients "start over" in some way and connect to the first pod in the list. We tried every type of load balancing algorithm, and none of them changed the behavior in any way.
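For anyone trying to reproduce this: in an Istio mesh, swapping the algorithm is a one-line change on the service's DestinationRule, roughly like the sketch below (host and name are placeholders). The spike looked the same to us regardless of which simple policy was selected.

```yaml
# Sketch only: placeholder host/name. Changing `simple` selects the LB algorithm.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-lb
  namespace: test
spec:
  host: httpbin.test.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST   # e.g. LEAST_REQUEST, ROUND_ROBIN, RANDOM; none changed the behavior for us
```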
Version
client.yaml
Affected product area