Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Retina agent degrades network throughput on nodes to 30% of original speed #4508

Open grzesuav opened 2 months ago

grzesuav commented 2 months ago

Describe the bug After upgrading the control plane, we noticed degraded network throughput on nodes in some clusters. Some network-intensive pods were starved of bandwidth. After some experiments, we traced it to retina-agent suddenly being installed on our clusters.

To Reproduce

Expected behavior A clear and concise description of what you expected to happen.

Screenshots Lower

Environment (please complete the following information):

Additional context Pod network throughput: at the drop, retina-agent was added to the nodes; after its removal, network throughput was restored to its original speed.


Currently we cannot remove just retina-agent in an official AKS way (or I am not aware of how to do it). I am unsure why it was installed in the first place after upgrading the clusters to 1.29.

As a workaround we use an admission webhook to add a non-existent node selector, which prevents the pods from being scheduled, but I want to :

grzesuav commented 2 months ago

Original manifest from the cluster is below; our mutating webhook adds

              - key: non-existing-key
                operator: Exists

to prevent scheduling.
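
The workaround relies on the expressions inside one nodeSelectorTerm being ANDed: an `Exists` check on a label that no node carries makes the whole term unsatisfiable, so the DaemonSet schedules zero pods. As a rough sketch only (the actual webhook is not shown in this thread), a mutating webhook could return a JSON Patch like the one built below. The function names and the choice to patch term index 0 are assumptions for illustration; the response shape follows the `admission.k8s.io/v1` AdmissionReview format.

```python
import base64
import json

# JSON Pointer to the matchExpressions list of the first nodeSelectorTerm in
# the DaemonSet's pod template. The trailing "-" appends to the array
# (RFC 6902). Index 0 is an assumption: the manifest has a single term.
MATCH_EXPRESSIONS_PATH = (
    "/spec/template/spec/affinity/nodeAffinity"
    "/requiredDuringSchedulingIgnoredDuringExecution"
    "/nodeSelectorTerms/0/matchExpressions/-"
)


def build_patch():
    """Return a JSON Patch that appends the unsatisfiable expression."""
    return [
        {
            "op": "add",
            "path": MATCH_EXPRESSIONS_PATH,
            "value": {"key": "non-existing-key", "operator": "Exists"},
        }
    ]


def build_admission_response(request_uid):
    """Wrap the patch in an AdmissionReview response (hypothetical helper)."""
    patch_bytes = json.dumps(build_patch()).encode("utf-8")
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request_uid,  # must echo the uid of the incoming request
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(patch_bytes).decode("ascii"),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_admission_response("example-uid"), indent=2))
```

Patching through a webhook rather than editing the DaemonSet directly matters here because the object is managed by AKS, so a manual edit would likely be reverted on the next reconcile.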

❯ k get daemonsets.apps -n kube-system retina-agent -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: aks-managed-kappie
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-08-24T14:13:08Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: kappie-adapter-helmrelease
    helm.toolkit.fluxcd.io/namespace: 64f7349df0994400019581c9
    k8s-app: retina
    kubernetes.azure.com/managedby: aks
  name: retina-agent
  namespace: kube-system
  resourceVersion: "839996452"
  uid: 1b9ce971-598a-4d23-b43f-9f2db17b8036
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: retina
  template:
    metadata:
      annotations:
        prometheus.io/port: "10093"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        k8s-app: retina
        kubernetes.azure.com/managedby: aks
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.azure.com/cluster
                operator: Exists
              - key: kubernetes.azure.com/os-sku
                operator: NotIn
                values:
                - CBLMariner
              - key: kubernetes.azure.com/ebpf-dataplane
                operator: NotIn
                values:
                - cilium
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
              - key: non-existing-key
                operator: Exists
      containers:
      - args:
        - --health-probe-bind-address=:18081
        - --metrics-bind-address=:18080
        - --config
        - /kappie/config/config.yaml
        command:
        - /kappie/controller
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: NODE_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: mcr.microsoft.com/containernetworking/kappie-agent:v0.1.4
        imagePullPolicy: IfNotPresent
        name: retina
        ports:
        - containerPort: 10093
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /metrics
            port: 10093
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - SYS_RESOURCE
            - NET_ADMIN
            - NET_RAW
            - IPC_LOCK
            drop:
            - ALL
          privileged: false
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /sys/kernel/debug
          name: debug
        - mountPath: /sys/kernel/tracing
          name: trace
        - mountPath: /sys/fs/bpf
          name: bpf
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /tmp
          name: tmp
        - mountPath: /kappie/config
          name: config
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - image: mcr.microsoft.com/containernetworking/kappie-init:v0.1.4
        imagePullPolicy: IfNotPresent
        name: init-retina
        resources: {}
        securityContext:
          capabilities:
            drop:
            - ALL
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /sys/fs/bpf
          mountPropagation: Bidirectional
          name: bpf
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: retina-agent
      serviceAccountName: retina-agent
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /sys/kernel/debug
          type: ""
        name: debug
      - hostPath:
          path: /sys/kernel/tracing
          type: ""
        name: trace
      - hostPath:
          path: /sys/fs/bpf
          type: ""
        name: bpf
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - emptyDir: {}
        name: tmp
      - configMap:
          defaultMode: 420
          name: retina-config
        name: config
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 0
  desiredNumberScheduled: 0
  numberMisscheduled: 0
  numberReady: 0
  observedGeneration: 1

snguyen64 commented 2 months ago

Hi @grzesuav, does this cluster have ama-metrics enabled? https://learn.microsoft.com/en-us/azure/aks/network-observability-managed-cli?tabs=newer-k8s-versions#create-cluster

Retina-agent will be installed on clusters with ama-metrics and k8s versions >= 1.29
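
To make that rule concrete, here is a minimal sketch of the condition as stated in this comment only. The function names are hypothetical and this is not how AKS decides internally; it just encodes "ama-metrics enabled and Kubernetes version >= 1.29".

```python
def parse_major_minor(version):
    """Parse '1.29.7' or 'v1.29' into a (major, minor) tuple of ints."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def retina_agent_expected(ama_metrics_enabled, k8s_version):
    """Return True if, per the rule quoted above, retina-agent would be installed."""
    # Compare as integer tuples: comparing strings would rank "1.9" above "1.29".
    return bool(ama_metrics_enabled) and parse_major_minor(k8s_version) >= (1, 29)


if __name__ == "__main__":
    for enabled, version in [(True, "1.28.9"), (True, "1.29.7"), (False, "1.30.0")]:
        print(enabled, version, retina_agent_expected(enabled, version))
```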

If you have the aks-preview CLI extension, you can disable ama-metrics with az aks update --disable-azure-monitor-metrics --name <cluster-name> --resource-group <resource-group>

Other documentation on monitoring can be found here too: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/kubernetes-monitoring-enable?tabs=cli

grzesuav commented 2 months ago

Hi, I want to keep control plane metrics. Is there an option to remove only Retina?

Also, since it degrades node networking, why is it enabled automatically?

snguyen64 commented 2 months ago

Since Retina is now bundled with ama-metrics for k8s 1.29, for now the only way to disable retina-agent is to disable monitoring. The Retina team is currently investigating any perf issues related to this.

vakalapa commented 1 month ago

We have tried to repro this internally. In multiple tests we were able to repro a ~20% drop for INTRA-node traffic (between pods running on the same node) and an almost negligible difference for INTER-node traffic (between pods running on different nodes).

In OSS Retina we are working on a performance pipeline where the tests can be public: https://github.com/microsoft/retina/issues/655
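
For anyone comparing their own numbers against these figures, the sketch below shows one way to turn two throughput measurements into a percentage drop. It assumes the measurements come from `iperf3 --json` TCP runs, which this thread does not confirm as the tool used, and the helper names and sample values are hypothetical.

```python
import json


def received_bps(iperf3_json):
    """Extract receiver-side throughput from `iperf3 --json` output (TCP test)."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"]


def percent_drop(baseline_bps, measured_bps):
    """Throughput lost relative to the baseline, as a percentage."""
    if baseline_bps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline_bps - measured_bps) / baseline_bps * 100.0


if __name__ == "__main__":
    # Made-up sample values, not measurements from this issue.
    without_agent = json.dumps({"end": {"sum_received": {"bits_per_second": 10e9}}})
    with_agent = json.dumps({"end": {"sum_received": {"bits_per_second": 8e9}}})
    drop = percent_drop(received_bps(without_agent), received_bps(with_agent))
    print(f"throughput drop: {drop:.1f}%")
```

Note the difference in framing: throughput at 30% of the original speed, as in the issue title, corresponds to a 70% drop under this definition, whereas the internal repro saw roughly a 20% drop for intra-node traffic.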