argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

HorizontalPodAutoscaler causes degraded status #6287

Open lindlof opened 3 years ago

lindlof commented 3 years ago

Checklist:

Describe the bug

To scale up, the HorizontalPodAutoscaler increases the replica count of a Deployment. That seems to cause ArgoCD to consider the service degraded, because immediately after the increase the number of running replicas is lower than what is specified in the Deployment. The status recovers to healthy once the Deployment has started the desired number of replicas.

The status shouldn't be considered degraded, because the application is working exactly as intended and is scaling up using standard Kubernetes practices.

We receive notifications when the status is degraded, so we're constantly getting notifications whenever the deployment is scaled up.

To Reproduce

  1. Add a Deployment and a HorizontalPodAutoscaler (a minimal example manifest is sketched below)
  2. Send traffic to scale up the Deployment
  3. After autoscaling, the status of the application becomes degraded
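
For reference, a manifest pair roughly like the following should be enough to reproduce the behaviour once the Deployment receives load (a minimal sketch; the name, image, and thresholds are illustrative, not taken from this report):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                  # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.21       # any container that can receive traffic
          resources:
            requests:
              cpu: 100m           # CPU requests are needed for utilization-based scaling
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50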

Expected behavior

The status shouldn't be considered degraded. Instead, it could stay healthy or be something less severe than degraded.

We expect to get notified when the status truly degrades and not during normal HorizontalPodAutoscaler operations.

Version

{
    "Version": "v1.9.0+98bec61",
    "BuildDate": "2021-01-08T07:46:29Z",
    "GitCommit": "98bec61d6154a1baac54812e5816c0d4bbc79c05",
    "GitTreeState": "clean",
    "GoVersion": "go1.14.12",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KsonnetVersion": "v0.13.1",
    "KustomizeVersion": "v3.8.1 2020-07-16T00:58:46Z",
    "HelmVersion": "v3.4.1+gc4e7485",
    "KubectlVersion": "v1.17.8",
    "JsonnetVersion": "v0.17.0"
}
zezaeoh commented 3 years ago

After upgrading the version from 1.x to 2.0, I had the same issue. 👀

pvlltvk commented 3 years ago

We have the same issue (after upgrading to 2.0.x), but during the deployment's rollout. It seems like maxSurge: 2 in rollingUpdate also causes the degraded status.
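
For context, that setting lives in the Deployment's update strategy; a minimal fragment (only maxSurge is taken from the comment above, the rest is illustrative):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2   # up to two extra pods may be created during a rollout, briefly exceeding spec.replicas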

juris commented 3 years ago

Having the same issue with a degraded HPA. It looks like it happens because the HPA does not have enough metrics during a rollout. Potentially it can be mitigated by increasing the HPA CPU initialization period (--horizontal-pod-autoscaler-cpu-initialization-period). In my case that is not an option, as EKS does not support changing it yet.
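
For clusters where you do control the control plane, the flag is set on kube-controller-manager; a rough sketch of the relevant fragment of its static pod manifest (the path and surrounding flags are assumptions, and 5m is the upstream default that would need to be raised):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (fragment)
spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        - --horizontal-pod-autoscaler-cpu-initialization-period=5m   # increase to cover slow-starting pods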

mmckane commented 3 years ago

Seeing this as well. This is causing our pipelines to fail: we validate application health as a step using argocd app wait --health, but there is a 30-second to 1-minute period after pushing a new version of a deployment during which ArgoCD marks the HPA as degraded. That causes argocd app wait --health to exit with an error code, failing our pipeline.

Because we are using managed cloud clusters, we can't change the --horizontal-pod-autoscaler-cpu-initialization-period flag on the kube-controller-manager. It would be nice if there were a way around this from an ArgoCD standpoint other than writing a custom health check that always marks HPAs as healthy.

FYI, for anyone looking for a workaround to stop the degraded status from appearing at all, here is the health check we are using.

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations: | 
    autoscaling/HorizontalPodAutoscaler:
      health.lua: |
        hs = {}
        hs.status = "Healthy"
        hs.message = "Ignoring HPA Health Check"
        return hs
pentago commented 2 years ago

While the approach above works, it's more of a workaround than a solution. We should solve this on the argocd-notifications side of things somehow.
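
For anyone exploring that route, the notification is driven by a trigger in argocd-notifications-cm, and its when expression is the place to tighten; a sketch of roughly what the stock degraded trigger looks like (the trigger and template names are assumptions based on the notifications catalog, not taken from this thread):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
data:
  trigger.on-health-degraded: |
    - description: Application has degraded
      send: [app-health-degraded]
      when: app.status.health.status == 'Degraded'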

artem-kosenko commented 2 years ago

I was playing around a little and found this solution. Please feel free to use it and leave feedback about your experience.

How it works:

  1. At the very beginning of the application deployment/update it checks for the HPA and deletes it if it exists. (Here we need to run kubectl inside a K8s Job, so we also have to create a ServiceAccount and Role that allow getting and deleting the HPA resource of this specific app.) All of these run on sync-wave = -10/-5 (make sure you use the correct kubectl version for your K8s cluster version).
  2. Then the regular Deployment is applied on the default sync-wave = 0.
  3. Then one more K8s Job runs with a sleep inside, just to wait until the metrics server has metrics for the newly deployed ReplicaSet (sleep 120 is enough): PostSync, sync-wave = 0.
  4. Then the HPA is deployed on PostSync, sync-wave = 5.

I added hook-delete-policy: HookSucceeded to all workaround parts so that they are deleted at the very end. Only the HPA, which is deployed in PostSync, is left in place.

# templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
---
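# PreSync RBAC: ServiceAccount, Role, and RoleBinding that allow the delete Job below to get and delete this app's HPA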
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
rules:
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    resourceNames: ["{{ include "app.fullname" . }}"]
    verbs: ["get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
subjects:
  - kind: ServiceAccount
    name: {{ include "app.fullname" . }}-hpa-delete
    namespace: {{ .Release.Namespace }}  # namespace is required for ServiceAccount subjects
roleRef:
  kind: Role
  name: {{ include "app.fullname" . }}-hpa-delete
  apiGroup: rbac.authorization.k8s.io
---
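# PreSync Job (wave -5): deletes the existing HPA, if any, so it cannot scale the Deployment mid-rollout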
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-5"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: {{ include "app.fullname" . }}-hpa-delete
      restartPolicy: Never
      containers:
        - name: {{ include "app.fullname" . }}-hpa-delete
          image: public.ecr.aws/bitnami/kubectl:1.20
          imagePullPolicy: IfNotPresent
          env:
            - name: NS
              value: {{ .Release.Namespace }}
            - name: APP
              value: {{ include "app.fullname" . }}
          command:
            - /bin/bash
            - -c
            - |-
              echo -e "[INFO]\tTrying to delete HPA ${APP} in namespace ${NS}..."
              echo

              RESULT=`kubectl get hpa ${APP} -n ${NS} 2>&1`

              if [[ $RESULT =~ "Deployment/${APP}" ]]; then
                kubectl delete hpa ${APP} -n ${NS}
                echo
                echo -e "[OK]\tContinue deployment..."
                exit 0
              elif [[ $RESULT =~ "\"${APP}\" not found" ]]; then
                echo "${RESULT}"
                echo
                echo -e "[OK]\tContinue deployment..."
                exit 0
              else
                echo "${RESULT}"
                echo
                echo -e "[ERROR]\tUnexpected error. Check the log above!"
                exit 1
              fi

---
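# PostSync Job: sleeps so the metrics server has time to collect metrics for the new ReplicaSet before the HPA returns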
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-hpa-wait
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "0"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: {{ include "app.fullname" . }}-hpa-wait
          image: public.ecr.aws/docker/library/alpine:3.15.0
          imagePullPolicy: IfNotPresent
          command: ["sh", "-c", "sleep 120"]
---
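# The HPA itself, re-created as a PostSync hook once the wait Job has succeeded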
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "5"
  name: {{ include "app.fullname" . }}
  labels:
    {{- include "app.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "app.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.cpuAverageUtilization }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.cpuAverageUtilization }}
    {{- end }}
    {{- if .Values.autoscaling.memoryAverageUtilization }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.memoryAverageUtilization }}
    {{- end }}
{{- end }}
noam-allcloud commented 2 years ago

Any new suggestions here?

prein commented 2 years ago

@mubarak-j shared a more sophisticated healthcheck workaround in the comment here, pasting below:

    resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
      hs = {}
      if obj.status ~= nil then
        if obj.status.conditions ~= nil then
          for i, condition in ipairs(obj.status.conditions) do
            if condition.type == "ScalingActive" and condition.reason == "FailedGetResourceMetric" then
                hs.status = "Progressing"
                hs.message = condition.message
                return hs
            end
            if condition.status == "True" then
                hs.status = "Healthy"
                hs.message = condition.message
                return hs
            end
          end
        end
        hs.status = "Healthy"
        return hs
      end
      hs.status = "Progressing"
      return hs

I'm new to custom health checks. Which of these is correct?

  resource.customizations: | 
    autoscaling/HorizontalPodAutoscaler:
      health.lua: |

or

  resource.customizations: |
     health.autoscaling_HorizontalPodAutoscaler: |

The above question is also discussed in https://github.com/argoproj/argo-cd/issues/6175

mubarak-j commented 2 years ago

The new format, as shown in the argocd docs examples, was introduced in ArgoCD v1.2.0 and is explained in the release blog post here.

So unless you're running an older version of argocd, you will need to use the new format.
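
To make that concrete, with the new format the health check pasted above goes directly under its own key in argocd-cm; a condensed sketch (the namespace and label assume a default ArgoCD install):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd                    # assumes the default install namespace
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
    hs = {}
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for i, condition in ipairs(obj.status.conditions) do
        if condition.type == "ScalingActive" and condition.reason == "FailedGetResourceMetric" then
          hs.status = "Progressing"
          hs.message = condition.message
          return hs
        end
      end
    end
    hs.status = "Healthy"
    return hs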

prein commented 2 years ago

@mubarak-j thanks for answering! Looking into the blog post, I'm not sure what "In the upcoming release, the resource.customizations key has been deprecated in favor of a separate ConfigMap key per resource" means.

I think I found a different issue in my setup. I'm managing ArgoCD with the Helm chart, and what I came up with in my values.yaml, based on outdated documentation, was

argo-cd:
  server:
    config:
      resourceCustomizations: |
        health.autoscaling_HorizontalPodAutoscaler: |
          hs = {}
          [...]

which, I guess, was ignored. I thought there was some translation between the Helm values and the ConfigMap, whereas I could simply do:

argo-cd:
  server:
    config:
      resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
          hs = {}
          [...]

Let's see if it works

BTW, it would be great if there were a way to list/show resource customizations.

mubarak-j commented 2 years ago

You can find argocd built-in resource customizations here: https://github.com/argoproj/argo-cd/tree/master/resource_customizations

chris-ng-scmp commented 2 years ago

This is a comprehensive custom health check for the HPA.

I also added a condition to make sure the apiVersion is not autoscaling/v1, as v1 only exposes the status conditions via an annotation.

    resource.customizations.useOpenLibs.autoscaling_HorizontalPodAutoscaler: "true"
    resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
      hs = {}
      hsScalingActive = {}
      if obj.apiVersion == 'autoscaling/v1' then
          hs.status = "Degraded"
          hs.message = "Please upgrade the apiVersion to the latest."
          return hs
      end
      if obj.status ~= nil then
        if obj.status.conditions ~= nil then
          for i, condition in ipairs(obj.status.conditions) do
            if condition.status == "False" and condition.type ~= 'ScalingActive' then
                hs.status = "Degraded"
                hs.message = condition.message
                return hs
            end
            -- condition.status is the string "True"/"False", so compare it explicitly instead of relying on truthiness
            if condition.type == "ScalingActive" and condition.reason == "FailedGetResourceMetric" and condition.status == "False" then
                if string.find(condition.message, "missing request for") then
                  hs.status = "Degraded"
                  hs.message = condition.message
                  return hs
                end
                hsScalingActive.status = "Progressing"
                hsScalingActive.message = condition.message
            end
          end
          if hs.status ~= nil then
            return hs
          end
          if hsScalingActive.status ~= nil then
            return hsScalingActive
          end
          hs.status = "Healthy"
          return hs
        end
      end
      hs.status = "Progressing"
      return hs
zdraganov commented 2 years ago

Does anyone have an idea for a workaround on Koncrete (https://www.koncrete.dev/) hosted ArgoCD? We do not have access to the K8s API, so we have no way to apply those customizations.