lindlof opened this issue 3 years ago
After upgrading the version from 1.x to 2.0, I had the same issue. 👀
We have the same issue (after upgrading to 2.0.x), but during the deployment's rollout. It seems like maxSurge: 2 in rollingUpdate also causes the degraded status.
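For illustration, a minimal sketch of the kind of rollout configuration being described (the name, replica count, and image are placeholders, not anything from the reporter's setup):

# Hypothetical Deployment: during a rollout, maxSurge: 2 briefly runs extra pods
# while old ones terminate, the window in which ArgoCD is reported to show Degraded.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2              # up to 2 surge pods during the rollout
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: nginx:1.25    # placeholder image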
Having the same issue with a degraded HPA. It looks like it happens because the HPA does not have enough metrics during a rollout. Potentially, it can be mitigated by increasing the HPA CPU initialization period (--horizontal-pod-autoscaler-cpu-initialization-period). In my case that is not an option, as EKS does not support it yet.
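For anyone on a self-managed control plane, that flag is set on kube-controller-manager; a rough sketch assuming a kubeadm-style static pod manifest (image version and values are placeholders; not applicable to managed control planes such as EKS):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
    - name: kube-controller-manager
      image: registry.k8s.io/kube-controller-manager:v1.27.0   # example version
      command:
        - kube-controller-manager
        # ...your existing flags...
        # Default is 5m; raise it so slow-starting pods don't produce early metric gaps.
        - --horizontal-pod-autoscaler-cpu-initialization-period=5m0s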
Seeing this as well. It is causing our pipelines to fail: we validate application health as a step using argocd app wait --health, but there is roughly a 30-second to 1-minute window after pushing a new version of a deployment during which ArgoCD marks the HPA as degraded. That makes argocd app wait --health exit with an error code and fail our pipeline.
Because we are using cloud clusters, we can't change the --horizontal-pod-autoscaler-cpu-initialization-period flag on the kube controller manager. It would be nice if there were a way around this from an ArgoCD standpoint other than writing a custom health check that always marks HPAs as healthy.
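For context, a minimal sketch of the kind of pipeline gate being described, written here as a generic CI-style step (the app name, timeout value, and CI syntax are placeholders, not the commenter's actual pipeline):

# Hypothetical CI step: fails the pipeline if the app is reported Degraded
# during the transient window described above.
- name: wait-for-app-health
  run: |
    argocd app wait my-app --health --timeout 600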
FYI, for anyone looking for a workaround to stop the degraded status from appearing at all, here is the health check we are using:
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations: |
    autoscaling/HorizontalPodAutoscaler:
      health.lua: |
        hs = {}
        hs.status = "Healthy"
        hs.message = "Ignoring HPA Health Check"
        return hs
While the approach above works, it's a workaround rather than a solution. We should somehow solve this on the argocd-notifications side of things.
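In the meantime, the noise can at least be reduced on the argocd-notifications side; a rough sketch of an argocd-notifications-cm trigger that notifies at most once per synced revision rather than on every health flap (the trigger and template names are hypothetical, and exact field support depends on your argocd-notifications version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
data:
  # Notify when an app becomes Degraded, but at most once per synced revision,
  # so a transient HPA flap during a rollout doesn't spam the channel.
  trigger.on-app-degraded: |
    - when: app.status.health.status == 'Degraded'
      oncePer: app.status.sync.revision
      send: [app-degraded]
  template.app-degraded: |
    message: |
      Application {{.app.metadata.name}} is degraded.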
I was playing around a little and found this solution. Please feel free to use it and leave feedback about your experience.
How it works:
A hook-delete-policy: HookSucceeded is added to all the workaround parts so they are deleted at the very last step. That leaves only the HPA, which is deployed as a PostSync hook at the very end.
# templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
rules:
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    resourceNames: ["{{ include "app.fullname" . }}"]
    verbs: ["get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
subjects:
  - kind: ServiceAccount
    name: {{ include "app.fullname" . }}-hpa-delete
roleRef:
  kind: Role
  name: {{ include "app.fullname" . }}-hpa-delete
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-5"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: {{ include "app.fullname" . }}-hpa-delete
      restartPolicy: Never
      containers:
        - name: {{ include "app.fullname" . }}-hpa-delete
          image: public.ecr.aws/bitnami/kubectl:1.20
          imagePullPolicy: IfNotPresent
          env:
            - name: NS
              value: {{ .Release.Namespace }}
            - name: APP
              value: {{ include "app.fullname" . }}
          command:
            - /bin/bash
            - -c
            - |-
              echo -e "[INFO]\tTrying to delete HPA ${APP} in namespace ${NS}..."
              echo
              RESULT=`kubectl get hpa ${APP} -n ${NS} 2>&1`
              if [[ $RESULT =~ "Deployment/${APP}" ]]; then
                kubectl delete hpa ${APP} -n ${NS}
                echo
                echo -e "[OK]\tContinue deployment..."
                exit 0
              elif [[ $RESULT =~ "\"${APP}\" not found" ]]; then
                echo "${RESULT}"
                echo
                echo -e "[OK]\tContinue deployment..."
                exit 0
              else
                echo "${RESULT}"
                echo
                echo -e "[ERROR]\tUnexpected error. Check the log above!"
                exit 1
              fi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-hpa-wait
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "0"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: {{ include "app.fullname" . }}-hpa-wait
          image: public.ecr.aws/docker/library/alpine:3.15.0
          imagePullPolicy: IfNotPresent
          command: ["sh", "-c", "sleep 120"]
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "5"
  name: {{ include "app.fullname" . }}
  labels:
    {{- include "app.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "app.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.cpuAverageUtilization }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.cpuAverageUtilization }}
    {{- end }}
    {{- if .Values.autoscaling.memoryAverageUtilization }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.memoryAverageUtilization }}
    {{- end }}
{{- end }}
Any new suggestions here?
@mubarak-j shared a more sophisticated health check workaround in the comment here; pasting it below:
resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
  hs = {}
  if obj.status ~= nil then
    if obj.status.conditions ~= nil then
      for i, condition in ipairs(obj.status.conditions) do
        if condition.type == "ScalingActive" and condition.reason == "FailedGetResourceMetric" then
          hs.status = "Progressing"
          hs.message = condition.message
          return hs
        end
        if condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message
          return hs
        end
      end
    end
    hs.status = "Healthy"
    return hs
  end
  hs.status = "Progressing"
  return hs
I'm new to custom health checks. Which of these is correct?

resource.customizations: |
  autoscaling/HorizontalPodAutoscaler:
    health.lua: |

or

resource.customizations: |
  health.autoscaling_HorizontalPodAutoscaler: |

This question is also discussed in https://github.com/argoproj/argo-cd/issues/6175.
The new format, as shown in the argocd docs examples, was introduced in ArgoCD v1.2.0 and explained in the release blog post here.
So unless you're running an older version of argocd, you will need to use the new format.
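Concretely, here is the same trivial customization in both formats; the Lua body is just a placeholder (the full HPA checks are in the comments above and below):

# Old single-key format, nested by resource kind:
data:
  resource.customizations: |
    autoscaling/HorizontalPodAutoscaler:
      health.lua: |
        hs = {}
        hs.status = "Healthy"
        return hs

# New format, one ConfigMap key per resource and per customization type:
data:
  resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
    hs = {}
    hs.status = "Healthy"
    return hs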
@mubarak-j thanks for answering! Looking into the blog post, I'm not sure what "In the upcoming release, the resource.customizations key has been deprecated in favor of a separate ConfigMap key per resource" means.
I think I found a different issue in my setup. I'm managing argocd with the Helm chart, and what I came up with in my values.yaml, based on outdated documentation, was
argo-cd:
  server:
    config:
      resourceCustomizations: |
        health.autoscaling_HorizontalPodAutoscaler: |
          hs = {}
          [...]
which, I guess, was ignored. I thought there was some translation between the Helm values and the ConfigMap, whereas I could simply do:
argo-cd:
  server:
    config:
      resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
        hs = {}
        [...]
Let's see if it works.
BTW, it would be great if there were a way to list/show resource customizations.
You can find argocd built-in resource customizations here: https://github.com/argoproj/argo-cd/tree/master/resource_customizations
This is a comprehensive custom health check for HPA. I also added a condition to make sure the apiVersion is not autoscaling/v1, since v1 only exposes the status conditions in an annotation.
resource.customizations.useOpenLibs.autoscaling_HorizontalPodAutoscaler: "true"
resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
  hs = {}
  hsScalingActive = {}
  if obj.apiVersion == 'autoscaling/v1' then
    hs.status = "Degraded"
    hs.message = "Please upgrade the apiVersion to the latest."
    return hs
  end
  if obj.status ~= nil then
    if obj.status.conditions ~= nil then
      for i, condition in ipairs(obj.status.conditions) do
        if condition.status == "False" and condition.type ~= 'ScalingActive' then
          hs.status = "Degraded"
          hs.message = condition.message
          return hs
        end
        if condition.type == "ScalingActive" and condition.reason == "FailedGetResourceMetric" and condition.status then
          if string.find(condition.message, "missing request for") then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          hsScalingActive.status = "Progressing"
          hsScalingActive.message = condition.message
        end
      end
      if hs.status ~= nil then
        return hs
      end
      if hsScalingActive.status ~= nil then
        return hsScalingActive
      end
      hs.status = "Healthy"
      return hs
    end
  end
  hs.status = "Progressing"
  return hs
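For completeness, these keys go into the data section of the argocd-cm ConfigMap; a sketch (the Lua body here is a placeholder for the full check above, and the namespace assumes a default install):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd                          # default install namespace; adjust if needed
  labels:
    app.kubernetes.io/part-of: argocd        # present on the stock argocd-cm
data:
  resource.customizations.useOpenLibs.autoscaling_HorizontalPodAutoscaler: "true"
  resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
    -- paste the full Lua check from the comment above here
    hs = {}
    hs.status = "Healthy"
    return hs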
Anyone have an idea for a workaround in Koncrete (https://www.koncrete.dev/) hosted ArgoCD? We do not have access to the Kubernetes API, so there is no option for applying those customizations.
Checklist:
argocd version

Describe the bug
To scale up, HorizontalPodAutoscaler increases the replicas of a Deployment. That seems to cause ArgoCD to consider that the service is degraded, as the number of replicas running immediately after the increase will be less than what is specified in the Deployment. The status recovers back to healthy once the Deployment has managed to start the desired number of replicas. The status shouldn't be considered degraded because everything is working exactly as intended and scaling up, using standard Kubernetes practices.
We are receiving notifications when the status is degraded, so we're constantly getting notifications when the deployment is scaled up.

To Reproduce

Expected behavior
The status shouldn't be considered degraded. Instead, it could stay healthy or be something less severe than degraded. We expect to get notified when the status truly degrades, not during normal HorizontalPodAutoscaler operations.

Version