markh123 opened this issue 1 year ago
A similar issue is happening with my setup. It only occurs when a traffic-routing configuration is in place and the HPA uses memory-based autoscaling. I can only avoid it by setting the memory utilization threshold to an unrealistically high number, such as 95% or 99%.
No matter what the average or maximum memory utilization is during the rollout, the HPA reports New size: X; reason: memory resource utilization (percentage of request) above target
and scales both the stable and the canary ReplicaSets to max, but only at the last step of the rollout.
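The only mitigation I have found is raising the memory target on the HPA, roughly like this (the 95 below is the "unrealistically high" value mentioned above, shown only for illustration, not what I actually want to run with):

  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 95   # set far above real usage so rollouts no longer trigger scale-ups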
This is my HPA (managed by Keda)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app.kubernetes.io/managed-by: keda-operator
    app.kubernetes.io/name: keda-hpa-xxx
    app.kubernetes.io/part-of: xxx
    app.kubernetes.io/version: 2.12.1
    scaledobject.keda.sh/name: xxx
  name: keda-hpa-xxx
  namespace: xxxxx
  ownerReferences:
    - apiVersion: keda.sh/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ScaledObject
      name: xxx
spec:
  maxReplicas: 40
  metrics:
    - external:
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
        target:
          averageValue: '1'
          type: AverageValue
      type: External
    - resource:
        name: memory
        target:
          averageUtilization: 85
          type: Utilization
      type: Resource
    - resource:
        name: cpu
        target:
          averageUtilization: 50
          type: Utilization
      type: Resource
  minReplicas: 5
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: xxx
status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: recommended size matches current size
      reason: ReadyForNewScale
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from cpu
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 200m
          value: '0'
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: ...
      type: External
    - resource:
        current:
          averageUtilization: 52
          averageValue: 275596902400m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 49
          averageValue: 249m
        name: cpu
      type: Resource
  currentReplicas: 5
  desiredReplicas: 5
  lastScaleTime: '2024-01-03T10:40:37Z'
The scale-up continues until max replicas; after some time it scales back down and the rollout completes, or I have to promote-full at the last step to avoid scaling to max.
This is what my HPA looks like while it is scaling up at the last step:
Scaling config:
  - resource:
      name: memory
      target:
        averageUtilization: 85
        type: Utilization

status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: >-
        recent recommendations were higher than current one, applying the
        highest recent recommendation
      reason: ScaleDownStabilized
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from memory
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 67m
          value: '0'
        metric:
          name: s1-cron-...
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
      type: External
    - resource:
        current:
          averageUtilization: 46
          averageValue: 244952268800m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 3
          averageValue: 17m
        name: cpu
      type: Resource
  currentReplicas: 15
  desiredReplicas: 15
  lastScaleTime: '2024-01-03T12:33:13Z'
Right after the rollout completes, the HPA starts scaling the ReplicaSet back down.
I have upgraded my Argo Rollouts Helm chart from 2.32.2 to 2.34.0 and the app version from 1.6.2 to 1.6.4; the issue persists.
After deleting all my Gatekeeper policies I still get the same issue.
In the Argo Rollouts logs, this part was a bit confusing to me:
2024-01-03T16:10:10+01:00 time="2024-01-03T15:10:10Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"availableReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:10:08Z\",\"lastUpdateTime\":\"2024-01-03T15:10:08Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:10:10Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"Rollout is healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"True\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" has successfully progressed.\",\"reason\":\"NewReplicaSetAvailable\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":11,\"replicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751073974 rollout=app-name
...
2024-01-03T16:09:47+01:00 time="2024-01-03T15:09:47Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":13}}" generation=1065 namespace=my-namespace resourceVersion=751073492 rollout=app-name
...
2024-01-03T16:09:37+01:00 time="2024-01-03T15:09:37Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":14}}" generation=1065 namespace=my-namespace resourceVersion=751073328 rollout=app-name
...
2024-01-03T16:09:12+01:00 time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":20,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:09:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":20,\"updatedReplicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751072932 rollout=app-name
...
2024-01-03T16:09:12+01:00 time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":19,\"replicas\":19}}" generation=1065 namespace=my-namespace resourceVersion=751072919 rollout=app-name
...
2024-01-03T16:08:42+01:00 time="2024-01-03T15:08:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":18,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":18,\"updatedReplicas\":10}}" generation=1064 namespace=my-namespace resourceVersion=751072501 rollout=app-name
...
2024-01-03T16:08:12+01:00 time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":17,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":17,\"updatedReplicas\":9}}" generation=1063 namespace=my-namespace resourceVersion=751072060 rollout=app-name
...
2024-01-03T16:08:12+01:00 time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":16,\"replicas\":16}}" generation=1063 namespace=my-namespace resourceVersion=751072045 rollout=app-name
...
2024-01-03T16:07:42+01:00 time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":15,\"message\":\"waiting for all steps to complete\",\"replicas\":15,\"updatedReplicas\":8}}" generation=1062 namespace=my-namespace resourceVersion=751071598 rollout=app-name
...
2024-01-03T16:07:42+01:00 time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"replicas\":14}}" generation=1062 namespace=my-namespace resourceVersion=751071572 rollout=app-name
...
2024-01-03T16:07:12+01:00 time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:07:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":13,\"updatedReplicas\":7}}" generation=1061 namespace=my-namespace resourceVersion=751071170 rollout=app-name
...
2024-01-03T16:07:12+01:00 time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":12,\"replicas\":12}}" generation=1061 namespace=my-namespace resourceVersion=751071151 rollout=app-name
...
2024-01-03T16:06:42+01:00 time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:42Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"updated replicas are still becoming available\",\"replicas\":11,\"updatedReplicas\":6}}" generation=1060 namespace=my-namespace resourceVersion=751070699 rollout=app-name
...
2024-01-03T16:06:42+01:00 time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":10,\"replicas\":10}}" generation=1060 namespace=my-namespace resourceVersion=751070688 rollout=app-name
Logs as a whole
I deleted all the OPA Gatekeeper mutations that update the objects, so now we only have validations. KEDA is still in place and the behaviour is the same.
I am facing the same issue with:
v1.6.0
v2.7.2
2.11.2
Normal SuccessfulRescale 15m horizontal-pod-autoscaler New size: 18; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 7m35s horizontal-pod-autoscaler New size: 20; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 91s horizontal-pod-autoscaler New size: 22; reason: memory resource utilization (percentage of request) above target
@markh123 - were you able to figure out any workaround for this issue?
We're having the exact same issue. For a period of time after the Rollout starts, with memory autoscaling configured, the HPA emits events saying New size: XX; reason: memory resource utilization (percentage of request) above target
Is there any workaround for this?
I'm still facing this issue on v1.6.6
We are facing the same issue here. Are we sure this is an argo-rollouts issue and not a KEDA issue? Should this issue also be opened on KEDA's side? What can we do to help debug this problem?
Hello,
I think KEDA isn't related at all, as KEDA only exposes the metric to the HPA controller, and the HPA controller operates on the /scale
subresource (and the original report uses CPU and memory metrics, which aren't related to KEDA). IMHO the issue lies with the rollouts controller, since it is responsible for updating the underlying ReplicaSets.
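For anyone digging into this, the HPA only ever sees the Rollout through its scale subresource, which can be inspected directly; the namespace and name below are placeholders, and the output shown is only an approximation of its shape:

# kubectl get --raw /apis/argoproj.io/v1alpha1/namespaces/my-namespace/rollouts/app-name/scale
# returns a Scale object roughly like this:
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: app-name
  namespace: my-namespace
spec:
  replicas: 11            # what the HPA writes when it scales the Rollout
status:
  replicas: 11            # reported from the Rollout status (the HPAReplicas field visible in the logs above)
  selector: app=app-name  # label selector the HPA uses to collect pod metrics

If I understand correctly, that selector matches the pods of both the stable and the canary ReplicaSets, so the utilization the HPA averages during a canary includes the canary pods as well.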
Hi @zachaller, this is still a critical issue and it is directly affecting the rollout process.
Hi, I am experiencing the same. K8s version: 1.30 (EKS). Here is what I have tried; nothing fixed the problem.
More context:
I would like to try forcing the HPA to calculate utilization over the stable pods only, e.g. by using ephemeral metadata as a pod selector, but that is not currently supported by either the HPA or KEDA...
$ kubectl get hpa -n <app_name> -w
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
hpa Rollout/<app_name> cpu: 34%/40% 30 500 62 69m
hpa Rollout/<app_name> cpu: 35%/40% 30 500 62 69m
hpa Rollout/<app_name> cpu: 34%/40% 30 500 66 70m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 70m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 70m
hpa Rollout/<app_name> cpu: 32%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 34%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 32%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 32%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 30%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 28%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 28%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 82m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 64 82m
hpa Rollout/<app_name> cpu: 27%/40% 30 500 64 82m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 82m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 27%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 28%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 27%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 65 86m
hpa Rollout/<app_name> cpu: 39%/40% 30 500 71 86m
hpa Rollout/<app_name> cpu: 38%/40% 30 500 71 86m
hpa Rollout/<app_name> cpu: 39%/40% 30 500 71 86m
hpa Rollout/<app_name> cpu: 37%/40% 30 500 71 87m
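To illustrate the ephemeral-metadata idea I mentioned: Argo Rollouts can stamp role labels onto canary and stable pods during an update, roughly like this (label keys/values are illustrative, and the other required Rollout fields are omitted for brevity):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app-name               # placeholder
spec:
  strategy:
    canary:
      # applied to pods of the canary ReplicaSet only while the update is in progress
      canaryMetadata:
        labels:
          role: canary
      # applied to pods of the stable ReplicaSet
      stableMetadata:
        labels:
          role: stable

But since the HPA discovers pods via the selector on the Rollout's scale subresource rather than an arbitrary label selector, there is no supported way to restrict it to role: stable, which is the limitation I was referring to.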
Hi, we're facing the same issue here, in this case without using KEDA: with the HPA using memory-based metrics, replicas scale to the max during canary releases.
Same as other people's reports:
Normal SuccessfulRescale 36m (x12 over 8d) horizontal-pod-autoscaler New size: 7; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 33m (x11 over 8d) horizontal-pod-autoscaler New size: 8; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 18m (x3 over 8d) horizontal-pod-autoscaler New size: 8; reason: All metrics below target
Normal SuccessfulRescale 2m39s (x12 over 8d) horizontal-pod-autoscaler New size: 9; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 114s (x14 over 8d) horizontal-pod-autoscaler New size: 10; reason: memory resource utilization (percentage of request) above target
Is there anything we can contribute to help solve this issue, @zachaller?
Thanks!
Describe the bug
We use canary deploys via Argo Rollouts to deploy services. In services that use the Kubernetes Horizontal Pod Autoscaler with memory-based scaling configured (we don't see the same issue with CPU-based scaling), the stable ReplicaSet scales up to max replicas during each deploy and then scales back down after the deploy is complete.
Looking at the metrics reported for the service via both
kubectl describe hpa
and kubectl get hpa
during the scale-ups, I never see the reported metrics exceed the target, nor do I see them exceed the target in the corresponding Prometheus metrics. However, I still see HPA events scaling up the service due to memory.
The HPA configuration works as expected during normal operation and only seems to misbehave during Argo Rollouts deploys, which is why I think this is likely a bug in how Argo Rollouts interacts with the HPA.
Note that the replica count doesn't always go to max. If we increase the memory for the pods and/or raise the memory utilization target, we can reduce the number of replicas added during deployment. However, this isn't a great solution, as it adds cost to run machines with much more memory than we need just to reduce the problem.
To Reproduce
I haven't set up an isolated reproduction, but I think all that is necessary is deploying a service with a memory-based HPA that operates at roughly 50% memory utilization against a 70% target. Then perform a canary deploy of that service and it should scale up during the deploy; see the sketch below.
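A minimal sketch of that setup, assuming nothing beyond what is described above (names, image, resource values, and canary steps are all illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: demo/app:latest        # placeholder image
          resources:
            requests:
              memory: 512Mi             # pods should idle at roughly 50% of this request
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  minReplicas: 5
  maxReplicas: 40
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: demo-app
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70        # the 70% memory target described above

With pods idling around 50% of their memory request and the HPA targeting 70%, performing a canary deploy of this Rollout should reproduce the unexpected scale-up described here.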
Expected behavior
I expect the stable ReplicaSet not to scale up during the deploy unless an increase in traffic/utilization necessitates it.
Screenshots
The below screenshot shows the replica count during a deploy. The green line is the stable set and the yellow line is the canary set. You can see how it scales up during the deployment and then back down afterwards.
Version
v1.5.1
Logs
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.