markh123 opened this issue 1 year ago
A similar issue is happening with my setup. It only occurs when a traffic-routing configuration is in place and the HPA uses memory-based autoscaling. I can only avoid it by setting the memory utilization threshold to an unrealistically high number, such as 95% or 99%.
No matter what the average or maximum memory utilization is during the rollout, the HPA reports New size: X; reason: memory resource utilization (percentage of request) above target
and scales both the stable and the canary ReplicaSets to max, but only at the last step of the rollout.
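The only mitigation I have found is raising the memory target on the HPA, roughly like this (the 95 below is the "unrealistically high" value mentioned above, shown only for illustration, not what I actually want to run with):

  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 95   # set far above real usage so rollouts no longer trigger scale-ups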
This is my HPA (managed by Keda)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app.kubernetes.io/managed-by: keda-operator
    app.kubernetes.io/name: keda-hpa-xxx
    app.kubernetes.io/part-of: xxx
    app.kubernetes.io/version: 2.12.1
    scaledobject.keda.sh/name: xxx
  name: keda-hpa-xxx
  namespace: xxxxx
  ownerReferences:
    - apiVersion: keda.sh/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ScaledObject
      name: xxx
spec:
  maxReplicas: 40
  metrics:
    - external:
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
        target:
          averageValue: '1'
          type: AverageValue
      type: External
    - resource:
        name: memory
        target:
          averageUtilization: 85
          type: Utilization
      type: Resource
    - resource:
        name: cpu
        target:
          averageUtilization: 50
          type: Utilization
      type: Resource
  minReplicas: 5
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: xxx
status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: recommended size matches current size
      reason: ReadyForNewScale
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from cpu
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 200m
          value: '0'
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: ...
      type: External
    - resource:
        current:
          averageUtilization: 52
          averageValue: 275596902400m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 49
          averageValue: 249m
        name: cpu
      type: Resource
  currentReplicas: 5
  desiredReplicas: 5
  lastScaleTime: '2024-01-03T10:40:37Z'
The scale-up continues until max replicas; after some time it scales back down and the rollout completes, or I have to promote-full at the last step to avoid scaling to max.
This is what my HPA looks like while it is scaling up at the last step:
Scaling config:
  - resource:
      name: memory
      target:
        averageUtilization: 85
        type: Utilization

status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: >-
        recent recommendations were higher than current one, applying the
        highest recent recommendation
      reason: ScaleDownStabilized
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from memory
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 67m
          value: '0'
        metric:
          name: s1-cron-...
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
      type: External
    - resource:
        current:
          averageUtilization: 46
          averageValue: 244952268800m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 3
          averageValue: 17m
        name: cpu
      type: Resource
  currentReplicas: 15
  desiredReplicas: 15
  lastScaleTime: '2024-01-03T12:33:13Z'
Right after the rollout completes, the HPA starts scaling the ReplicaSet back down.
I have upgraded my Argo Rollouts Helm chart from 2.32.2 to 2.34.0 and the app version from 1.6.2 to 1.6.4; the issue persists.
After deleting all my Gatekeeper policies I still get the same issue.
In the Argo Rollouts logs, this part was a bit confusing to me:
2024-01-03T16:10:10+01:00 time="2024-01-03T15:10:10Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"availableReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:10:08Z\",\"lastUpdateTime\":\"2024-01-03T15:10:08Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:10:10Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"Rollout is healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"True\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" has successfully progressed.\",\"reason\":\"NewReplicaSetAvailable\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":11,\"replicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751073974 rollout=app-name
...
2024-01-03T16:09:47+01:00 time="2024-01-03T15:09:47Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":13}}" generation=1065 namespace=my-namespace resourceVersion=751073492 rollout=app-name
...
2024-01-03T16:09:37+01:00 time="2024-01-03T15:09:37Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":14}}" generation=1065 namespace=my-namespace resourceVersion=751073328 rollout=app-name
...
2024-01-03T16:09:12+01:00 time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":20,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:09:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":20,\"updatedReplicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751072932 rollout=app-name
...
2024-01-03T16:09:12+01:00 time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":19,\"replicas\":19}}" generation=1065 namespace=my-namespace resourceVersion=751072919 rollout=app-name
...
2024-01-03T16:08:42+01:00 time="2024-01-03T15:08:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":18,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":18,\"updatedReplicas\":10}}" generation=1064 namespace=my-namespace resourceVersion=751072501 rollout=app-name
...
2024-01-03T16:08:12+01:00 time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":17,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":17,\"updatedReplicas\":9}}" generation=1063 namespace=my-namespace resourceVersion=751072060 rollout=app-name
...
2024-01-03T16:08:12+01:00 time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":16,\"replicas\":16}}" generation=1063 namespace=my-namespace resourceVersion=751072045 rollout=app-name
...
2024-01-03T16:07:42+01:00 time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":15,\"message\":\"waiting for all steps to complete\",\"replicas\":15,\"updatedReplicas\":8}}" generation=1062 namespace=my-namespace resourceVersion=751071598 rollout=app-name
...
2024-01-03T16:07:42+01:00 time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"replicas\":14}}" generation=1062 namespace=my-namespace resourceVersion=751071572 rollout=app-name
...
2024-01-03T16:07:12+01:00 time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:07:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":13,\"updatedReplicas\":7}}" generation=1061 namespace=my-namespace resourceVersion=751071170 rollout=app-name
...
2024-01-03T16:07:12+01:00 time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":12,\"replicas\":12}}" generation=1061 namespace=my-namespace resourceVersion=751071151 rollout=app-name
...
2024-01-03T16:06:42+01:00 time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:42Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"updated replicas are still becoming available\",\"replicas\":11,\"updatedReplicas\":6}}" generation=1060 namespace=my-namespace resourceVersion=751070699 rollout=app-name
...
2024-01-03T16:06:42+01:00 time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":10,\"replicas\":10}}" generation=1060 namespace=my-namespace resourceVersion=751070688 rollout=app-name
Logs as a whole
I deleted all the OPA Gatekeeper mutations that update the objects, so now we only have validations. KEDA is still in place and the behaviour is the same.
I am facing the same issue with:
v1.6.0
v2.7.2
2.11.2
Normal SuccessfulRescale 15m horizontal-pod-autoscaler New size: 18; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 7m35s horizontal-pod-autoscaler New size: 20; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 91s horizontal-pod-autoscaler New size: 22; reason: memory resource utilization (percentage of request) above target
@markh123 - were you able to figure out any workaround for this issue?
We're having the exact same issue. For a period of time after the Rollout starts, with memory autoscaling configured, the HPA emits events saying New size: XX; reason: memory resource utilization (percentage of request) above target
Is there any workaround for this?
I'm still facing this issue on v1.6.6
We are facing the same issue here. Are we sure this is an argo-rollouts issue and not a KEDA issue? Should this issue also be opened on KEDA's side? What can we do to help debug this problem?
Hello,
I think KEDA isn't related at all, as KEDA only exposes the metric to the HPA controller, and the HPA controller operates on the /scale
subresource (and the original report uses CPU and memory metrics, which aren't related to KEDA). IMHO the issue lies with the rollouts controller, since it is responsible for updating the underlying ReplicaSets.
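For anyone digging into this, the HPA only ever sees the Rollout through its scale subresource, which can be inspected directly; the namespace and name below are placeholders, and the output shown is only an approximation of its shape:

# kubectl get --raw /apis/argoproj.io/v1alpha1/namespaces/my-namespace/rollouts/app-name/scale
# returns a Scale object roughly like this:
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: app-name
  namespace: my-namespace
spec:
  replicas: 11            # what the HPA writes when it scales the Rollout
status:
  replicas: 11            # reported from the Rollout status (the HPAReplicas field visible in the logs above)
  selector: app=app-name  # label selector the HPA uses to collect pod metrics

If I understand correctly, that selector matches the pods of both the stable and the canary ReplicaSets, so the utilization the HPA averages during a canary includes the canary pods as well.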
Hi @zachaller, this is still a critical issue and it is directly affecting the rollout process.
Hi, I am experiencing the same. K8s version: 1.30 (EKS). Here is what I have tried; nothing fixed the problem.
More context:
I would like to try forcing the HPA to calculate utilization over the stable pods only, e.g. by using ephemeral metadata as a pod selector, but that is not currently supported by either the HPA or KEDA...
$ kubectl get hpa -n <app_name> -w
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
hpa Rollout/<app_name> cpu: 34%/40% 30 500 62 69m
hpa Rollout/<app_name> cpu: 35%/40% 30 500 62 69m
hpa Rollout/<app_name> cpu: 34%/40% 30 500 66 70m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 70m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 70m
hpa Rollout/<app_name> cpu: 32%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 33%/40% 30 500 66 71m
hpa Rollout/<app_name> cpu: 34%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 32%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 32%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 72m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 31%/40% 30 500 66 73m
hpa Rollout/<app_name> cpu: 30%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 28%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 28%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 66 74m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 75m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 76m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 66 77m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 78m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 79m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 80m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 81m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 82m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 64 82m
hpa Rollout/<app_name> cpu: 27%/40% 30 500 64 82m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 82m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 27%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 28%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 83m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 84m
hpa Rollout/<app_name> cpu: 27%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 26%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 25%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 65 85m
hpa Rollout/<app_name> cpu: 24%/40% 30 500 65 86m
hpa Rollout/<app_name> cpu: 39%/40% 30 500 71 86m
hpa Rollout/<app_name> cpu: 38%/40% 30 500 71 86m
hpa Rollout/<app_name> cpu: 39%/40% 30 500 71 86m
hpa Rollout/<app_name> cpu: 37%/40% 30 500 71 87m
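To illustrate the ephemeral-metadata idea I mentioned: Argo Rollouts can stamp role labels onto canary and stable pods during an update, roughly like this (label keys/values are illustrative, and the other required Rollout fields are omitted for brevity):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app-name               # placeholder
spec:
  strategy:
    canary:
      # applied to pods of the canary ReplicaSet only while the update is in progress
      canaryMetadata:
        labels:
          role: canary
      # applied to pods of the stable ReplicaSet
      stableMetadata:
        labels:
          role: stable

But since the HPA discovers pods via the selector on the Rollout's scale subresource rather than an arbitrary label selector, there is no supported way to restrict it to role: stable, which is the limitation I was referring to.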
Hi, we're facing the same issue here, in this case without using KEDA: with the HPA using memory-based metrics, replicas scale to the max during canary releases.
Same as other people's reports:
Normal SuccessfulRescale 36m (x12 over 8d) horizontal-pod-autoscaler New size: 7; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 33m (x11 over 8d) horizontal-pod-autoscaler New size: 8; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 18m (x3 over 8d) horizontal-pod-autoscaler New size: 8; reason: All metrics below target
Normal SuccessfulRescale 2m39s (x12 over 8d) horizontal-pod-autoscaler New size: 9; reason: memory resource utilization (percentage of request) above target
Normal SuccessfulRescale 114s (x14 over 8d) horizontal-pod-autoscaler New size: 10; reason: memory resource utilization (percentage of request) above target
Is there anything we can contribute to help solve this issue, @zachaller?
Thanks!
Describe the bug
We use canary deploys via Argo Rollouts to deploy services. In services that use the Kubernetes Horizontal Pod Autoscaler with memory-based scaling configured (we don't see the same issue with CPU-based scaling), the stable ReplicaSet scales up to max replicas during each deploy and then scales back down after the deploy is complete.
Looking at the metrics reported for the service via both
kubectl describe hpa
and kubectl get hpa
during the scale-ups, I never see the reported metrics exceed the target, nor do I see them exceed the target in the corresponding Prometheus metrics. However, I still see HPA events scaling up the service due to memory.
The HPA configuration works as expected during normal operation and only seems to misbehave during Argo Rollouts deploys, which is why I think this is likely a bug in how Argo Rollouts interacts with the HPA.
Note that the replica count doesn't always go to max. If we increase the memory for the pods and/or raise the memory utilization target, we can reduce the number of replicas added during deployment. However, this isn't a great solution, as it adds cost to run machines with much more memory than we need just to reduce the problem.
To Reproduce
I haven't set up an isolated reproduction, but I think all that is necessary is deploying a service with a memory-based HPA that operates at roughly 50% memory utilization against a 70% target. Then perform a canary deploy of that service and it should scale up during the deploy; see the sketch below.
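A minimal sketch of that setup, assuming nothing beyond what is described above (names, image, resource values, and canary steps are all illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: demo/app:latest        # placeholder image
          resources:
            requests:
              memory: 512Mi             # pods should idle at roughly 50% of this request
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  minReplicas: 5
  maxReplicas: 40
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: demo-app
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70        # the 70% memory target described above

With pods idling around 50% of their memory request and the HPA targeting 70%, performing a canary deploy of this Rollout should reproduce the unexpected scale-up described here.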
Expected behavior
I expect the stable ReplicaSet not to scale up during the deploy unless an increase in traffic/utilization necessitates it.
Screenshots
The below screenshot shows the replica count during a deploy. The green line is the stable set and the yellow line is the canary set. You can see how it scales up during the deployment and then back down afterwards.
Version
v1.5.1
Logs
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.