
HPA scales up stable replicaset to max when doing canary deploys #2857

markh123 opened this issue 1 year ago (status: Open)

markh123 commented 1 year ago


Describe the bug

We use canary deploys via Argo Rollouts to deploy services. In services that use the Kubernetes horizontal pod autoscaler with scaling driven by memory utilization (we don't see the same issue with CPU-based scaling), we see the stable ReplicaSet scale up to max replicas during each deploy and then scale back down after the deploy is complete.

Looking at the metrics reported for the service via both kubectl describe hpa and kubectl get hpa during the scale-ups, I never see the reported metrics exceed their targets, nor do the corresponding Prometheus metrics exceed them:

Metrics:                                                  ( current / target )
  resource memory on pods  (as a percentage of request):  43% (486896088615m) / 70%
  resource cpu on pods  (as a percentage of request):     3% (43m) / 70% 

However, I still see HPA events scaling up the service due to memory:

  Normal  SuccessfulRescale  2m58s (x8 over 2d20h)  horizontal-pod-autoscaler  New size: 13; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  11s (x16 over 2d21h)   horizontal-pod-autoscaler  New size: 15; reason: memory resource utilization (percentage of request) above target

The HPA behaves as expected during normal operation and only misbehaves during Argo Rollouts deploys, which is why I think this is likely a bug in how Argo Rollouts interacts with the HPA.

Note that the replica count doesn't always go to max. If we increase the pods' memory requests and/or raise the memory utilization target, we can reduce the number of replicas added during deployment. That workaround isn't great, though, as it adds cost to run machines with much more memory than we need just to dampen the problem.
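
For reference, the HPA described above would look roughly like the sketch below. The original manifest wasn't posted, so the name and replica bounds are illustrative; the 70% utilization targets match the metrics output above.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service                 # illustrative name
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout                  # the HPA targets the Rollout, not a Deployment
    name: my-service
  minReplicas: 5                   # illustrative bounds
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70   # matches the 70% memory target shown above
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70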

To Reproduce

I haven't set up an isolated reproduction, but I believe all that is necessary is a service with a memory-based HPA that runs at roughly 50% memory utilization against a 70% target (a minimal sketch of such a setup follows). Performing a canary deploy of that service should trigger the scale-up.
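
A minimal Rollout to pair with an HPA like the one sketched earlier might look like this; the image, labels, and step durations are illustrative assumptions, and replicas is left unset so the HPA owns the count:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: example/my-service:latest   # illustrative image
          resources:
            requests:
              memory: 512Mi                  # sized so steady-state usage sits near 50%
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}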

Expected behavior

I expect the stable ReplicaSet not to scale up during the deploy unless an increase in traffic/utilization necessitates it.

Screenshots

The screenshot below shows the replica count during a deploy. The green line is the stable set and the yellow line is the canary set. You can see it scale up during the deployment and back down afterwards.

[Screenshot: replica counts during a deploy (Screen Shot 2023-06-23 at 2.12.06 PM)]

Version

v1.5.1

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

jandersen-plaid commented 1 year ago

/remove stale (didn't work but same effect)

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

hasan-tayyar-besik commented 10 months ago

A similar issue is happening with my setup. It only happens when a traffic control configuration is in place and HPA has memory-based autoscaling. I can only avoid this when I set the memory utilization threshold to a number that is unrealistically high, like 95% or 99%.

No matter what the average or maximum memory utilization is during the rollout, the HPA reports New size: X; reason: memory resource utilization (percentage of request) above target, and scales both the stable and the canary to max only at the last step of the rollout.

This is my HPA (managed by KEDA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app.kubernetes.io/managed-by: keda-operator
    app.kubernetes.io/name: keda-hpa-xxx
    app.kubernetes.io/part-of: xxx
    app.kubernetes.io/version: 2.12.1
    scaledobject.keda.sh/name: xxx
  name: keda-hpa-xxx
  namespace: xxxxx
  ownerReferences:
    - apiVersion: keda.sh/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ScaledObject
      name: xxx
spec:
  maxReplicas: 40
  metrics:
    - external:
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
        target:
          averageValue: '1'
          type: AverageValue
      type: External
    - resource:
        name: memory
        target:
          averageUtilization: 85
          type: Utilization
      type: Resource
    - resource:
        name: cpu
        target:
          averageUtilization: 50
          type: Utilization
      type: Resource
  minReplicas: 5
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: xxx
status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: recommended size matches current size
      reason: ReadyForNewScale
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from cpu
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 200m
          value: '0'
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: ...
      type: External
    - resource:
        current:
          averageUtilization: 52
          averageValue: 275596902400m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 49
          averageValue: 249m
        name: cpu
      type: Resource
  currentReplicas: 5
  desiredReplicas: 5
  lastScaleTime: '2024-01-03T10:40:37Z'

The scale-up continues until max; after some time it scales back down and the rollout completes, or I need to promote to full at the last step to avoid scaling to max.

This is how my HPA looks during the last step:

[screenshot: HPA replica graph during the last step]

During the scale-up at the last step, this is what my HPA says.

Scaling config:

    - resource:
        name: memory
        target:
          averageUtilization: 85
          type: Utilization

status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: >-
        recent recommendations were higher than current one, applying the
        highest recent recommendation
      reason: ScaleDownStabilized
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from memory
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 67m
          value: '0'
        metric:
          name: s1-cron-...
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
      type: External
    - resource:
        current:
          averageUtilization: 46
          averageValue: 244952268800m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 3
          averageValue: 17m
        name: cpu
      type: Resource
  currentReplicas: 15
  desiredReplicas: 15
  lastScaleTime: '2024-01-03T12:33:13Z'

Right after the rollout completes, the HPA starts scaling the ReplicaSet down:

[screenshot: replicas scaling back down after the rollout completes]

hasan-tayyar-besik commented 10 months ago

I have upgraded my Argo Rollouts Helm chart from 2.32.2 to 2.34.0 and the app version from 1.6.2 to 1.6.4. The issue persists.

hasan-tayyar-besik commented 10 months ago

Even after deleting all my Gatekeeper policies, I get the same issue.

In the Argo Rollouts logs, this was a bit confusing to me:

2024-01-03T16:10:10+01:00   time="2024-01-03T15:10:10Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"availableReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:10:08Z\",\"lastUpdateTime\":\"2024-01-03T15:10:08Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:10:10Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"Rollout is healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"True\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" has successfully progressed.\",\"reason\":\"NewReplicaSetAvailable\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":11,\"replicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751073974 rollout=app-name

...
2024-01-03T16:09:47+01:00   time="2024-01-03T15:09:47Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":13}}" generation=1065 namespace=my-namespace resourceVersion=751073492 rollout=app-name

...
2024-01-03T16:09:37+01:00   time="2024-01-03T15:09:37Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":14}}" generation=1065 namespace=my-namespace resourceVersion=751073328 rollout=app-name

...
2024-01-03T16:09:12+01:00   time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":20,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:09:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":20,\"updatedReplicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751072932 rollout=app-name

...
2024-01-03T16:09:12+01:00   time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":19,\"replicas\":19}}" generation=1065 namespace=my-namespace resourceVersion=751072919 rollout=app-name

...
2024-01-03T16:08:42+01:00   time="2024-01-03T15:08:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":18,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":18,\"updatedReplicas\":10}}" generation=1064 namespace=my-namespace resourceVersion=751072501 rollout=app-name

...
2024-01-03T16:08:12+01:00   time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":17,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":17,\"updatedReplicas\":9}}" generation=1063 namespace=my-namespace resourceVersion=751072060 rollout=app-name

...
2024-01-03T16:08:12+01:00   time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":16,\"replicas\":16}}" generation=1063 namespace=my-namespace resourceVersion=751072045 rollout=app-name

...
2024-01-03T16:07:42+01:00   time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":15,\"message\":\"waiting for all steps to complete\",\"replicas\":15,\"updatedReplicas\":8}}" generation=1062 namespace=my-namespace resourceVersion=751071598 rollout=app-name

...
2024-01-03T16:07:42+01:00   time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"replicas\":14}}" generation=1062 namespace=my-namespace resourceVersion=751071572 rollout=app-name

...
2024-01-03T16:07:12+01:00   time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:07:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":13,\"updatedReplicas\":7}}" generation=1061 namespace=my-namespace resourceVersion=751071170 rollout=app-name

...
2024-01-03T16:07:12+01:00   time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":12,\"replicas\":12}}" generation=1061 namespace=my-namespace resourceVersion=751071151 rollout=app-name

...
2024-01-03T16:06:42+01:00   time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:42Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"updated replicas are still becoming available\",\"replicas\":11,\"updatedReplicas\":6}}" generation=1060 namespace=my-namespace resourceVersion=751070699 rollout=app-name

...
2024-01-03T16:06:42+01:00   time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":10,\"replicas\":10}}" generation=1060 namespace=my-namespace resourceVersion=751070688 rollout=app-name

The logs as a whole:

Explore-logs-2024-01-03 16_12_34.txt

hasan-tayyar-besik commented 10 months ago

I deleted all the OPA Gatekeeper mutations that update the objects, so now we only have validations. KEDA is still in place and the behaviour is the same.

hansel-christopher1 commented 6 months ago

I am facing the same issue with:

  Normal  SuccessfulRescale  15m    horizontal-pod-autoscaler  New size: 18; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  7m35s  horizontal-pod-autoscaler  New size: 20; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  91s    horizontal-pod-autoscaler  New size: 22; reason: memory resource utilization (percentage of request) above target

@markh123 - were you able to figure out any workaround for this issue?

atmcarmo commented 4 months ago

We're having the exact same issue. For a period of time after the Rollout starts, with memory autoscaling configured, the HPA emits an event saying New size: XX; reason: memory resource utilization (percentage of request) above target.

Is there any workaround for this?

laivu266 commented 4 months ago

I'm still facing this issue on v1.6.6

diogofilipe098 commented 3 months ago

We are facing the same issue here. Are we sure this is an argo-rollouts issue and not a KEDA issue? Should this issue also be opened on KEDA's side? What can we do to help debug this problem?

JorTurFer commented 3 months ago

Hello, I think KEDA isn't related at all, as KEDA only exposes the metric to the HPA controller, and the HPA controller operates on the /scale subresource (and the original report uses CPU and memory, which aren't related to KEDA). IMHO the issue lies with the rollouts controller, as it is responsible for updating the underlying ReplicaSets.
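
For context, the HPA controller's core calculation, as documented by Kubernetes, is:

desiredReplicas = ceil( currentReplicas * currentMetricValue / targetMetricValue )

During a canary, pods from both the stable and canary ReplicaSets sit behind the Rollout's selector and /scale subresource, so one plausible (unconfirmed) reading of the reports above is that transiently high memory in freshly started canary pods inflates the averaged metric, nudging the desired count upward step after step even though steady-state utilization stays below target.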

laivu266 commented 2 months ago

Hi @zachaller, this is still a critical issue and is directly affecting the rollout process.

yijeong commented 2 months ago

Hi, I am experiencing the same. K8s version: 1.30 (EKS). Here is what I have tried... nothing fixed the problem.


I would like to try forcing the HPA to calculate over only the stable pods, e.g. by using ephemeral metadata as a pod selector (see the sketch after the output below), but that is not currently supported by either the HPA or KEDA...

$ kubectl get hpa -n <app_name>  -w
NAME   REFERENCE                     TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
hpa    Rollout/<app_name>   cpu: 34%/40%   30        500       62         69m  
hpa    Rollout/<app_name>   cpu: 35%/40%   30        500       62         69m  
hpa    Rollout/<app_name>   cpu: 34%/40%   30        500       66         70m  
hpa    Rollout/<app_name>   cpu: 33%/40%   30        500       66         70m  
hpa    Rollout/<app_name>   cpu: 33%/40%   30        500       66         70m  
hpa    Rollout/<app_name>   cpu: 32%/40%   30        500       66         71m  
hpa    Rollout/<app_name>   cpu: 33%/40%   30        500       66         71m  
hpa    Rollout/<app_name>   cpu: 33%/40%   30        500       66         71m  
hpa    Rollout/<app_name>   cpu: 33%/40%   30        500       66         71m  
hpa    Rollout/<app_name>   cpu: 34%/40%   30        500       66         72m  
hpa    Rollout/<app_name>   cpu: 32%/40%   30        500       66         72m  
hpa    Rollout/<app_name>   cpu: 32%/40%   30        500       66         72m  
hpa    Rollout/<app_name>   cpu: 31%/40%   30        500       66         72m  
hpa    Rollout/<app_name>   cpu: 31%/40%   30        500       66         73m  
hpa    Rollout/<app_name>   cpu: 31%/40%   30        500       66         73m  
hpa    Rollout/<app_name>   cpu: 31%/40%   30        500       66         73m  
hpa    Rollout/<app_name>   cpu: 31%/40%   30        500       66         73m  
hpa    Rollout/<app_name>   cpu: 30%/40%   30        500       66         74m  
hpa    Rollout/<app_name>   cpu: 28%/40%   30        500       66         74m  
hpa    Rollout/<app_name>   cpu: 28%/40%   30        500       66         74m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       66         74m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         75m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         75m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         75m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         75m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         76m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         76m  
hpa    Rollout/<app_name>   cpu: 24%/40%   30        500       66         76m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         76m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       66         77m  
hpa    Rollout/<app_name>   cpu: 24%/40%   30        500       66         77m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         77m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       66         77m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         78m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         78m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         78m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         78m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         79m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         79m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         79m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         79m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         80m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         80m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         80m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         80m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         81m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         81m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         81m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         81m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         82m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       64         82m  
hpa    Rollout/<app_name>   cpu: 27%/40%   30        500       64         82m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         82m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         83m  
hpa    Rollout/<app_name>   cpu: 27%/40%   30        500       65         83m  
hpa    Rollout/<app_name>   cpu: 28%/40%   30        500       65         83m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         83m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         84m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         84m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         84m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         84m  
hpa    Rollout/<app_name>   cpu: 27%/40%   30        500       65         85m  
hpa    Rollout/<app_name>   cpu: 26%/40%   30        500       65         85m  
hpa    Rollout/<app_name>   cpu: 25%/40%   30        500       65         85m  
hpa    Rollout/<app_name>   cpu: 24%/40%   30        500       65         85m  
hpa    Rollout/<app_name>   cpu: 24%/40%   30        500       65         86m  
hpa    Rollout/<app_name>   cpu: 39%/40%   30        500       71         86m  
hpa    Rollout/<app_name>   cpu: 38%/40%   30        500       71         86m  
hpa    Rollout/<app_name>   cpu: 39%/40%   30        500       71         86m  
hpa    Rollout/<app_name>   cpu: 37%/40%   30        500       71         87m                                                                                
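
For what it's worth, Argo Rollouts can already attach distinct labels to stable and canary pods via ephemeral metadata; a sketch of that config is below, with illustrative label values. As noted above, though, neither the HPA nor KEDA currently accepts a pod selector that could consume these labels.

spec:
  strategy:
    canary:
      stableMetadata:
        labels:
          role: stable   # applied to pods of the stable ReplicaSet
      canaryMetadata:
        labels:
          role: canary   # applied to pods of the canary ReplicaSet
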
rtemperini commented 1 week ago

Hi, we're facing the same issue here, without KEDA in this case: with the HPA using memory-based metrics, replicas scale to max during canary releases.

Same as other people's reports:

  Normal  SuccessfulRescale  36m (x12 over 8d)    horizontal-pod-autoscaler  New size: 7; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  33m (x11 over 8d)    horizontal-pod-autoscaler  New size: 8; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  18m (x3 over 8d)     horizontal-pod-autoscaler  New size: 8; reason: All metrics below target
  Normal  SuccessfulRescale  2m39s (x12 over 8d)  horizontal-pod-autoscaler  New size: 9; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  114s (x14 over 8d)   horizontal-pod-autoscaler  New size: 10; reason: memory resource utilization (percentage of request) above target

Is there anything that we can contribute to help solve this issue, @zachaller?

Thanks!