caas-team / py-kube-downscaler

Scale down / "pause" Kubernetes workloads (Deployments, StatefulSets, and/or HorizontalPodAutoscalers and CronJobs too!) during non-work hours.
GNU General Public License v3.0

Downscaling with Keda ScaledObjects not working #96

Closed: cecchcc closed this issue 3 weeks ago

cecchcc commented 1 month ago

Issue

Hello,

We deployed py-kube-downscaler with Helm on our cluster and wanted to use it with Keda ScaledObjects. We annotated the ScaledObject with downscaler/downtime-replicas and downscaler/uptime, and we also tried the annotation downscaler/exclude: "true" on the deployment, as described in the docs. But it has no effect: the pods are not scaling down.

When deploying py-kube-downscaler, we launched it with '--include-resources=deployments,statefulsets,scaledobjects'.

Are we missing something?

samuel-esp commented 1 month ago

Hi @cecchcc could you share:

cecchcc commented 1 month ago

Hello @samuel-esp, here is the requested information. Our deployment:

# Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
  namespace: orange
  annotations:
    downscaler/exclude: 'true'
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/version: 0.1.0
  template:
    metadata:
      labels:
        app.kubernetes.io/version: 0.1.0
        heritage: Helm
    spec:
      containers:
        - name: php
          image: ******
        - name: http
          image: ******

---

# Scaled Object

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    downscaler/downtime-replicas: '1'
    downscaler/uptime: Mon-Fri 18:00-22:00 Europe/Paris
  name: backend-api
  namespace: orange
spec:
  cooldownPeriod: 300
  maxReplicaCount: 10
  minReplicaCount: 1
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  triggers:
    - metadata:
        metricName: active_processes
        query: >
          avg((sum(active_processes{job="orange"}) by
          (kubernetes_pod_name) *100) /
          sum(total_processes{job="orange"}) by
          (kubernetes_pod_name))
        serverAddress: *****
        threshold: '50'
      type: prometheus

The Kubedownscaler configuration:

# Kubedownscaler configmap

apiVersion: v1
kind: ConfigMap
metadata:
  name: py-kube-downscaler
  namespace: kube-downscaler
data:
  EXCLUDE_NAMESPACES: py-kube-downscaler,kube-downscaler,kube-system

---

# Kubedownscaler deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: py-kube-downscaler
  namespace: kube-downscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      application: py-kube-downscaler
  template:
    metadata:
      labels:
        application: py-kube-downscaler   
    spec:
      containers:
        - name: py-kube-downscaler
          image: ghcr.io/caas-team/py-kube-downscaler:24.8.0
          args:
            - '--interval=60'
            - '--include-resources=deployments,statefulsets,scaledobjects'
            - '--debug'
          envFrom:
            - configMapRef:
                name: py-kube-downscaler
                optional: true
          resources:
            limits:
              cpu: 500m
              memory: 900Mi
            requests:
              cpu: 200m
              memory: 300Mi
          securityContext:
            capabilities:
              drop:
                - ALL
            privileged: false
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
      serviceAccountName: kube-downscaler-py-kube-downscaler
      serviceAccount: kube-downscaler-py-kube-downscaler

Here is a sample of the logs concerning our deployment:

2024-09-17 14:20:11,420 DEBUG: ScaledObject orange/backend-api has 1 replicas (original: None, uptime: Mon-Fri 18:00-22:00 Europe/Paris)
2024-09-17 14:21:17,520 DEBUG: Deployment orange/backend-api was excluded
2024-09-17 14:21:19,669 DEBUG: ScaledObject orange/backend-api has 1 replicas (original: None, uptime: Mon-Fri 18:00-22:00 Europe/Paris)
2024-09-17 14:22:25,452 DEBUG: Deployment orange/backend-api was excluded

Kube Downscaler does not give us more logs than that, and it always says there is 1 replica even when there are more; for example, here we have 2 replicas for our backend-api but it still reports 1.

samuel-esp commented 1 month ago

Hi @cecchcc, thank you for your answer. From what I understand, you are trying to downscale the deployment in the time interval Mon-Fri 18:00-22:00 Europe/Paris. First of all, you should delete the downscaler/exclude: 'true' annotation from the deployment.

Keeping that annotation means the Deployment will be excluded from downscaling, so you should remove it and instead use the downtime-replicas and uptime annotations on the Deployment.
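For example, something like this (an illustrative snippet, mirroring the full configuration shared further down in this thread):

# Deployment annotations (illustrative)
metadata:
  name: backend-api
  namespace: orange
  annotations:
    downscaler/downtime-replicas: '1'
    downscaler/uptime: Mon-Fri 18:00-22:00 Europe/Paris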

For the Keda ScaledObject, the downscaler/downtime-replicas annotation isn't supported in the current release (but I will include it in the next release), so on the ScaledObject you should keep only the uptime annotation.
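For example (again illustrative):

# ScaledObject annotations (illustrative)
metadata:
  name: backend-api
  namespace: orange
  annotations:
    downscaler/uptime: Mon-Fri 18:00-22:00 Europe/Paris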

Can you try to test this configuration? (changing the time interval of course to now)


Looking at the time in the logs, it is correct that the workloads are not downscaled if you are currently targeting the time interval Mon-Fri 18:00-22:00 Europe/Paris:

2024-09-17 14:20:11,420 DEBUG: ScaledObject orange/backend-api has 1 replicas (original: None, uptime: Mon-Fri 18:00-22:00 Europe/Paris)
2024-09-17 14:21:17,520 DEBUG: Deployment orange/backend-api was excluded
2024-09-17 14:21:19,669 DEBUG: ScaledObject orange/backend-api has 1 replicas (original: None, uptime: Mon-Fri 18:00-22:00 Europe/Paris)
2024-09-17 14:22:25,452 DEBUG: Deployment orange/backend-api was excluded

Kube Downscaler does not give us more logs than that and it always say there is 1 replica even if there is more, for example here, we have 2 replicas for our backend-api but it still indicates 1.

I see the concern about this log. Unfortunately the message is a not-so-elegant way of saying "the ScaledObject is not downscaled yet"; I will try to address this point with a clearer log message for Keda in the next release.

cecchcc commented 1 month ago

Hi @samuel-esp , thank you for your answer.

We tried with and without downscaler/exclude: 'true' and it does not scale in either case. If I understand correctly, we have to add the annotation downscaler/uptime: Mon-Fri 18:00-22:00 Europe/Paris to both the Deployment and the Keda ScaledObject manifests? How can we specify the number of replicas if downscaler/downtime-replicas is not available?

Looking from the time inside the logs, it is correct the workloads are not downscaled if you are currently targeting this time interval Mon-Fri 18:00-22:00 Europe/Paris

I don't understand why, as we specified the uptime of the application here and not the downtime, so we should have 2 replicas between 18:00 and 22:00 and only 1 replica from 22:00 to 18:00. But currently we still have 2, no matter what time it is.

samuel-esp commented 1 month ago

I don't understand why as we specified here the uptime of the application and not the downtime, so we should have 2 replicas between 18:00-22:00 and between 22:00 to 18:00 only 1 replica. But currently, we still have 2 no matter what the time is.

Sorry, I just read the annotation the wrong way (as downscaler/downtime), so you are correct: both resources should be downscaled outside that interval if you are using downscaler/uptime. I'll try to check with a test cluster, replicating your situation.

How can we specify the number of replicas if the downscaler/downtime-replicas is not available?

downscaler/downtime-replicas is supported on Deployments but not on Keda ScaledObjects, so the behavior I'm expecting to see is:

  1. The Deployment gets downscaled to the downscaler/downtime-replicas value
  2. The ScaledObject gets paused at 0 (because currently you can't specify a different value with downscaler/downtime-replicas), so the Deployment will also be scaled to 0 since it is controlled by the ScaledObject (see the snippet below)
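For context, KEDA's pause mechanism is the autoscaling.keda.sh/paused-replicas annotation, which is what "paused at 0" refers to; a paused ScaledObject would carry something like this (illustrative sketch):

# ScaledObject while paused (illustrative)
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "0"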
samuel-esp commented 1 month ago

You should be able to make it work using this configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
  namespace: orange
  annotations:
    downscaler/downtime-replicas: '1'
    downscaler/uptime: Mon-Fri 18:00-22:00 Europe/Paris
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/version: 0.1.0
  template:
    metadata:
      labels:
        app.kubernetes.io/version: 0.1.0
        heritage: Helm
    spec:
      containers:
        - name: php
          image: nginx
        - name: http
          image: nginx
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    downscaler/uptime: Mon-Fri 18:00-22:00 Europe/Paris
  name: backend-api
  namespace: orange
spec:
  cooldownPeriod: 300
  maxReplicaCount: 10
  minReplicaCount: 1
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  triggers:
    - metadata:
        metricName: active_processes
        query: >
          avg((sum(active_processes{job="orange"}) by
          (kubernetes_pod_name) *100) /
          sum(total_processes{job="orange"}) by
          (kubernetes_pod_name))
        threshold: '50'
      type: prometheus

The good news is that the problematic behavior I was expecting didn't actually happen:

the behavior I'm expecting to see is:

  1. Deployment gets downscaled to downscaler/downtime-replicas value
  2. Scaled Object gets paused to 0 (because currently you can't specify a different value with downscaler/downtime-replicas ), so also the deployment will be scaled to 0 because it is controlled by the Scaled Object

So it is sufficient to specify downscaler/downtime-replicas: '1' only in the Deployment in order to achieve your desired configuration.

Let me know if this helps! I'll try to clarify the logs and behavior in a PR for the next release. Thanks a lot for raising this question.

cecchcc commented 1 month ago

I tried adding the annotations on the Deployment and the ScaledObject, but it is still not functioning correctly. In fact, the Deployment scales down, but Keda scales it back up because it does not match its desired number of pods.

Here are the logs; we noticed an error with the autoscaling.keda.sh/paused-replicas annotation:


2024-09-18 13:55:23,675 DEBUG: Deployment orange/backend-api has 2 replicas (original: None, uptime: Mon-Fri 18:00-22:00 Europe/Paris)
2024-09-18 13:55:23,675 INFO: Scaling down Deployment orange/backend-api from 2 to 1 replicas (uptime: Mon-Fri 18:00-22:00 Europe/Paris, downtime: never)
2024-09-18 13:55:25,693 DEBUG: ScaledObject orange/backend-api has 1 replicas (original: None, uptime: Mon-Fri 18:00-22:00 Europe/Paris)
2024-09-18 13:55:25,693 ERROR: Failed to process ScaledObject orange/backend-api: 'autoscaling.keda.sh/paused-replicas'
Traceback (most recent call last):
  File "/kube_downscaler/scaler.py", line 940, in autoscale_resource
    scale_down(
  File "/kube_downscaler/scaler.py", line 651, in scale_down
    if resource.annotations[ScaledObject.keda_pause_annotation] is not None:
KeyError: 'autoscaling.keda.sh/paused-replicas'
samuel-esp commented 1 month ago

Hi @cecchcc, the error you are facing now was solved by #87, #91, and #92. You should upgrade your installation to at least version v24.8.2.
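If the downscaler was deployed from a plain manifest like the one shared earlier, bumping the image tag is enough (illustrative snippet, matching that Deployment); with the Helm chart, the equivalent would be upgrading to a chart release that ships at least that image version:

# In the py-kube-downscaler Deployment (illustrative)
containers:
  - name: py-kube-downscaler
    image: ghcr.io/caas-team/py-kube-downscaler:24.8.2  # was 24.8.0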

cecchcc commented 1 month ago

Hello @samuel-esp, we no longer get the error with v24.8.2 and it downscales the deployment, but downscaler/downtime-replicas: '1' is not applied: it scales the replicas down to 0. Do you have any idea why?

samuel-esp commented 1 month ago

Hi @cecchcc, it seems you encountered the behavior I was suspecting. You should wait until we add official support for the downscaler/downtime-replicas annotation on ScaledObjects. Just give me some time, because I want to test it again; when I first tried to replicate your situation, the behavior you are describing didn't happen. I just want to double-check.

samuel-esp commented 1 month ago

I managed to replicate your situation again, and I can confirm the behavior you are describing is happening. Wait for the next release; I'll include support for the downscaler/downtime-replicas annotation on ScaledObjects.
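Once that support lands, the ScaledObject itself should accept the same annotations as the Deployment, presumably something like (illustrative):

# ScaledObject annotations once supported (illustrative)
metadata:
  annotations:
    downscaler/downtime-replicas: '1'
    downscaler/uptime: Mon-Fri 18:00-22:00 Europe/Paris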

cecchcc commented 3 weeks ago

Hello @samuel-esp ,

I saw that there was a new release today. I tried it, but we are now getting this error:

2024-10-17 12:43:19,217 ERROR: Failed to process ScaledObject orange/backend-api: HTTPSConnectionPool(host='10.100.0.1', port=443): Read timed out. (read timeout=10)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 507, in getresponse
    httplib_response = super().getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.10/socket.py", line 717, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.10/ssl.py", line 1307, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.10/ssl.py", line 1163, in read
    return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 538, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 369, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.100.0.1', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/kube_downscaler/scaler.py", line 990, in autoscale_resource
    resource.update()
  File "/usr/local/lib/python3.10/site-packages/pykube/objects.py", line 165, in update
    self.patch(self.obj, subresource=subresource)
  File "/usr/local/lib/python3.10/site-packages/pykube/objects.py", line 150, in patch
    r = self.api.patch(
  File "/usr/local/lib/python3.10/site-packages/pykube/http.py", line 515, in patch
    return self.session.patch(*args, **self.get_kwargs(**kwargs))
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 661, in patch
    return self.request("PATCH", url, data=data, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pykube/http.py", line 181, in send
    response = self._do_send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 713, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='10.100.0.1', port=443): Read timed out. (read timeout=10)

Kube-downscaler can't downscale or upscale because of a read timeout. We did not have this issue before. Have you encountered this issue already?

We added these two annotations on both the ScaledObject and the Deployment:

downscaler/downtime-replicas: '2'
downscaler/uptime: Mon-Fri 14:42-14:46 Europe/Paris
samuel-esp commented 3 weeks ago

Hi @cecchcc, are you running managed Kubernetes (EKS, AKS, GKE) or self-hosted? Are you encountering this error only for ScaledObjects, or for other workloads as well?

At first sight, it seems the Kubernetes API server is throttling requests.

samuel-esp commented 3 weeks ago

This is the setup I used to try to replicate your situation in my test cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cron-scaling-deployment
  namespace: orange
  annotations:
    downscaler/downtime-replicas: "1"
  labels:
    app: cron-scaling-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cron-scaling-app
  template:
    metadata:
      labels:
        app: cron-scaling-app
    spec:
      containers:
      - name: cron-scaling-container
        image: nginx  
        ports:
        - containerPort: 80
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-scaling-object
  namespace: orange
  annotations:
    downscaler/downtime-replicas: "1"
  labels:
    app: cron-scaling-app
spec:
  scaleTargetRef:
    name: cron-scaling-deployment  # The name of the deployment to scale
  minReplicaCount: 3  # Minimum number of replicas
  maxReplicaCount: 5  # Maximum number of replicas
  triggers:
  - type: cron
    metadata:
      timezone: Etc/UTC  
      start: "*/5 * * * *"  
      end: "1-59/5 * * * *" 
      desiredReplicas: "5"  

This is the KubeDownscaler config

apiVersion: v1
data:
  DEFAULT_DOWNTIME: Mon-Fri 10:00-19:00 CET
  EXCLUDE_NAMESPACES: py-kube-downscaler,kube-downscaler,kube-system
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: my-py-kube-downscaler
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2024-10-17T13:22:14Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: py-kube-downscaler
  namespace: default
  resourceVersion: "4295"
  uid: 52af7720-6c00-4a29-94f4-7851f6443cd7

This is my log

2024-10-17 13:35:46,967 INFO: Downscaler vdev started with admission_controller=, debug=False, default_downtime=Mon-Fri 10:00-19:00 CET, default_uptime=always, deployment_time_annotation=None, downscale_period=never, downtime_replicas=0, dry_run=False, enable_events=False, exclude_deployments=py-kube-downscaler,kube-downscaler,downscaler, exclude_namespaces=py-kube-downscaler,kube-downscaler,kube-system, grace_period=0, include_resources=deployments,statefulsets,scaledobjects, interval=60, matching_labels=, namespace=, once=False, upscale_period=never, upscale_target_only=False
2024-10-17 13:35:47,234 INFO: Scaling down Deployment local-path-storage/local-path-provisioner from 1 to 0 replicas (uptime: always, downtime: Mon-Fri 10:00-19:00 CET)
2024-10-17 13:35:47,271 INFO: Scaling down Deployment orange/cron-scaling-deployment from 5 to 1 replicas (uptime: always, downtime: Mon-Fri 10:00-19:00 CET)
2024-10-17 13:35:47,596 INFO: Pausing ScaledObject orange/cron-scaling-object (uptime: always, downtime: Mon-Fri 10:00-19:00 CET)
cecchcc commented 3 weeks ago

We are using EKS. I tried using the downscaler with plain Deployments and it is working fine:

2024-10-17 13:46:17,555 INFO: Scaling down Deployment orange/test-downscaler from 5 to 2 replicas (uptime: Mon-Fri 15:34-15:36 Europe/Paris, downtime: never)
2024-10-17 13:51:58,718 INFO: Scaling up Deployment orange/test-downscaler from 2 to 5 replicas (uptime: Mon-Fri 15:51-15:55 Europe/Paris, downtime: never)

We only encounter the error with ScaledObjects.

samuel-esp commented 3 weeks ago

Could you also share the configuration of the KubeDownscaler deployment? Roughly how many workloads inside your cluster are targets of the downscaling operation? Are you running other workloads that need to communicate a lot with the API server?

Also, it would be great if you could replicate the issue inside another cluster. @Fovty @JTaeuber, are you guys able to test it as well? Inside my test cluster I wasn't able to reproduce this; I'll try soon with another one.

cecchcc commented 3 weeks ago

We only have 1 workload. We mostly use ScaledObjects, so we want to validate that the downscaler works fine with them before using it on other workloads.

Here is the Kubedownscaler configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-downscaler-py-kube-downscaler
  namespace: kube-downscaler
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 24.10.1
    application: kube-downscaler-py-kube-downscaler
    argocd.argoproj.io/instance: kube-downscaler
    helm.sh/chart: py-kube-downscaler-0.2.10
spec:
  replicas: 1
  selector:
    matchLabels:
      application: kube-downscaler-py-kube-downscaler
  template:
    metadata:
      labels:
        application: kube-downscaler-py-kube-downscaler
    spec:
      containers:
        - name: py-kube-downscaler
          image: ghcr.io/caas-team/py-kube-downscaler:24.10.1
          args:
            - '--interval=60'
            - '--include-resources=deployments,statefulsets,scaledobjects'
samuel-esp commented 3 weeks ago

Sorry for asking so many questions: could you also provide the versions you are using for both Kubernetes and Keda?

samuel-esp commented 3 weeks ago

@JTaeuber @Fovty I saw PyKube was a little behind with dependencies so I opened caas-team/new-pykube#27 to bump dependencies there as well.

If I don't manage to replicate the issue tomorrow either, I'll try to build a custom image of kube-downscaler with the new pykube version and give it to @cecchcc to test. It may be some odd network issue where the pykube and kube-downscaler dependencies need to be perfectly aligned.

samuel-esp commented 3 weeks ago

The new release seems to work fine with older Kubernetes versions as well (just tested on 1.27, self-managed). I will try to understand whether it is EKS-related.

@cecchcc, another test you could do on your side is to try to replicate the behavior inside another test cluster, if you have one at your disposal.

cecchcc commented 3 weeks ago

We are using Keda 2.15.1 and Kubernetes 1.30

samuel-esp commented 3 weeks ago

@cecchcc, can you join the Slack linked in the docs? I'll give you some instructions to test a new image later this afternoon.