apache / cloudstack-kubernetes-provider

Apache Cloudstack Kubernetes Provider
https://cloudstack.apache.org/
Apache License 2.0

Unable to auto-scale Kubernetes cluster #52

Open saffronjam opened 11 months ago

saffronjam commented 11 months ago

Hi!

I am unable to auto-scale Kubernetes clusters. As I understand it, enabling auto-scaling creates a "cluster-autoscaler" deployment that decides whether to scale or not. However, it does not seem to work: the pod logs multiple errors and warnings, even though this is a completely clean cluster.

Normal scaling seems to work just fine.
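For reference, a quick way to confirm the deployment exists and pull its logs (a sketch, assuming the deployment is named cluster-autoscaler in the kube-system namespace, which matches the service account shown in the errors below):

kubectl -n kube-system get deployment cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=200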

Setup

A "default" CloudStack setup 4.18 running KVMs.

Settings (relevant)

The nodes use the following service offering:

Steps to replicate

  1. Create a new cluster using the Kubernetes 1.24 ISO found here: http://download.cloudstack.org/cks/

  2. Enable forced auto-scaling. Since the cluster starts with only one worker node, auto-scaling with a 3-5 node range should trigger an upscale (I assume). (Screenshot from 2023-08-07 16-55-00)

  3. Check the logs for cluster-autoscaler in the Kubernetes cluster. Some notable entries:

    
E0807 14:41:30.317148       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: csidrivers.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csidrivers" in API group "storage.k8s.io" at the cluster scope

E0807 14:41:32.388828       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.CSIStorageCapacity: failed to list *v1beta1.CSIStorageCapacity: csistoragecapacities.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csistoragecapacities" in API group "storage.k8s.io" at the cluster scope
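These errors look like missing RBAC permissions for the autoscaler's service account. A quick way to confirm (a sketch, using the service account name from the errors above):

kubectl auth can-i list csidrivers.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i list csistoragecapacities.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler

If both print "no", that matches the "forbidden" messages, though these listers may not be what blocks the actual scale-up.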

Even though I have not edited anything myself (it is just a clean CKS cluster), I also get these warnings:

W0807 14:41:43.251280 1 clusterstate.go:590] Failed to get nodegroup for 6a4c91a3-9694-4596-9ddd-dc86e60136ff: Unable to find node 6a4c91a3-9694-4596-9ddd-dc86e60136ff in cluster

W0807 14:41:43.251361 1 clusterstate.go:590] Failed to get nodegroup for bd0b855f-6dc6-4678-9bea-b52329333024: Unable to find node bd0b855f-6dc6-4678-9bea-b52329333024 in cluster

I0807 14:57:06.667061 1 static_autoscaler.go:341] 2 unregistered nodes present


The IDs are correct in CloudStack (they match the cluster's VM IDs).
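For anyone who wants to cross-check the same thing, a sketch of one way to compare the node providerIDs reported by Kubernetes against the VM IDs known to CloudStack (the cmk lines assume the CloudMonkey CLI is configured against the same management server; the UUIDs are the ones from the warnings above):

kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID
cmk list virtualmachines id=6a4c91a3-9694-4596-9ddd-dc86e60136ff filter=id,name,state
cmk list virtualmachines id=bd0b855f-6dc6-4678-9bea-b52329333024 filter=id,name,state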

The entire log:
[logs-from-cluster-autoscaler-in-cluster-autoscaler-5bf887ddd8-hxg2g.log](https://github.com/apache/cloudstack-kubernetes-provider/files/12281530/logs-from-cluster-autoscaler-in-cluster-autoscaler-5bf887ddd8-hxg2g.log)

Please tell me if you need more logs to look at, or if I should try some other configuration.

Thanks!
rohityadavcloud commented 11 months ago

cc @Pearl1594 @weizhouapache @DaanHoogland please help triage when you have time.

@saffronjam this looks like an issue with the k8s autoscaler (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloudstack/README.md) or with CKS (upstream: https://github.com/apache/cloudstack).

kiranchavala commented 4 months ago

Hi @saffronjam

The autoscaling feature works fine on a k8s cluster deployed by CKS.

Please find the steps that I have followed:

After you enable autoscaling on the cluster:

(Screenshot of the autoscaling settings, 2024-02-28)

Make sure the cluster-autoscaler pod is deployed in the cluster:

kubectl get pods -A
NAMESPACE              NAME                                             READY   STATUS    RESTARTS      AGE
kube-system            cluster-autoscaler-8d8894d6c-q8r4h               1/1     Running   0             19m

Before scaling

➜  ~ k get nodes -A
NAME                     STATUS   ROLES           AGE   VERSION
gh-control-18debd77e18   Ready    control-plane   10h   v1.28.4
gh-node-18debd8440c      Ready    <none>          10h   v1.28.4

Deploy an application:

kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -- /agnhost netexec --http-port=80


➜  ~ k get pods -A
NAMESPACE              NAME                                             READY   STATUS    RESTARTS      AGE
default                hello-node-7c6c5fb9d8-bgd69                      1/1     Running   0             10h

Scale the application

kubectl scale --replicas=150 deployment/hello-node
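Scaling to 150 replicas leaves a batch of pods Pending, which is the "unschedulable pods" signal the autoscaler acts on. As a side note (my assumption, not part of the original steps), giving the deployment explicit resource requests makes the Pending condition trigger more predictably on small nodes:

kubectl set resources deployment hello-node --requests=cpu=100m,memory=128Mi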

Logs from the autoscaler pod:


I0228 04:51:46.798087       1 reflector.go:536] /home/djumani/lab/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 9 items received
I0228 04:51:51.244004       1 static_autoscaler.go:235] Starting main loop
I0228 04:51:51.244382       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=listKubernetesClusters&id=14b42c5d-e7e6-4c41-b638-5facb98b0a93&response=json&signature=***
I0228 04:51:51.279721       1 client.go:175] NewAPIRequest response status code:200
I0228 04:51:51.280798       1 cloudstack_manager.go:88] Got cluster : &{14b42c5d-e7e6-4c41-b638-5facb98b0a93 gh 2 3 1 1 [0xc0013bfad0 0xc0013bfb00] map[gh-control-18debd77e18:0xc0013bfad0 gh-node-18debd8440c:0xc0013bfb00]}
W0228 04:51:51.292009       1 clusterstate.go:590] Failed to get nodegroup for dc95f481-15a3-4629-bb78-055fbe4a7139: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
W0228 04:51:51.292052       1 clusterstate.go:590] Failed to get nodegroup for facdd040-53fe-4984-8654-c186a7cdde9b: Unable to find node facdd040-53fe-4984-8654-c186a7cdde9b in cluster
I0228 04:51:51.292095       1 static_autoscaler.go:341] 2 unregistered nodes present
I0228 04:51:51.292105       1 static_autoscaler.go:624] Removing unregistered node dc95f481-15a3-4629-bb78-055fbe4a7139
W0228 04:51:51.292126       1 static_autoscaler.go:627] Failed to get node group for dc95f481-15a3-4629-bb78-055fbe4a7139: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
W0228 04:51:51.292137       1 static_autoscaler.go:346] Failed to remove unregistered nodes: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
I0228 04:51:51.292569       1 filter_out_schedulable.go:65] Filtering out schedulables
I0228 04:51:51.292590       1 filter_out_schedulable.go:137] Filtered out 0 pods using hints
I0228 04:51:51.624523       1 filter_out_schedulable.go:175] 44 pods were kept as unschedulable based on caching
I0228 04:51:51.624568       1 filter_out_schedulable.go:176] 0 pods marked as unschedulable can be scheduled.
I0228 04:51:51.624667       1 filter_out_schedulable.go:87] No schedulable pods
I0228 04:51:51.870314       1 static_autoscaler.go:480] Calculating unneeded nodes
I0228 04:51:51.870353       1 pre_filtering_processor.go:66] Skipping gh-control-18debd77e18 - node group min size reached
I0228 04:51:51.870361       1 pre_filtering_processor.go:66] Skipping gh-node-18debd8440c - node group min size reached
I0228 04:51:51.870413       1 static_autoscaler.go:534] Scale down status: unneededOnly=false lastScaleUpTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 lastScaleDownDeleteTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 lastScaleDownFailTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false

I0228 04:52:22.258545       1 scale_up.go:468] Best option to resize: 14b42c5d-e7e6-4c41-b638-5facb98b0a93
I0228 04:52:22.258602       1 scale_up.go:472] Estimated 1 nodes needed in 14b42c5d-e7e6-4c41-b638-5facb98b0a93
I0228 04:52:22.266675       1 scale_up.go:595] Final scale-up plan: [{14b42c5d-e7e6-4c41-b638-5facb98b0a93 1->2 (max: 3)}]
I0228 04:52:22.266915       1 scale_up.go:691] Scale-up: setting group 14b42c5d-e7e6-4c41-b638-5facb98b0a93 size to 2
I0228 04:52:22.267040       1 cloudstack_node_group.go:57] Increase Cluster : 14b42c5d-e7e6-4c41-b638-5facb98b0a93 by 1
I0228 04:52:22.267238       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"3e317689-2939-4a66-b764-b2bb938c433c", APIVersion:"v1", ResourceVersion:"75712", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group 14b42c5d-e7e6-4c41-b638-5facb98b0a93 size to 2 instead of 1 (max: 3)
I0228 04:52:22.267350       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=scaleKubernetesCluster&id=14b42c5d-e7e6-4c41-b638-5facb98b0a93&response=json&size=2&signature=***
I0228 04:52:22.297307       1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:28.385682       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 10 items received
I0228 04:52:32.324971       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=queryAsyncJobResult&jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4&response=json&signature=***
I0228 04:52:32.346120       1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:32.360171       1 client.go:110] Still waiting for job 4e62a5a3-825c-435e-a6df-c22e756ee5e4 to complete
I0228 04:52:33.993372       1 reflector.go:536] /home/djumani/lab/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188: Watch close - *v1.Pod total 306 items received
I0228 04:52:42.328416       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=queryAsyncJobResult&jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4&response=json&signature=***
I0228 04:52:42.357795       1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:52.356394       1 client.go:110] Still waiting for job 4e62a5a3-825c-435e-a6df-c22e756ee5e4 to complete
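As the log shows, the scale-up boils down to the CloudStack scaleKubernetesCluster API call followed by polling the async job. A rough manual equivalent with cmk (a sketch using the same IDs as in the log above, not literally what the autoscaler runs) would be:

cmk scale kubernetescluster id=14b42c5d-e7e6-4c41-b638-5facb98b0a93 size=2
cmk query asyncjobresult jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4

Once the job completes, the new worker node shows up: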

➜  ~ k get nodes -A
NAME                     STATUS   ROLES           AGE   VERSION
gh-control-18debd77e18   Ready    control-plane   10h   v1.28.4
gh-node-18debd8440c      Ready    <none>          10h   v1.28.4
gh-node-18dee0e78ba      Ready    <none>          42s   v1.28.4
weizhouapache commented 1 month ago

I noticed @saffronjam used a regular user account to deploy the CKS cluster. Could that be related to the issue?
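One way to check which account the autoscaler authenticates as (a sketch; the exact flag and secret names can differ between CloudStack versions) is to look at the --cloud-config the deployment was started with:

kubectl -n kube-system get deployment cluster-autoscaler -o yaml | grep -i -A3 cloud-config

If the api-key/secret-key in that config belong to the regular user, any limits or visibility restrictions on that account would presumably also apply to the autoscaler's API calls.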