apache / cloudstack-kubernetes-provider

Apache Cloudstack Kubernetes Provider
https://cloudstack.apache.org/
Apache License 2.0

Unable to auto-scale Kubernetes cluster #52

Open saffronjam opened 11 months ago

saffronjam commented 11 months ago

Hi!

I am unable to auto-scale Kubernetes clusters. As I understand it, enabling auto-scaling creates a "cluster-autoscaler" deployment that decides whether to scale or not. However, it does not seem to work: the pod logs multiple errors and warnings, even though this is a completely clean cluster.

Normal scaling seems to work just fine.
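For reference, a quick way to confirm the deployment exists and pull its logs (a sketch, assuming the deployment is named cluster-autoscaler in the kube-system namespace, which matches the service account shown in the errors below):

kubectl -n kube-system get deployment cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=200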

Setup

A "default" CloudStack setup 4.18 running KVMs.

Settings (relevant)

The nodes use the following service offering:

Steps to replicate

  1. Create a new cluster using the Kubernetes 1.24 ISO found here: http://download.cloudstack.org/cks/

  2. Enable forced auto-scaling. Since the cluster starts with only one worker node, auto-scaling with a 3-5 node range should trigger an upscale (I assume). (Screenshot from 2023-08-07 16-55-00)

  3. Check the logs for cluster-autoscaler in the Kubernetes cluster. Some notable entries:

    
E0807 14:41:30.317148       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: csidrivers.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csidrivers" in API group "storage.k8s.io" at the cluster scope

E0807 14:41:32.388828       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.CSIStorageCapacity: failed to list *v1beta1.CSIStorageCapacity: csistoragecapacities.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csistoragecapacities" in API group "storage.k8s.io" at the cluster scope
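These errors look like missing RBAC permissions for the autoscaler's service account. A quick way to confirm (a sketch, using the service account name from the errors above):

kubectl auth can-i list csidrivers.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i list csistoragecapacities.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler

If both print "no", that matches the "forbidden" messages, though these listers may not be what blocks the actual scale-up.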

Even though I have not edited anything myself (it is just a clean CKS cluster), I also get these warnings:

W0807 14:41:43.251280 1 clusterstate.go:590] Failed to get nodegroup for 6a4c91a3-9694-4596-9ddd-dc86e60136ff: Unable to find node 6a4c91a3-9694-4596-9ddd-dc86e60136ff in cluster

W0807 14:41:43.251361 1 clusterstate.go:590] Failed to get nodegroup for bd0b855f-6dc6-4678-9bea-b52329333024: Unable to find node bd0b855f-6dc6-4678-9bea-b52329333024 in cluster

I0807 14:57:06.667061 1 static_autoscaler.go:341] 2 unregistered nodes present


The IDs are correct in CloudStack (they match the cluster's VM IDs).
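For anyone who wants to cross-check the same thing, a sketch of one way to compare the node providerIDs reported by Kubernetes against the VM IDs known to CloudStack (the cmk lines assume the CloudMonkey CLI is configured against the same management server; the UUIDs are the ones from the warnings above):

kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID
cmk list virtualmachines id=6a4c91a3-9694-4596-9ddd-dc86e60136ff filter=id,name,state
cmk list virtualmachines id=bd0b855f-6dc6-4678-9bea-b52329333024 filter=id,name,state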

The entire log:
[logs-from-cluster-autoscaler-in-cluster-autoscaler-5bf887ddd8-hxg2g.log](https://github.com/apache/cloudstack-kubernetes-provider/files/12281530/logs-from-cluster-autoscaler-in-cluster-autoscaler-5bf887ddd8-hxg2g.log)

Please tell me if you need more logs to look at, or if I should try some other configuration.

Thanks!
rohityadavcloud commented 11 months ago

cc @Pearl1594 @weizhouapache @DaanHoogland please help triage when you have time.

@saffronjam this looks like an issue with the k8s autoscaler (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloudstack/README.md) or with CKS (upstream: https://github.com/apache/cloudstack).

kiranchavala commented 4 months ago

Hi @saffronjam

The autoscaling feature works fine on a k8s cluster deployed by CKS.

Please find the steps that I have followed:

After you enable autoscaling on the cluster:

(Screenshot of the autoscaling settings, 2024-02-28)

Make sure the cluster-autoscaler pod is deployed in the cluster:

kubectl get pods -A
NAMESPACE              NAME                                             READY   STATUS    RESTARTS      AGE
kube-system            cluster-autoscaler-8d8894d6c-q8r4h               1/1     Running   0             19m

Before scaling

➜  ~ k get nodes -A
NAME                     STATUS   ROLES           AGE   VERSION
gh-control-18debd77e18   Ready    control-plane   10h   v1.28.4
gh-node-18debd8440c      Ready    <none>          10h   v1.28.4

Deploy an application:

kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -- /agnhost netexec --http-port=80


➜  ~ k get pods -A
NAMESPACE              NAME                                             READY   STATUS    RESTARTS      AGE
default                hello-node-7c6c5fb9d8-bgd69                      1/1     Running   0             10h

Scale the application

kubectl scale --replicas=150 deployment/hello-node
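Scaling to 150 replicas leaves a batch of pods Pending, which is the "unschedulable pods" signal the autoscaler acts on. As a side note (my assumption, not part of the original steps), giving the deployment explicit resource requests makes the Pending condition trigger more predictably on small nodes:

kubectl set resources deployment hello-node --requests=cpu=100m,memory=128Mi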

Logs from the autoscaler pod:


I0228 04:51:46.798087       1 reflector.go:536] /home/djumani/lab/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 9 items received
I0228 04:51:51.244004       1 static_autoscaler.go:235] Starting main loop
I0228 04:51:51.244382       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=listKubernetesClusters&id=14b42c5d-e7e6-4c41-b638-5facb98b0a93&response=json&signature=***
I0228 04:51:51.279721       1 client.go:175] NewAPIRequest response status code:200
I0228 04:51:51.280798       1 cloudstack_manager.go:88] Got cluster : &{14b42c5d-e7e6-4c41-b638-5facb98b0a93 gh 2 3 1 1 [0xc0013bfad0 0xc0013bfb00] map[gh-control-18debd77e18:0xc0013bfad0 gh-node-18debd8440c:0xc0013bfb00]}
W0228 04:51:51.292009       1 clusterstate.go:590] Failed to get nodegroup for dc95f481-15a3-4629-bb78-055fbe4a7139: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
W0228 04:51:51.292052       1 clusterstate.go:590] Failed to get nodegroup for facdd040-53fe-4984-8654-c186a7cdde9b: Unable to find node facdd040-53fe-4984-8654-c186a7cdde9b in cluster
I0228 04:51:51.292095       1 static_autoscaler.go:341] 2 unregistered nodes present
I0228 04:51:51.292105       1 static_autoscaler.go:624] Removing unregistered node dc95f481-15a3-4629-bb78-055fbe4a7139
W0228 04:51:51.292126       1 static_autoscaler.go:627] Failed to get node group for dc95f481-15a3-4629-bb78-055fbe4a7139: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
W0228 04:51:51.292137       1 static_autoscaler.go:346] Failed to remove unregistered nodes: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
I0228 04:51:51.292569       1 filter_out_schedulable.go:65] Filtering out schedulables
I0228 04:51:51.292590       1 filter_out_schedulable.go:137] Filtered out 0 pods using hints
I0228 04:51:51.624523       1 filter_out_schedulable.go:175] 44 pods were kept as unschedulable based on caching
I0228 04:51:51.624568       1 filter_out_schedulable.go:176] 0 pods marked as unschedulable can be scheduled.
I0228 04:51:51.624667       1 filter_out_schedulable.go:87] No schedulable pods
I0228 04:51:51.870314       1 static_autoscaler.go:480] Calculating unneeded nodes
I0228 04:51:51.870353       1 pre_filtering_processor.go:66] Skipping gh-control-18debd77e18 - node group min size reached
I0228 04:51:51.870361       1 pre_filtering_processor.go:66] Skipping gh-node-18debd8440c - node group min size reached
I0228 04:51:51.870413       1 static_autoscaler.go:534] Scale down status: unneededOnly=false lastScaleUpTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 lastScaleDownDeleteTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 lastScaleDownFailTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false

I0228 04:52:22.258545       1 scale_up.go:468] Best option to resize: 14b42c5d-e7e6-4c41-b638-5facb98b0a93
I0228 04:52:22.258602       1 scale_up.go:472] Estimated 1 nodes needed in 14b42c5d-e7e6-4c41-b638-5facb98b0a93
I0228 04:52:22.266675       1 scale_up.go:595] Final scale-up plan: [{14b42c5d-e7e6-4c41-b638-5facb98b0a93 1->2 (max: 3)}]
I0228 04:52:22.266915       1 scale_up.go:691] Scale-up: setting group 14b42c5d-e7e6-4c41-b638-5facb98b0a93 size to 2
I0228 04:52:22.267040       1 cloudstack_node_group.go:57] Increase Cluster : 14b42c5d-e7e6-4c41-b638-5facb98b0a93 by 1
I0228 04:52:22.267238       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"3e317689-2939-4a66-b764-b2bb938c433c", APIVersion:"v1", ResourceVersion:"75712", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group 14b42c5d-e7e6-4c41-b638-5facb98b0a93 size to 2 instead of 1 (max: 3)
I0228 04:52:22.267350       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=scaleKubernetesCluster&id=14b42c5d-e7e6-4c41-b638-5facb98b0a93&response=json&size=2&signature=***
I0228 04:52:22.297307       1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:28.385682       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 10 items received
I0228 04:52:32.324971       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=queryAsyncJobResult&jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4&response=json&signature=***
I0228 04:52:32.346120       1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:32.360171       1 client.go:110] Still waiting for job 4e62a5a3-825c-435e-a6df-c22e756ee5e4 to complete
I0228 04:52:33.993372       1 reflector.go:536] /home/djumani/lab/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188: Watch close - *v1.Pod total 306 items received
I0228 04:52:42.328416       1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=queryAsyncJobResult&jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4&response=json&signature=***
I0228 04:52:42.357795       1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:52.356394       1 client.go:110] Still waiting for job 4e62a5a3-825c-435e-a6df-c22e756ee5e4 to complete
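As the log shows, the scale-up boils down to the CloudStack scaleKubernetesCluster API call followed by polling the async job. A rough manual equivalent with cmk (a sketch using the same IDs as in the log above, not literally what the autoscaler runs) would be:

cmk scale kubernetescluster id=14b42c5d-e7e6-4c41-b638-5facb98b0a93 size=2
cmk query asyncjobresult jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4

Once the job completes, the new worker node shows up: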

➜  ~ k get nodes -A
NAME                     STATUS   ROLES           AGE   VERSION
gh-control-18debd77e18   Ready    control-plane   10h   v1.28.4
gh-node-18debd8440c      Ready    <none>          10h   v1.28.4
gh-node-18dee0e78ba      Ready    <none>          42s   v1.28.4
weizhouapache commented 1 month ago

I noticed @saffronjam used a regular user account to deploy the CKS cluster. Could that be related to the issue?
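One way to check which account the autoscaler authenticates as (a sketch; the exact flag and secret names can differ between CloudStack versions) is to look at the --cloud-config the deployment was started with:

kubectl -n kube-system get deployment cluster-autoscaler -o yaml | grep -i -A3 cloud-config

If the api-key/secret-key in that config belong to the regular user, any limits or visibility restrictions on that account would presumably also apply to the autoscaler's API calls.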