apache / cloudstack

Apache CloudStack is an open-source Infrastructure-as-a-Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

CKS: cluster does not fully deploy due to sshd not started on control node #9121

Closed: DaanHoogland closed this issue 1 day ago

DaanHoogland commented 3 months ago
ISSUE TYPE
COMPONENT NAME
CKS
CLOUDSTACK VERSION
4.19.0
CONFIGURATION

simple installation with CKS enabled

OS / ENVIRONMENT

4.19 with any hypervisor/network model

SUMMARY

When starting a CKS cluster, the control node does not start the SSH daemon, so the cluster never comes up.

STEPS TO REPRODUCE
1. Enable CKS
2. Install a CKS image
3. Deploy a cluster
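The last step (deploying a cluster) can also be scripted via CloudMonkey, along these lines (a hedged sketch; the UUID variables are placeholders, and the parameters follow the `createKubernetesCluster` API):

```shell
# Placeholders: fill in real UUIDs from your zone/offering/version listings.
ZONE_ID="${ZONE_ID:-<zone-uuid>}"
K8S_VERSION_ID="${K8S_VERSION_ID:-<k8s-version-uuid>}"
OFFERING_ID="${OFFERING_ID:-<service-offering-uuid>}"
# Guarded so the sketch is a no-op where CloudMonkey is not configured.
if command -v cmk >/dev/null 2>&1; then
  cmk create kubernetescluster name=test-cluster zoneid="$ZONE_ID" \
      kubernetesversionid="$K8S_VERSION_ID" serviceofferingid="$OFFERING_ID" size=1
else
  echo "cmk not installed; run from a host with CloudMonkey configured"
fi
```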
EXPECTED RESULTS
Cluster comes up (control node accessible via SSH).
ACTUAL RESULTS
No SSH daemon is active on the control node, hence the deployment times out.
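A quick way to confirm the symptom from the node's console (a sketch, since SSH itself is down; assumes a systemd-based CKS template):

```shell
# Run on the control node's console; SSH is unavailable at this point.
if pgrep -x sshd >/dev/null 2>&1; then
  sshd_state="running"
else
  # On affected nodes this branch is taken; starting the daemon manually
  # (systemctl start sshd) is reported later in the thread to restore access.
  sshd_state="not running"
fi
echo "sshd is ${sshd_state}"
```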
weizhouapache commented 3 months ago

@DaanHoogland can you share the hypervisor type and version ? cks iso link ?

DaanHoogland commented 3 months ago

The CKS version I tried is 1.27.8, but the user reported trying several versions. The host OS is VMware; I will verify others and update the description. I am first checking 4.18.1 (and possibly earlier) to see when this was introduced.

weizhouapache commented 3 months ago

And the network type, etc.

The public IP of the CKS cluster should be accessible from the CloudStack management server. In some setups, if the management server (on a private network) cannot reach the CKS nodes via their public IP to get the status of the CKS cluster, the cluster might end up in the Error state.
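That reachability can be checked from the management server with a plain TCP probe (a sketch; the IP below is only a placeholder for the cluster's real public IP):

```shell
# Placeholder: replace with the CKS cluster's real public IP.
PUBLIC_IP="${PUBLIC_IP:-203.0.113.10}"
# Try to open a TCP connection to the SSH port with a short timeout.
if timeout 5 bash -c "exec 3<>/dev/tcp/${PUBLIC_IP}/22" 2>/dev/null; then
  ssh_reachable="yes"
else
  ssh_reachable="no"
fi
echo "SSH port 22 on ${PUBLIC_IP} reachable: ${ssh_reachable}"
```

If the probe fails here, the Error state points at connectivity rather than at the node itself.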

luganofer commented 3 months ago

@weizhouapache In the tests I did recreating the problem, connectivity between the public IP of the CKS cluster and the management servers is enabled, and the problem occurs anyway. Also, if one of the nodes is accessed via the console and the SSH service is started manually, I can establish the SSH connection from the management servers.

weizhouapache commented 3 months ago

> the problem occurs anyway. Also, if one of the nodes is accessed via the console and the SSH service is started manually, I can establish the SSH connection from the management servers.

OK @luganofer, did it happen every time, or just once?

luganofer commented 3 months ago

@weizhouapache At least in my lab environment it happens in every deployment I try and with several different k8s versions (1.28.4, 1.27.8, 1.27.3, 1.26.6)

weizhouapache commented 3 months ago

> @weizhouapache At least in my lab environment it happens in every deployment I try and with several different k8s versions (1.28.4, 1.27.8, 1.27.3, 1.26.6)

@luganofer can you also share the hypervisor type and version, and the link to the CKS ISO?

luganofer commented 3 months ago

@weizhouapache I am using VMware vSphere 8.0c and all the ISOs were downloaded from the following link: https://download.cloudstack.org/cks/

Pearl1594 commented 3 months ago

@luganofer As the nodes / VMs come up, do you see any error logs in the VM console?

luganofer commented 3 months ago

Hi @Pearl1594, no error logs in the VM console.
Only the following error is observed in the management server logs:

ERROR [c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-84:ctx-216a9879 job-150679 ctx-7f68a95d) (logid:e493dc8f) Failed to setup Kubernetes cluster : maradona in usable state as unable to access control node VMs of the cluster

From my perspective, the problem is related to nodes that do not initialise correctly (cloud-init?). They receive an IP via DHCP from the VR, but they do not change their hostname and, crucially, do not start the SSH service, so the deployed nodes cannot be reached by the ACS management servers via SSH and the deployment of the k8s cluster never completes.
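If cloud-init is the suspect, the node's console can show whether it ever finished (a hedged sketch; `cloud-init status` and `hostname` are standard tools, assuming cloud-init is present on the template):

```shell
# On the affected node's console: did cloud-init complete, and was the hostname set?
cloud-init status --long 2>/dev/null || echo "cloud-init status unavailable"
node_hostname="$(hostname)"
echo "current hostname: ${node_hostname}"
```

A hostname still at the template default would support the theory that cloud-init never ran its per-instance modules.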

weizhouapache commented 3 months ago

> Hi @Pearl1594, no error logs in the VM console.
> Only the following error is observed in the management server logs:
>
> ERROR [c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-84:ctx-216a9879 job-150679 ctx-7f68a95d) (logid:e493dc8f) Failed to setup Kubernetes cluster : maradona in usable state as unable to access control node VMs of the cluster
>
> From my perspective, the problem is related to nodes that do not initialise correctly (cloud-init?). They receive an IP via DHCP from the VR, but they do not change their hostname and, crucially, do not start the SSH service, so the deployed nodes cannot be reached by the ACS management servers via SSH and the deployment of the k8s cluster never completes.

@luganofer If you are able to log into the vm (is the password still "password"?) and restart ssh, can you check the cloud-init logs? /var/log/cloud-init-*

Can you also double-check the VMware version? 8.0c, 8.0 update 1c, or 8.0 update 2c?
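The cloud-init logs mentioned above can be scanned for ssh-related failures with something like this (a sketch; `LOG_DIR` is parameterised only so the command is easy to adapt, the default being the standard location):

```shell
# Standard cloud-init log location; override LOG_DIR if logs were copied elsewhere.
LOG_DIR="${LOG_DIR:-/var/log}"
# Surface errors, failures, and ssh-related lines from the cloud-init logs.
matches="$(grep -ihE 'error|fail|ssh' "${LOG_DIR}"/cloud-init*.log 2>/dev/null | tail -n 40)"
echo "${matches:-no matching lines (or logs not found)}"
```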

DaanHoogland commented 3 months ago

> The CKS version I tried is 1.27.8, but the user reported trying several versions. The host OS is VMware; I will verify others and update the description. I am first checking 4.18.1 (and possibly earlier) to see when this was introduced.

Sorry, I forgot to report back: XCP-ng and KVM seem to work, only VMware is broken.

weizhouapache commented 3 months ago

> The CKS version I tried is 1.27.8, but the user reported trying several versions. The host OS is VMware; I will verify others and update the description. I am first checking 4.18.1 (and possibly earlier) to see when this was introduced.
>
> Sorry, I forgot to report back: XCP-ng and KVM seem to work, only VMware is broken.

Which VMware version did you test, @DaanHoogland? It seems to be working in the Trillian tests.

DaanHoogland commented 3 months ago

What test is verifying this, @weizhouapache? (As I recall it was 7.0u3, but I'll check.)

DaanHoogland commented 3 months ago

It was 8.0u1, @weizhouapache.

weizhouapache commented 3 months ago

> it was 80u1 , @weizhouapache

weizhouapache commented 3 weeks ago

@DaanHoogland there is a known issue where the systemvm/CKS node gets stuck at Starting on VMware 8.0u1: https://github.com/apache/cloudstack/issues/7572. Will you move this to the 4.20.0.0 milestone and test it later?

@sureshanaparti is working on VMware 8.0u1/u2/u3 support in 4.20.0.0.

weizhouapache commented 1 week ago

if this issue happens with vmware 8.0u1/u2/u3, it should have been addressed by #9625

cc @DaanHoogland @rohityadavcloud @sureshanaparti @JoaoJandre

JoaoJandre commented 2 days ago

> if this issue happens with vmware 8.0u1/u2/u3, it should have been addressed by #9625
>
> cc @DaanHoogland @rohityadavcloud @sureshanaparti @JoaoJandre

@weizhouapache I do not have a VMware 8 env to test this.

Could someone validate if the issue persists after #9625? cc @DaanHoogland @rohityadavcloud @sureshanaparti

DaanHoogland commented 1 day ago

Tested on both 8.0u2 and 8.0u3; both clusters are marked as Running, so I think it is safe to assume this is solved.