cnti-testcatalog / testsuite


[Feature] cluster_api_setup enhancement #1981

Open svteb opened 2 months ago

svteb commented 2 months ago

Is your feature request related to a problem? Please describe.
The cluster_api_setup function does not seem to create a fully functioning cluster:

./cnf-testsuite cluster_api_setup
...
I, [2024-04-17 07:44:11 +00:00 #62977]  INFO -- cnf-testsuite: wait_for_install_by_apply
I, [2024-04-17 07:44:11 +00:00 #62977]  INFO -- cnf-testsuite: KubectlClient::Apply.file command: kubectl apply -f /home/ubuntu/testsuite/capi.yaml
I, [2024-04-17 07:44:13 +00:00 #62977]  INFO -- cnf-testsuite: second_count = 0
I, [2024-04-17 07:44:14 +00:00 #62977]  INFO -- cnf-testsuite: KubectlClient::Apply.file command: kubectl apply -f /home/ubuntu/testsuite/capi.yaml
I, [2024-04-17 07:44:15 +00:00 #62977]  INFO -- cnf-testsuite: second_count = 1
I, [2024-04-17 07:44:16 +00:00 #62977]  INFO -- cnf-testsuite: KubectlClient::Apply.file command: kubectl apply -f /home/ubuntu/testsuite/capi.yaml
I, [2024-04-17 07:44:16 +00:00 #62977]  INFO -- cnf-testsuite: second_count = 2
I, [2024-04-17 07:44:17 +00:00 #62977]  INFO -- cnf-testsuite: KubectlClient::Apply.file command: kubectl apply -f /home/ubuntu/testsuite/capi.yaml
...
This goes on until second_count = 180.

The same seems to occur even in GitHub Actions. Furthermore:

kubectl get cluster && echo "-------" && kubectl get clusterclass && echo "-------" && kubectl get kubeadmcontrolplane && echo "-------" && kubectl get machinedeployments && echo "-------" && kubectl get machinepools && echo "-------" && kubectl get dockerclustertemplate && echo "-------" && kubectl get dockermachinetemplate && echo "-------" && kubectl get dockermachinepooltemplate && echo "-------" && kubectl get kubeadmconfigtemplate
NAME              CLUSTERCLASS   PHASE          AGE   VERSION
capi-quickstart   quick-start    Provisioning   12m   v1.23.0
-------
NAME          AGE
quick-start   12m
-------
NAME                    CLUSTER           INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
capi-quickstart-x84tj   capi-quickstart                                                                                   12m   v1.23.0
-------
NAME                         CLUSTER           REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE       AGE   VERSION
capi-quickstart-md-0-mtzjv   capi-quickstart   3                  3         3             ScalingUp   12m   v1.23.0
-------
NAME                         CLUSTER           REPLICAS   PHASE     AGE   VERSION
capi-quickstart-mp-0-npvf4   capi-quickstart              Pending   12m   v1.23.0
-------
NAME                  AGE
quick-start-cluster   12m
-------
NAME                                         AGE
capi-quickstart-md-0-dvkh4                   12m
capi-quickstart-whb7x                        12m
quick-start-control-plane                    12m
quick-start-default-worker-machinetemplate   12m
-------
NAME                                             AGE
quick-start-default-worker-machinepooltemplate   12m
-------
NAME                                           AGE
capi-quickstart-md-0-x2khp                     12m
quick-start-default-worker-bootstraptemplate   12m

It is apparent that the cluster does not get fully provisioned; on my side there seem to be various issues with the control plane.

This does not impede the clusterapi_enabled test, as it does not verify in-depth functionality, but I think it would be beneficial to set up a full-fledged cluster that allows for more refined tests.

Describe the solution you'd like
Since KinD is the main cluster creation tool for testsuite development, the clusterapi deployment should be tailored in such a way that it works properly. Ironing out all the issues will probably take considerable effort (I've spent a solid 3 days trying to make sense of clusterapi but did not get far).

Nevertheless, I've come across some hacks that resolved a few errors:

  1. Mounting the Docker socket. The capd-controller logs (which I currently do not have) mentioned a missing Docker socket mount; it should be documented somewhere that this is necessary. This is resolved by adding the mount during cluster creation:

KinD (there is probably a better way, but this worked):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ipFamily: dual
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
- role: worker
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
- role: worker
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
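
For completeness, a config like the one above would presumably be passed when creating the cluster, roughly like this (the file name is just a placeholder):

kind create cluster --config kind-capi-config.yaml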

minikube

minikube start --nodes=3 --cpus="no-limit" --memory=6g --kubernetes-version="v1.23.13" --container-runtime=containerd --mount --mount-string="/var/run/docker.sock:/var/run/docker.sock"
  2. Exporting EXP_MACHINE_POOL. Once again, I do not know to what extent this is necessary, but it allows the machinepools resource to move from Pending to Running (which resolved some of the control-plane errors).
    export EXP_MACHINE_POOL=true
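
If I understand the Cluster API docs correctly, EXP_MACHINE_POOL is an experimental feature gate, so it presumably has to be set before the providers are initialized; something along these lines (the docker infrastructure provider is just my assumption for a KinD/CAPD setup):

# Enable the MachinePool feature gate before initializing the providers
export EXP_MACHINE_POOL=true
clusterctl init --infrastructure docker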

Once this issue is addressed, how will the fix be verified?
There should be some health check of the workload cluster that is created after applying the capi.yaml file.
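
As a rough sketch of what such a health check could look like (assuming the capi-quickstart naming from the output above and that clusterctl is available; this is not what the testsuite currently does):

# Wait for the Cluster resource in the management cluster to report Ready
kubectl wait --for=condition=Ready cluster/capi-quickstart --timeout=600s
# Inspect the overall state of the cluster topology
clusterctl describe cluster capi-quickstart
# Fetch the workload cluster kubeconfig and check that its nodes become Ready
clusterctl get kubeconfig capi-quickstart > capi-quickstart.kubeconfig
kubectl --kubeconfig capi-quickstart.kubeconfig wait --for=condition=Ready nodes --all --timeout=300s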