SovereignCloudStack / cluster-stacks

Definition of Cluster Stacks based on the ClusterAPI ClusterClass feature
Apache License 2.0

openstack-scs-1-29-v1 / openstack-scs-1-28-v2 not deployable (cilium issues) #143

Closed Nils98Ar closed 5 days ago

Nils98Ar commented 1 month ago

/kind bug

What steps did you take and what happened:

Create an openstack-scs-1-29-v1 or openstack-scs-1-28-v2 cluster.

The cluster deployment gets stuck at 3/3 worker nodes and 1/3 control plane nodes. All nodes are stuck in the NotReady status. The nodes do not get an internal IP:

NAME                                   STATUS     ROLES           VERSION   INTERNAL-IP
cluster-scs-n64mk-f4xgt                NotReady   control-plane   v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-n596p   NotReady   <none>          v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-npxzf   NotReady   <none>          v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-vrtdl   NotReady   <none>          v1.29.6   <none>

Different pods have the following line in their logs:

Error from server: no preferred addresses found; known addresses: []

One of the first errors in the nodes' /var/log/syslog is:

cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config

The directory /etc/cni/net.d is empty on the nodes.

What did you expect to happen:

The cluster is created successfully and usable.

Nils98Ar commented 1 month ago

This could be the reason (cso-controller-manager logs):

{
  "level": "ERROR",
  "time": "2024-07-18T15:36:24.881Z",
  "file": "kube/kube.go:206",
  "message": "failed to apply object",
  "controller": "clusteraddon",
  "controllerGroup": "",
  "controllerKind": "ClusterAddon",
  "ClusterAddon": {
    "name": "cluster-addon-cluster-scs",
    "namespace": "project-test"
  },
  "namespace": "kube-system",
  "name": "cilium",
  "reconcileID": "ca7c0a4b-19a8-47f6-a99a-04c254712b1d",
  "obj": "apps/v1, Kind=DaemonSet",
  "error": "failed to apply object: failed to create typed patch object (kube-system/cilium; apps/v1, Kind=DaemonSet): .spec.template.spec.securityContext.appArmorProfile: field not declared in schema",
  "stacktrace": "*kube).Apply\n\t/src/cluster-stack-operator/pkg/kube/kube.go:206\*ClusterAddonReconciler).templateAndApplyClusterAddonHelmChart\n\t/src/cluster-stack-operator/internal/controller/clusteraddon_controller.go:737\*ClusterAddonReconciler).Reconcile\n\t/src/cluster-stack-operator/internal/controller/clusteraddon_controller.go:276\*Controller).Reconcile\n\t/src/cluster-stack-operator/vendor/\*Controller).reconcileHandler\n\t/src/cluster-stack-operator/vendor/\*Controller).processNextWorkItem\n\t/src/cluster-stack-operator/vendor/\*Controller).Start.func2.2\n\t/src/cluster-stack-operator/vendor/"
}
Nils98Ar commented 1 month ago

It seems that .spec.template.spec.securityContext.appArmorProfile was introduced in Kubernetes 1.30 and in cilium helm chart version 1.15.5 (the mentioned ClusterStack releases use version 1.15.6).

The helm chart should normally check the Kubernetes version using .Capabilities.KubeVersion.Version during helm install and skip the appArmorProfile for Kubernetes versions < 1.30. Maybe this does not work in the ClusterStacks scenario? I am not sure in which context the templating is done.

Nils98Ar commented 1 month ago

These should be all relevant parts of the helm chart with checks for Kubernetes < 1.30:

chess-knight commented 1 month ago

CSO effectively runs helm template | kubectl apply -f -, so the chart is rendered without asking the workload cluster for its version. That's why the semverCompare logic in Cilium's helm chart doesn't work here. It should work for 1.30, as you wrote, but for < 1.30.0 it is a bug.
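To illustrate the failure mode, here is a minimal Python sketch of a version-gated template. It is not the Cilium chart's actual logic: render_pod_security_context, ASSUMED_DEFAULT_KUBE_VERSION and the rendered field value are hypothetical, and only the appArmorProfile field and the 1.30 threshold come from this thread. It shows how rendering without the workload cluster's version can emit a field that a 1.29 API server does not know.

```python
# Hypothetical stand-in for the version the renderer falls back to when it
# does not know the cluster version; the real default depends on the Helm
# build and is an assumption here.
ASSUMED_DEFAULT_KUBE_VERSION = "v1.30.0"


def parse_version(version):
    """Turn 'v1.29.6' into (1, 29, 6) so versions compare numerically."""
    core = version.lstrip("v").split("-")[0].split("+")[0]
    parts = [int(p) for p in core.split(".")]
    # Pad to three components so '1.30' compares equal to '1.30.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def render_pod_security_context(kube_version=None):
    """Mimic a chart that only emits appArmorProfile on Kubernetes >= 1.30."""
    effective = kube_version or ASSUMED_DEFAULT_KUBE_VERSION
    security_context = {}
    if parse_version(effective) >= (1, 30, 0):
        # Illustrative value; the point is only that the field is present.
        security_context["appArmorProfile"] = {"type": "Unconfined"}
    return security_context


# Rendered without a known version: the gate sees the assumed default.
print(render_pod_security_context())
# Rendered with the workload cluster's real version: the field is skipped.
print(render_pod_security_context("v1.29.6"))
```

If the fallback version is 1.30 or newer, the first call includes appArmorProfile even though the target cluster runs 1.29, which matches the "field not declared in schema" error in the log.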

Nils98Ar commented 1 month ago

Yes it does work for 1.30.

Nils98Ar commented 1 month ago

By the way: It seems that older Kubernetes 1.28/1.29 openstack-scs releases do not work either, because of a missing security group „0“ according to CSPO. But I guess as soon as the new versions work the old ones are obsolete anyway.

chess-knight commented 1 month ago

> By the way: It seems that older Kubernetes 1.28/1.29 openstack-scs releases do not work either, because of a missing security group „0“ according to CSPO. But I guess as soon as the new versions work the old ones are obsolete anyway.

AFAIK CSPO only cares about node images. What do you mean by security group „0“?

michal-gubricky commented 1 month ago

Hi @Nils98Ar, I just tested the creation of the cluster using the main branch of the cluster-stacks repo, built it via csctl, and did not encounter your error. The Kubernetes version is 1.28.11.

NAME                                            STATUS   ROLES           AGE     VERSION
test-cluster-5cgr8-4pj6m                        Ready    control-plane   3m38s   v1.28.11
test-cluster-5cgr8-tst5n                        Ready    control-plane   31m     v1.28.11
test-cluster-5cgr8-xdwvh                        Ready    control-plane   24m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-2v865   Ready    <none>          28m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-jdldc   Ready    <none>          24m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-p5wh7   Ready    <none>          24m     v1.28.11
chess-knight commented 1 month ago

@michal-gubricky, what is the state of the ClusterAddon object?

michal-gubricky commented 1 month ago

> @michal-gubricky, what is the state of the ClusterAddon object?

Here are all pods in kube-system namespace and also state of the cluster-addon resource:

ubuntu@mg-cluster-stack-vm:~$ k get cluster-addon-test-cluster 
NAME                         CLUSTER        HOOK   READY   AGE   REASON   MESSAGE
cluster-addon-test-cluster   test-cluster          true    79m 
ubuntu@mg-cluster-stack-vm:~$ k get po -n kube-system --kubeconfig test-cluster.kubeconfig 
NAME                                                     READY   STATUS    RESTARTS         AGE
cilium-fk2b9                                             1/1     Running   1                66m
cilium-gmh4x                                             1/1     Running   0                39m
cilium-l9jgw                                             1/1     Running   0                63m
cilium-lgmsv                                             1/1     Running   0                60m
cilium-mj7qz                                             1/1     Running   1 (49m ago)      60m
cilium-ncxr4                                             1/1     Running   0                52m
cilium-operator-8645b8bb4f-ppd9l                         1/1     Running   9 (3m28s ago)    66m
cilium-operator-8645b8bb4f-v9vl7                         1/1     Running   9 (5m46s ago)    66m
coredns-5dd5756b68-fhdn2                                 1/1     Running   0                66m
coredns-5dd5756b68-r7mwx                                 1/1     Running   0                66m
etcd-test-cluster-5cgr8-4pj6m                            1/1     Running   1 (19m ago)      39m
etcd-test-cluster-5cgr8-tst5n                            1/1     Running   1 (19m ago)      66m
etcd-test-cluster-5cgr8-xdwvh                            1/1     Running   0                60m
kube-apiserver-test-cluster-5cgr8-4pj6m                  1/1     Running   2 (21m ago)      39m
kube-apiserver-test-cluster-5cgr8-tst5n                  1/1     Running   5 (23m ago)      67m
kube-apiserver-test-cluster-5cgr8-xdwvh                  1/1     Running   4 (23m ago)      60m
kube-controller-manager-test-cluster-5cgr8-4pj6m         1/1     Running   1 (27m ago)      39m
kube-controller-manager-test-cluster-5cgr8-tst5n         1/1     Running   9 (3m33s ago)    66m
kube-controller-manager-test-cluster-5cgr8-xdwvh         1/1     Running   3 (5m48s ago)    60m
kube-proxy-5dhjg                                         1/1     Running   0                39m
kube-proxy-649sl                                         1/1     Running   0                60m
kube-proxy-7gs4w                                         1/1     Running   0                63m
kube-proxy-7hpxb                                         1/1     Running   0                66m
kube-proxy-c62mx                                         1/1     Running   0                52m
kube-proxy-ch5fd                                         1/1     Running   0                52m
kube-scheduler-test-cluster-5cgr8-4pj6m                  1/1     Running   2 (3m28s ago)    39m
kube-scheduler-test-cluster-5cgr8-tst5n                  1/1     Running   8 (24m ago)      67m
kube-scheduler-test-cluster-5cgr8-xdwvh                  1/1     Running   3 (5m46s ago)    60m
metrics-server-666c6745d5-d6nvf                          1/1     Running   0                66m
openstack-cinder-csi-controllerplugin-78c4557887-qhvjr   6/6     Running   17 (3m30s ago)   66m
openstack-cinder-csi-nodeplugin-qxmdq                    3/3     Running   0                60m
openstack-cinder-csi-nodeplugin-rvppn                    3/3     Running   0                63m
openstack-cinder-csi-nodeplugin-ssll9                    3/3     Running   0                39m
openstack-cinder-csi-nodeplugin-t6cql                    3/3     Running   0                66m
openstack-cinder-csi-nodeplugin-vt9z6                    3/3     Running   0                60m
openstack-cinder-csi-nodeplugin-w8lcr                    3/3     Running   0                60m
openstack-cloud-controller-manager-4vdcl                 1/1     Running   2 (2m47s ago)    52m
openstack-cloud-controller-manager-6dkfw                 1/1     Running   2 (5m48s ago)    35m
openstack-cloud-controller-manager-f84zp                 1/1     Running   4 (19m ago)      46m
chess-knight commented 1 month ago

As @Nils98Ar wrote, the breaking change was introduced in the cilium chart version 1.15.5. The main branch installs version 1.15.2, which is why it works for you, @michal-gubricky. I checked cluster-addon/Chart.lock against cluster-addon/Chart.yaml, and they differ. We are missing the helm dependency update command there. Please also check the release- branches, where it is correct.
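A check like this could be scripted. The following Python sketch compares the dependency versions in a Chart.yaml against a Chart.lock. Everything in it is illustrative: the function names, the sample file contents and the repository URL are made up, the parser is a deliberately small line scanner rather than a real YAML parser, and it assumes exact version pins with name listed before version in each entry. Only the two cilium versions (1.15.6 and 1.15.2) come from this thread.

```python
import re


def dependency_versions(chart_text):
    """Collect {name: version} from the 'dependencies:' list of a chart file."""
    versions = {}
    in_deps = False
    current = None
    for line in chart_text.splitlines():
        if re.match(r"^dependencies:\s*$", line):
            in_deps = True
            continue
        # Any other top-level key (not indented, not a list item) ends the list.
        if in_deps and re.match(r"^[^\s#-]", line):
            in_deps = False
        if not in_deps:
            continue
        m = re.match(r"^\s*-?\s*(name|version):\s*[\"']?([^\"'\s]+)", line)
        if not m:
            continue
        key, value = m.groups()
        if key == "name":
            current = value
        elif current is not None:
            versions[current] = value
    return versions


def mismatched(chart_yaml, chart_lock):
    """Return {name: (yaml_version, lock_version)} where the two files differ."""
    wanted = dependency_versions(chart_yaml)
    locked = dependency_versions(chart_lock)
    return {
        name: (version, locked.get(name))
        for name, version in wanted.items()
        if locked.get(name) != version
    }


# Made-up sample contents that mirror the situation described above.
CHART_YAML = """apiVersion: v2
name: cluster-addon
version: 0.1.0
dependencies:
  - name: cilium
    version: 1.15.6
"""
CHART_LOCK = """dependencies:
- name: cilium
  repository: https://example.invalid/charts
  version: 1.15.2
digest: sha256:0000
"""

print(mismatched(CHART_YAML, CHART_LOCK))
```

A non-empty result would indicate that the lock file is stale and helm dependency update has not been run since Chart.yaml changed.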

michal-gubricky commented 1 month ago

> As @Nils98Ar wrote, the breaking change was introduced in the cilium chart version 1.15.5. The main branch installs version 1.15.2, which is why it works for you, @michal-gubricky. I checked cluster-addon/Chart.lock against cluster-addon/Chart.yaml, and they differ. We are missing the helm dependency update command there. Please also check the release- branches, where it is correct.

Yeah, I was just looking at the version in Chart.yaml and there is 1.15.6.

chess-knight commented 3 weeks ago

Hi @janiskemper, can you please take a look? IMO we have three options here:

  1. Use a workaround from #150
  2. Use helm template --kube-version ... in the CSO cluster-addon logic, if it is possible for this controller to know the workload cluster's k8s version. Of course, we first need to test whether this is enough. I think templating charts with a known k8s version is a good idea beyond the cilium helm chart, because multiple helm charts probably use Capabilities.KubeVersion.
  3. Downgrade cilium chart for k8s < 1.30
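
Option 2 could look roughly like the following Python sketch, which only builds the command line. It is not CSO code (CSO is written in Go); helm_template_command, the release name and the chart path are hypothetical, and how the controller learns the workload cluster's version is left open by taking it as a parameter. The --kube-version and --namespace flags of helm template are the only real interface used.

```python
import shlex


def helm_template_command(release, chart_path, namespace, kube_version=None):
    """Build a 'helm template' argv, pinning the Kubernetes version if known."""
    cmd = ["helm", "template", release, chart_path, "--namespace", namespace]
    if kube_version:
        # Sets Capabilities.KubeVersion for the render, so version checks in
        # the chart see the workload cluster's version.
        cmd += ["--kube-version", kube_version]
    return cmd


# Hypothetical release name and chart path; v1.29.6 is the node version
# reported in this issue.
cmd = helm_template_command("cilium", "./cluster-addon", "kube-system", "v1.29.6")
print(shlex.join(cmd))
```

Whether pinning the version alone is enough for every chart in the cluster addon would still need to be tested, as noted in option 2.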