SovereignCloudStack / cluster-stacks

Definition of Cluster Stacks based on the ClusterAPI ClusterClass feature
https://scs.community/
Apache License 2.0

openstack-scs-1-29-v1 / openstack-scs-1-28-v2 not deployable (cilium issues) #143

Closed. Nils98Ar closed this issue 5 days ago.

Nils98Ar commented 1 month ago

/kind bug

What steps did you take and what happened:

Create an openstack-scs-1-29-v1 or openstack-scs-1-28-v2 cluster.

The cluster deployment gets stuck at 3/3 worker nodes and 1/3 control plane nodes. All nodes are stuck in the NotReady status. The nodes do not get an internal IP:

NAME                                   STATUS     ROLES           VERSION   INTERNAL-IP
cluster-scs-n64mk-f4xgt                NotReady   control-plane   v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-n596p   NotReady   <none>          v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-npxzf   NotReady   <none>          v1.29.6   <none>
cluster-scs-worker-dsvk6-56cwn-vrtdl   NotReady   <none>          v1.29.6   <none>

Different pods have the following line in their logs:

Error from server: no preferred addresses found; known addresses: []

One of the first errors in the nodes' /var/log/syslog is:

cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config

The directory /etc/cni/net.d is empty on the nodes.

What did you expect to happen:

The cluster is created successfully and is usable.

Nils98Ar commented 1 month ago

This could be the reason (cso-controller-manager logs):

{
  "level": "ERROR",
  "time": "2024-07-18T15:36:24.881Z",
  "file": "kube/kube.go:206",
  "message": "failed to apply object",
  "controller": "clusteraddon",
  "controllerGroup": "clusterstack.x-k8s.io",
  "controllerKind": "ClusterAddon",
  "ClusterAddon": {
    "name": "cluster-addon-cluster-scs",
    "namespace": "project-test"
  },
  "namespace": "kube-system",
  "name": "cilium",
  "reconcileID": "ca7c0a4b-19a8-47f6-a99a-04c254712b1d",
  "obj": "apps/v1, Kind=DaemonSet",
  "error": "failed to apply object: failed to create typed patch object (kube-system/cilium; apps/v1, Kind=DaemonSet): .spec.template.spec.securityContext.appArmorProfile: field not declared in schema",
  "stacktrace": "github.com/SovereignCloudStack/cluster-stack-operator/pkg/kube.(*kube).Apply\n\t/src/cluster-stack-operator/pkg/kube/kube.go:206\ngithub.com/SovereignCloudStack/cluster-stack-operator/internal/controller.(*ClusterAddonReconciler).templateAndApplyClusterAddonHelmChart\n\t/src/cluster-stack-operator/internal/controller/clusteraddon_controller.go:737\ngithub.com/SovereignCloudStack/cluster-stack-operator/internal/controller.(*ClusterAddonReconciler).Reconcile\n\t/src/cluster-stack-operator/internal/controller/clusteraddon_controller.go:276\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/src/cluster-stack-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"
}
Nils98Ar commented 1 month ago

It seems that .spec.template.spec.securityContext.appArmorProfile was introduced in Kubernetes 1.30 and in cilium Helm chart version 1.15.5 (the mentioned ClusterStack releases use cilium 1.15.6). https://kubernetes.io/docs/tutorials/security/apparmor/#securing-a-pod

Normally the Helm chart checks the Kubernetes version via .Capabilities.KubeVersion.Version during helm install and skips the appArmorProfile for Kubernetes versions < 1.30. Maybe this does not work in the ClusterStacks scenario? I am not sure in which context the templating is done. E.g. https://github.com/cilium/cilium/blob/v1.15.6/install/kubernetes/cilium/templates/cilium-agent/daemonset.yaml#L86-L94
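
One way to check this locally is to render the chart with an explicit Kubernetes version; a minimal sketch, assuming the official cilium Helm repository and using grep only as a rough indicator:

# add the upstream cilium chart repo (skip if already present)
helm repo add cilium https://helm.cilium.io
helm repo update

# with an explicit 1.29 kube version, the <1.30.0 guards should drop the field
helm template cilium cilium/cilium --version 1.15.6 --namespace kube-system \
  --kube-version v1.29.6 | grep -c appArmorProfile   # expected: 0

# without --kube-version, helm uses its built-in default Capabilities.KubeVersion,
# which may be >= 1.30 and keep the field in the rendered DaemonSet
helm template cilium cilium/cilium --version 1.15.6 --namespace kube-system \
  | grep -c appArmorProfile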

Nils98Ar commented 1 month ago

These should be all relevant parts of the helm chart with checks for Kubernetes < 1.30: https://github.com/search?q=repo%3Acilium%2Fcilium%20%22%3C1.30.0%22&type=code

chess-knight commented 1 month ago

CSO does helm template | kubectl apply -f -, which is why Cilium's semverCompare logic in the Helm chart doesn't work here. It should work for 1.30, as you wrote, but for <1.30.0 it is a bug.
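
A rough sketch of that pattern (the chart path, release name, and kubeconfig are illustrative, not the exact CSO invocation): because the templating happens client-side, the chart only sees helm's built-in default Capabilities.KubeVersion, not the workload cluster's actual version.

# illustrative only: no --kube-version, so .Capabilities.KubeVersion is
# helm's compiled-in default rather than the workload cluster's v1.29.6
helm template cilium ./cluster-addon/charts/cilium --namespace kube-system \
  | kubectl --kubeconfig test-cluster.kubeconfig apply -f -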

Nils98Ar commented 1 month ago

Yes, it does work for 1.30.

Nils98Ar commented 1 month ago

See https://github.com/SovereignCloudStack/cluster-stack-operator/issues/153.

Nils98Ar commented 1 month ago

By the way: it seems that the older Kubernetes 1.28/1.29 openstack-scs releases do not work either, because of a missing security group „0“ according to CSPO. But I guess as soon as the new versions work, the old ones are obsolete anyway.

chess-knight commented 1 month ago

AFAIK CSPO only cares about node images. What do you mean by security group „0“?

michal-gubricky commented 1 month ago

Hi @Nils98Ar, I just tested the creation of the cluster using the main branch of the cluster-stacks repo, built it via csctl, and did not encounter your error. The Kubernetes version is 1.28.11.

NAME                                            STATUS   ROLES           AGE     VERSION
test-cluster-5cgr8-4pj6m                        Ready    control-plane   3m38s   v1.28.11
test-cluster-5cgr8-tst5n                        Ready    control-plane   31m     v1.28.11
test-cluster-5cgr8-xdwvh                        Ready    control-plane   24m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-2v865   Ready    <none>          28m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-jdldc   Ready    <none>          24m     v1.28.11
test-cluster-default-worker-b6fx8-8zrmf-p5wh7   Ready    <none>          24m     v1.28.11
chess-knight commented 1 month ago

@michal-gubricky, what is the state of the ClusterAddon object?

michal-gubricky commented 1 month ago

Here are all pods in the kube-system namespace and also the state of the cluster-addon resource:

ubuntu@mg-cluster-stack-vm:~$ k get clusteraddons.clusterstack.x-k8s.io cluster-addon-test-cluster 
NAME                         CLUSTER        HOOK   READY   AGE   REASON   MESSAGE
cluster-addon-test-cluster   test-cluster          true    79m 
ubuntu@mg-cluster-stack-vm:~$ k get po -n kube-system --kubeconfig test-cluster.kubeconfig 
NAME                                                     READY   STATUS    RESTARTS         AGE
cilium-fk2b9                                             1/1     Running   1                66m
cilium-gmh4x                                             1/1     Running   0                39m
cilium-l9jgw                                             1/1     Running   0                63m
cilium-lgmsv                                             1/1     Running   0                60m
cilium-mj7qz                                             1/1     Running   1 (49m ago)      60m
cilium-ncxr4                                             1/1     Running   0                52m
cilium-operator-8645b8bb4f-ppd9l                         1/1     Running   9 (3m28s ago)    66m
cilium-operator-8645b8bb4f-v9vl7                         1/1     Running   9 (5m46s ago)    66m
coredns-5dd5756b68-fhdn2                                 1/1     Running   0                66m
coredns-5dd5756b68-r7mwx                                 1/1     Running   0                66m
etcd-test-cluster-5cgr8-4pj6m                            1/1     Running   1 (19m ago)      39m
etcd-test-cluster-5cgr8-tst5n                            1/1     Running   1 (19m ago)      66m
etcd-test-cluster-5cgr8-xdwvh                            1/1     Running   0                60m
kube-apiserver-test-cluster-5cgr8-4pj6m                  1/1     Running   2 (21m ago)      39m
kube-apiserver-test-cluster-5cgr8-tst5n                  1/1     Running   5 (23m ago)      67m
kube-apiserver-test-cluster-5cgr8-xdwvh                  1/1     Running   4 (23m ago)      60m
kube-controller-manager-test-cluster-5cgr8-4pj6m         1/1     Running   1 (27m ago)      39m
kube-controller-manager-test-cluster-5cgr8-tst5n         1/1     Running   9 (3m33s ago)    66m
kube-controller-manager-test-cluster-5cgr8-xdwvh         1/1     Running   3 (5m48s ago)    60m
kube-proxy-5dhjg                                         1/1     Running   0                39m
kube-proxy-649sl                                         1/1     Running   0                60m
kube-proxy-7gs4w                                         1/1     Running   0                63m
kube-proxy-7hpxb                                         1/1     Running   0                66m
kube-proxy-c62mx                                         1/1     Running   0                52m
kube-proxy-ch5fd                                         1/1     Running   0                52m
kube-scheduler-test-cluster-5cgr8-4pj6m                  1/1     Running   2 (3m28s ago)    39m
kube-scheduler-test-cluster-5cgr8-tst5n                  1/1     Running   8 (24m ago)      67m
kube-scheduler-test-cluster-5cgr8-xdwvh                  1/1     Running   3 (5m46s ago)    60m
metrics-server-666c6745d5-d6nvf                          1/1     Running   0                66m
openstack-cinder-csi-controllerplugin-78c4557887-qhvjr   6/6     Running   17 (3m30s ago)   66m
openstack-cinder-csi-nodeplugin-qxmdq                    3/3     Running   0                60m
openstack-cinder-csi-nodeplugin-rvppn                    3/3     Running   0                63m
openstack-cinder-csi-nodeplugin-ssll9                    3/3     Running   0                39m
openstack-cinder-csi-nodeplugin-t6cql                    3/3     Running   0                66m
openstack-cinder-csi-nodeplugin-vt9z6                    3/3     Running   0                60m
openstack-cinder-csi-nodeplugin-w8lcr                    3/3     Running   0                60m
openstack-cloud-controller-manager-4vdcl                 1/1     Running   2 (2m47s ago)    52m
openstack-cloud-controller-manager-6dkfw                 1/1     Running   2 (5m48s ago)    35m
openstack-cloud-controller-manager-f84zp                 1/1     Running   4 (19m ago)      46m
chess-knight commented 1 month ago

As @Nils98Ar wrote, the breaking change was introduced in cilium chart version 1.15.5. The main branch installs version 1.15.2, which is why it works for you, @michal-gubricky. I checked cluster-addon/Chart.lock vs. cluster-addon/Chart.yaml, and they differ. We are missing the helm dependency update command there. Please also check the release- branches, where it is correct.
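
A minimal sketch of the missing step, run against the chart directory mentioned above (the relative path is illustrative):

# regenerate Chart.lock (and charts/) from Chart.yaml so the pinned
# cilium dependency matches the declared version
helm dependency update ./cluster-addon

# verify that the lock file now references cilium 1.15.6
grep -A 2 'name: cilium' ./cluster-addon/Chart.lock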

michal-gubricky commented 1 month ago

Yeah, I was just looking at the version in Chart.yaml, and there it is 1.15.6.

chess-knight commented 3 weeks ago

Hi @janiskemper, can you please take a look? IMO we have three options here:

  1. Use a workaround from #150
  2. Use helm template --kube-version ... in the CSO cluster-addon logic, if it is possible for this controller to know the workload cluster's Kubernetes version. This of course needs to be tested first to confirm it is enough. Templating charts with the known Kubernetes version would be a good idea not only for the cilium Helm chart, because multiple Helm charts probably use Capabilities.KubeVersion (see the sketch after this list).
  3. Downgrade the cilium chart for Kubernetes < 1.30
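
For option 2, a rough sketch of what the templating step could look like, assuming the controller can learn the workload cluster's Kubernetes version (the variable name and chart path are illustrative):

# the version would have to come from the workload Cluster / control-plane object,
# e.g. v1.29.6, so that charts using Capabilities.KubeVersion (like cilium)
# render against the real workload version instead of helm's built-in default
WORKLOAD_K8S_VERSION=v1.29.6
helm template cilium ./cluster-addon/charts/cilium --namespace kube-system \
  --kube-version "${WORKLOAD_K8S_VERSION}" | kubectl apply -f -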