confidential-containers / operator

Operator to deploy confidential containers runtime
Apache License 2.0
102 stars 56 forks

Cannot bring up test k8s cluster on Fedora with `run-local.sh` #319

Open tylerfanelli opened 5 months ago

tylerfanelli commented 5 months ago

Describe the bug On Fedora, when running tests/e2e/run-local.sh, bringing up the local k8s cluster fails with an error that prevents the cluster from starting.

To Reproduce Steps to reproduce the behavior:

git clone https://github.com/confidential-containers/operator.git
./operator/tests/e2e/run-local.sh -r kata-qemu-snp

Describe the results you received: All previous checks pass, yet at the INFO: Bring up the test cluster step I'm met with the following:

INFO: Bring up the test cluster
[init] Using Kubernetes version: v1.24.0
[preflight] Running pre-flight checks
    [WARNING Swap]: swap is enabled; production deployments should disable swap unless testing the NodeSwap feature gate of the kubelet
    [WARNING SystemVerification]: missing optional cgroups: blkio
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [amd-milan-05.khw1.lab.eng.bos.redhat.com kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.6.6.77]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [amd-milan-05.khw1.lab.eng.bos.redhat.com localhost] and IPs [10.6.6.77 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [amd-milan-05.khw1.lab.eng.bos.redhat.com localhost] and IPs [10.6.6.77 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

Unfortunately, an error has occurred:
    timed out waiting for the condition

This error is likely caused by:
    - The kubelet is not running
    - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
    - 'systemctl status kubelet'
    - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
    - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
    Once you have found the failing container, you can inspect its logs with:
    - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

Notice the warning: [WARNING Swap]: swap is enabled; production deployments should disable swap unless testing the NodeSwap feature gate of the kubelet

I've looked up the error message [kubelet-check] It seems like the kubelet isn't running or healthy. and found this post discussing the issue. One popular answer attributes it to swap being enabled, which, given the warning above, seems like a likely cause of my problem.
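For anyone hitting the same wall: the failing check can be reproduced and diagnosed by hand. These are the same health endpoint kubeadm polls and the troubleshooting commands it suggests in its own output; the exact log contents will vary by system.

```shell
# Query the kubelet health endpoint kubeadm is polling;
# "connection refused" means the kubelet process never came up
curl -sSL http://localhost:10248/healthz

# Check the kubelet service state and its most recent log lines
sudo systemctl status kubelet --no-pager
sudo journalctl -u kubelet --no-pager | tail -n 50
```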

So, I undo the changes made by the test: $ ./run-test.sh -u

Check the current swaps:

$ cat /proc/swaps
Filename                Type        Size        Used        Priority
/dev/zram0                              partition   8388604     0       100

Disable swap:

$ sudo swapoff -a

Check to ensure the swaps are gone:

$ cat /proc/swaps
Filename                Type        Size        Used        Priority

Seems that they are gone, so I run run.sh again: $ ./run.sh -r kata-qemu-snp

And am met with the following:

INFO: Bring up the test cluster
[init] Using Kubernetes version: v1.24.0
[preflight] Running pre-flight checks
    [WARNING Swap]: swap is enabled; production deployments should disable swap unless testing the NodeSwap feature gate of the kubelet

... snip ...

[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

Seeing that the system reports swap being enabled again, I view the swaps being used:

$ cat /proc/swaps 
Filename                Type        Size        Used        Priority
/dev/zram0                              partition   8388604     0       100

Is there something in the system (k8s, systemd?) that is re-initializing swap? I ensure it is disabled before the test is run, yet afterwards it is re-enabled.
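On Fedora, zram swap is set up at boot by systemd's zram-generator, so one way to confirm what is re-creating /dev/zram0 is to inspect the generated swap unit. This is a diagnostic sketch; the unit and package names assume a stock Fedora install.

```shell
# Show active swap devices and the systemd units behind them
swapon --show
systemctl list-units --type swap --no-pager

# The swap unit for /dev/zram0 is typically dev-zram0.swap,
# generated at boot from the zram-generator default config
systemctl status dev-zram0.swap --no-pager

# Check whether the default zram config package is installed
rpm -q zram-generator-defaults
```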

Has anyone seen this before or have any potential solutions? Thanks.

tylerfanelli commented 5 months ago

@wainersm You mentioned that you have run some of these tests on Fedora before. Have you encountered this issue?

bpradipt commented 5 months ago

@tylerfanelli some time back @c3d suggested the following to disable swap permanently on Fedora; otherwise it comes back:

dnf remove zram-generator-defaults
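For reference, the full fix sequence might look like the following sketch. Removing the package stops zram-generator from re-creating the device on the next boot; swapoff takes care of the current one.

```shell
# Remove the package whose config re-creates the zram swap
# device on every boot
sudo dnf remove zram-generator-defaults

# Turn off any swap that is currently active
sudo swapoff -a

# Verify nothing is listed anymore
cat /proc/swaps
```

An alternative that keeps the package installed is to override its config with an empty file (sudo touch /etc/systemd/zram-generator.conf), which tells zram-generator to create no devices; either way, swap stays off across reboots.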
tylerfanelli commented 5 months ago

@bpradipt Thanks for pointing this out, your fix seems to have solved my problem.

I see that the image was built:

$ sudo docker images

REPOSITORY                   TAG                    IMAGE ID       CREATED          SIZE
localhost:5000/cc-operator   latest                 48f322b96469   35 minutes ago   54.2MB

Yet I don't think it is running:

$ sudo docker ps

CONTAINER ID   IMAGE            COMMAND                  CREATED          STATUS          PORTS                    NAMES
ab56de994a83   registry:2.8.1   "/entrypoint.sh /etc…"   39 minutes ago   Up 39 minutes   0.0.0.0:5000->5000/tcp   local-registry

What is the difference between cc-operator and local-registry? Is this expected? As in, do I still have to run the operator with docker run ... localhost:5000/cc-operator ..., or should it already be running after the script exits?

bpradipt commented 5 months ago

The cc-operator is the newly built operator image, which is pushed to the local Docker registry. The registry itself is the container you see with id ab56de994a83 (local-registry). The tests create the cc-operator pod on the K8s cluster using the cc-operator image from the local registry. You don't need to run it manually; run-local.sh takes care of it.
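To see the difference concretely, you can query both sides: the registry's image catalog (the /v2/_catalog endpoint is part of the standard Docker registry HTTP API served by the registry:2 container) and the pods on the cluster. The namespace below is the operator's usual one; adjust it if your deployment differs.

```shell
# The local-registry container is a plain image registry;
# listing its catalog should show the pushed cc-operator image
curl -s http://localhost:5000/v2/_catalog

# The operator itself runs as a pod on the cluster, created by
# run-local.sh, so look for it with kubectl rather than docker ps
kubectl get pods -n confidential-containers-system
```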