confidential-containers / operator

Operator to deploy confidential containers runtime

Install/Uninstall ccruntime in a loop fails #340

Open ldoktor opened 7 months ago

ldoktor commented 7 months ago

Describe the bug
Running a ccruntime/operator install/uninstall in a loop leads to left-behind pods.

To Reproduce
Steps to reproduce the behavior:

  1. Prepare a VM:
kcli create vm -i ubuntu2204 -P memory=8G -P numcpus=4 -P disks=[50] e2e
kcli ssh e2e
sudo apt-get update -y
sudo apt-get install -y ansible python-is-python3
git clone --depth=1 https://github.com/confidential-containers/operator
cd operator/tests/e2e
export PATH="$PATH:/usr/local/bin"
ansible-playbook -i localhost, -c local --tags untagged ansible/main.yml
sudo -E PATH="$PATH" bash -c './cluster/up.sh'
export KUBECONFIG=/etc/kubernetes/admin.conf
  2. Perform install/uninstall in a loop:
export "PATH=$PATH:/usr/local/bin"
export KUBECONFIG=/etc/kubernetes/admin.conf

UP=0
TEST=0
DOWN=0

I=0
while :; do
    echo "---< START ITERATION $I: $(date) >--" | tee -a job.log; SECONDS=0
    sudo -E PATH="$PATH" timeout 25m bash -c './operator.sh' || { date; exit -1; }
    UP="$SECONDS"; SECONDS=0; echo "UP    $(date) ($UP)" | tee -a job.log
    sudo -E PATH="$PATH" timeout 25m bash -c ./tests_runner.sh -r kata-qemu || { date; exit -2; }
    TEST="$SECONDS"; SECONDS=0; echo "TESTS $(date) ($TEST)" | tee -a job.log
    sudo -E PATH="$PATH" timeout 25m bash -c './operator.sh uninstall' || { date; exit -3; }
    DOWN="$SECONDS"; SECONDS=0; echo "DOWN  $(date) ($TEST)" | tee -a job.log
    echo -e "---< END ITERATION $I: $(date) ($UP\t$TEST\t$DOWN)\t[$((UP+TEST+DOWN))] >---" | tee -a job.log
    ((I+=1))
done

Describe the results you expected
It should keep creating and deleting the operator with no left-behind resources.

Describe the results you received
After about 25 iterations the TEST phase took unusually long:

---< END ITERATION 14: Thu Jan 25 19:01:56 UTC 2024 (139        413     278)    [830] >---
---< END ITERATION 15: Thu Jan 25 19:14:45 UTC 2024 (169        322     278)    [769] >---
---< END ITERATION 16: Thu Jan 25 19:25:42 UTC 2024 (169        210     278)    [657] >---
---< END ITERATION 17: Thu Jan 25 19:38:00 UTC 2024 (138        322     278)    [738] >---
---< END ITERATION 18: Thu Jan 25 19:50:45 UTC 2024 (135        352     278)    [765] >---
---< END ITERATION 19: Thu Jan 25 20:03:03 UTC 2024 (138        322     278)    [738] >---
---< END ITERATION 20: Thu Jan 25 20:16:20 UTC 2024 (167        352     278)    [797] >---
---< END ITERATION 21: Thu Jan 25 20:26:45 UTC 2024 (136        211     278)    [625] >---
---< END ITERATION 22: Thu Jan 25 20:39:30 UTC 2024 (165        322     278)    [765] >---
---< END ITERATION 23: Thu Jan 25 20:52:16 UTC 2024 (166        322     278)    [766] >---
---< END ITERATION 24: Thu Jan 25 21:03:41 UTC 2024 (165        352     168)    [685] >---
---< END ITERATION 25: Thu Jan 25 21:14:05 UTC 2024 (136        210     278)    [624] >---
---< START ITERATION 26: Thu Jan 25 21:14:05 UTC 2024 >--
UP    Thu Jan 25 21:16:51 UTC 2024 (166)
TESTS Thu Jan 25 21:27:13 UTC 2024 (622)

and the following cleanup (the ./operator.sh uninstall) failed with:

ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
ERROR: there are labels left behind
{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","cc-preinstald64","kubernetes.io/hostname":"e2e","kubernetes.io/os":"linux","node-role.kuberntes.io/exclude-from-external-load-balancers":"","node.kubernetes.io/worker":""}

And the confidential-containers-system namespace contained:

# kubectl get all -n confidential-containers-system
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-chwwp   2/2     Running   0          10h
pod/cc-operator-pre-install-daemon-c4287             1/1     Running   0          10h

NAME                                                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.101.143.113   <none>        8443/TCP   10h

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                AGE
daemonset.apps/cc-operator-pre-install-daemon   1         1         1       1            1           node.kubernetes.io/worker=   10h

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           10h

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       10h

Basically the last steps were:

  1. It finished the operator_tests.bats
  2. Reported the status
  3. Started uninstalling the ccruntime by kubectl delete -k .
  4. Waited for up to 720s for cc-operator-daemon-install and cc-operator-pre-install-daemon pods to be gone
  5. Checked that kata is not in the runtime classes via ! kubectl get --no-headers runtimeclass 2>/dev/null | grep -q kata
  6. Checked the labels of the main node, found "cc-preinstall/done":"true" there, and interrupted further execution

What is odd is that steps 4 and 5 both held when I was checking things after the failure, so it looks like the pods got deleted, perhaps even the runtime classes got deleted, but later they got re-created by the daemonset. The question is why the daemonset was not removed, or whether there is another issue going on. Also, why did the test take 622s when it usually takes 210-420s?
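One way to confirm whether such lingering pods are still owned by a live daemonset (and would therefore be re-created after deletion) is to print their ownerReferences; a debugging sketch, assuming the namespace above:

# list each pod together with the controller that owns it
kubectl -n confidential-containers-system get pods \
  -o custom-columns='NAME:.metadata.name,OWNER-KIND:.metadata.ownerReferences[0].kind,OWNER-NAME:.metadata.ownerReferences[0].name'

If OWNER-NAME still points at cc-operator-pre-install-daemon, deleting the pods alone will not help.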

Additional context
This issue is a reproducer of a real CI issue from https://github.com/confidential-containers/operator/issues/339

ldoktor commented 7 months ago

I reproduced it again (under heavy load, on the 3rd iteration; the test phase took the usual time, though). I tried re-executing the kubectl delete, which resulted in:

# kubectl delete -k .
Error from server (NotFound): error when deleting ".": ccruntimes.confidentialcontainers.org "ccruntime-sample" not found

All the pods were still there, though. I had extra debug outputs, so the situation was:

  1. setup + tests ran
  2. just before the cleanup started the ns looked like this:
ccruntime: Starting the cleanup
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running   0          42s
pod/cc-operator-daemon-install-xvqcq                 1/1     Running   0          28s
pod/cc-operator-pre-install-daemon-c4m9d             1/1     Running   0          31s

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.107.199.43   <none>        8443/TCP   42s

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
daemonset.apps/cc-operator-daemon-install       1         1         1       1            1           node.kubernetes.io/worker=               28s
daemonset.apps/cc-operator-daemon-uninstall     0         0         0       0            0           katacontainers.io/kata-runtime=cleanup   31s
daemonset.apps/cc-operator-pre-install-daemon   1         1         1       1            1           node.kubernetes.io/worker=               31s

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           42s

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       42s

Then the kubectl delete -k . was issued, and the check ! sudo -E kubectl get pods -n confidential-containers-system | grep -q -e cc-operator-daemon-install -e cc-operator-pre-install-daemon was executed in a loop until it passed after 2 iterations (60s). Right after that the output of kubectl get all looked like this:

NAME                                                 READY   STATUS        RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running       0          103s
pod/cc-operator-daemon-install-xvqcq                 1/1     Terminating   0          89s
pod/cc-operator-pre-install-daemon-c4m9d             1/1     Running       0          92s

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.107.199.43   <none>        8443/TCP   103s

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                AGE
daemonset.apps/cc-operator-pre-install-daemon   1         1         1       1            1           node.kubernetes.io/worker=   92s

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           103s

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       103s

Which is odd, as it should not have escaped the loop. My assumption is that the kubectl command failed, so grep found no match and the negated check returned 0 even though the pods were still there... Anyway, nothing else happened even after a long period; the ccruntime-sample was not found and could not be deleted again.
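If that is what happened, the check could be hardened so that a kubectl failure is not mistaken for "the pods are gone"; a rough sketch of the idea, not the actual CI code:

# capture kubectl's exit status separately so an API error aborts
# instead of looking like an empty (i.e. passing) grep
pods=$(sudo -E kubectl get pods -n confidential-containers-system) || exit 1
if echo "$pods" | grep -q -e cc-operator-daemon-install -e cc-operator-pre-install-daemon; then
    echo "daemon pods still present"
else
    echo "daemon pods are gone"
fi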

To get unstuck, I tried re-installing it:

# kubectl apply -k .
ccruntime.confidentialcontainers.org/ccruntime-sample created
# kubectl -n confidential-containers-system get all
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running   0          40m
pod/cc-operator-daemon-install-w9ppp                 1/1     Running   0          102s
pod/cc-operator-pre-install-daemon-c4m9d             1/1     Running   0          40m

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.107.199.43   <none>        8443/TCP   40m

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
daemonset.apps/cc-operator-daemon-install       1         1         1       1            1           node.kubernetes.io/worker=               102s
daemonset.apps/cc-operator-daemon-uninstall     0         0         0       0            0           katacontainers.io/kata-runtime=cleanup   102s
daemonset.apps/cc-operator-pre-install-daemon   1         1         1       1            1           node.kubernetes.io/worker=               40m

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           40m

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       40m

Which, as you can see, succeeded. I tried to delete it again:

# kubectl delete -k .
ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
# kubectl -n confidential-containers-system get all
NAME                                                 READY   STATUS        RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running       0          44m
pod/cc-operator-daemon-install-w9ppp                 1/1     Terminating   0          5m43s

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.107.199.43   <none>        8443/TCP   44m

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           44m

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       44m

And after a few minutes I finally got the expected result:

# kubectl -n confidential-containers-system get all
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running   0          45m

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.107.199.43   <none>        8443/TCP   45m

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           45m

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       45m

So it looks like the ccruntime-sample got deleted but the deletion did not actually remove the deployments. The CI script then failed to detect the pods, proceeded further, and failed with an unexpected error. I'll fix the CI script not to proceed, but someone should take a look at the ccruntime-sample cleanup, as apparently from time to time it gets deleted without cleaning up its resources.
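As an extra guard, the cleanup check could also wait for the daemonsets themselves rather than only for the pods; a sketch, assuming the resource names shown above (kubectl wait --for=delete returns immediately if the object is already gone):

# fail loudly if the operator did not remove its daemonsets
kubectl -n confidential-containers-system wait --for=delete \
    daemonset/cc-operator-daemon-install \
    daemonset/cc-operator-pre-install-daemon \
    --timeout=720s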

ldoktor commented 7 months ago

@bpradipt would you have any idea why the config/samples/ccruntime/default sometimes leaves the daemonset.apps/cc-operator-pre-install-daemon behind? I believe it's a real issue and not a CI setup issue.
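For what it's worth, next time this reproduces it may be worth checking whether the CR is stuck with a finalizer while the delete is in flight; a debugging sketch, assuming the CR name from above:

# a non-empty deletionTimestamp plus a remaining finalizer would mean
# the operator never finished its cleanup handler
kubectl get ccruntimes.confidentialcontainers.org ccruntime-sample \
  -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'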