ldoktor opened 7 months ago
I reproduced it again (under a heavy load, on the 3rd iteration; the test phase took the usual time, though). I tried re-executing the `kubectl delete`, which resulted in:
```
# kubectl delete -k .
Error from server (NotFound): error when deleting ".": ccruntimes.confidentialcontainers.org "ccruntime-sample" not found
```
All the pods were, however, still there. I had extra debug outputs, so the situation was:
```
ccruntime: Starting the cleanup
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running   0          42s
pod/cc-operator-daemon-install-xvqcq                 1/1     Running   0          28s
pod/cc-operator-pre-install-daemon-c4m9d             1/1     Running   0          31s

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service    ClusterIP   10.107.199.43   <none>        8443/TCP   42s

NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
daemonset.apps/cc-operator-daemon-install       1         1         1       1            1           node.kubernetes.io/worker=               28s
daemonset.apps/cc-operator-daemon-uninstall     0         0         0       0            0           katacontainers.io/kata-runtime=cleanup   31s
daemonset.apps/cc-operator-pre-install-daemon   1         1         1       1            1           node.kubernetes.io/worker=               31s

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           42s

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       42s
```
Then the `kubectl delete -k .` was issued and the

```sh
! sudo -E kubectl get pods -n confidential-containers-system | grep -q -e cc-operator-daemon-install -e cc-operator-pre-install-daemon
```

command was executed in a loop until it passed, after 2 iterations (60s). Right after that the output of `oc get all` looked like this:
```
NAME                                                 READY   STATUS        RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running       0          103s
pod/cc-operator-daemon-install-xvqcq                 1/1     Terminating   0          89s
pod/cc-operator-pre-install-daemon-c4m9d             1/1     Running       0          92s

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service    ClusterIP   10.107.199.43   <none>        8443/TCP   103s

NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                AGE
daemonset.apps/cc-operator-pre-install-daemon    1         1         1       1            1           node.kubernetes.io/worker=   92s

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           103s

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       103s
```
This is odd, as the check should not have escaped from the loop. My assumption is that the `kubectl` command failed, so `grep` matched nothing and the negated check returned 0... Anyway, nothing else happened even after a long period; the `ccruntime-sample` was not found and could not be re-deleted.
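The suspected failure mode can be sketched as follows. This is only a hedged sketch, not the actual CI script: `wait_fragile` mirrors the quoted one-liner, while `wait_robust` and its retry policy are hypothetical. The point is that a `! kubectl … | grep -q …` pipeline passes whenever `kubectl` itself fails, because `grep` then sees empty input:

```shell
#!/usr/bin/env bash

# Fragile pattern: if kubectl itself fails (API server hiccup, kubeconfig
# problem), grep sees empty input and returns 1, and the leading "!"
# turns that into success; the loop exits even though the pods may
# still be running.
wait_fragile() {
    until ! kubectl get pods -n confidential-containers-system \
            | grep -q -e cc-operator-daemon-install -e cc-operator-pre-install-daemon; do
        sleep 30
    done
}

# More defensive sketch: capture the kubectl output first and retry when
# kubectl itself fails, so an API error is never mistaken for
# "no pods left behind".
wait_robust() {
    local out
    while true; do
        if ! out=$(kubectl get pods -n confidential-containers-system 2>&1); then
            echo "kubectl failed, retrying: $out" >&2
            sleep 30
            continue
        fi
        if ! grep -q -e cc-operator-daemon-install -e cc-operator-pre-install-daemon <<<"$out"; then
            return 0   # the operator pods really are gone
        fi
        sleep 30
    done
}
```

With the robust variant, a transient `kubectl` failure only delays the loop instead of silently passing the check.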
To get unstuck, I tried re-installing it:

```
# kubectl apply -k .
ccruntime.confidentialcontainers.org/ccruntime-sample created
```
```
# kubectl -n confidential-containers-system get all
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running   0          40m
pod/cc-operator-daemon-install-w9ppp                 1/1     Running   0          102s
pod/cc-operator-pre-install-daemon-c4m9d             1/1     Running   0          40m

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service    ClusterIP   10.107.199.43   <none>        8443/TCP   40m

NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
daemonset.apps/cc-operator-daemon-install       1         1         1       1            1           node.kubernetes.io/worker=               102s
daemonset.apps/cc-operator-daemon-uninstall     0         0         0       0            0           katacontainers.io/kata-runtime=cleanup   102s
daemonset.apps/cc-operator-pre-install-daemon   1         1         1       1            1           node.kubernetes.io/worker=               40m

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           40m

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       40m
```
This, as you can see, succeeded. I then tried to delete it again:
```
# kubectl delete -k .
ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
# kubectl -n confidential-containers-system get all
NAME                                                 READY   STATUS        RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running       0          44m
pod/cc-operator-daemon-install-w9ppp                 1/1     Terminating   0          5m43s

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service    ClusterIP   10.107.199.43   <none>        8443/TCP   44m

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           44m

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       44m
```
And after a few minutes I finally got the expected result:
```
# kubectl -n confidential-containers-system get all
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-v54gc   2/2     Running   0          45m

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service    ClusterIP   10.107.199.43   <none>        8443/TCP   45m

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           45m

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       45m
```
So it looks like the `ccruntime-sample` got deleted but did not actually clean up the deployments. The CI script then failed to detect the pods, proceeded further, and failed with an unexpected error. I'll fix the CI script not to proceed, but someone should take a look at the `ccruntime-sample` cleanup, as apparently from time to time it gets deleted without cleaning up its resources.
@bpradipt, would you have any idea why the `config/samples/ccruntime/default` sometimes leaves the `daemonset.apps/cc-operator-pre-install-daemon` behind? I believe it's a real issue and not a CI setup issue.
**Describe the bug**
Running a ccruntime/operator install/uninstall in a loop leads to left-behind pods.

**To Reproduce**
Steps to reproduce the behavior:

**Describe the results you expected**
It should keep creating and deleting the operator with no left-behind resources.

**Describe the results you received**
After about 25 iterations the TEST phase took unusually long:

and the following cleanup (`./operator.sh uninstall`) failed with:

And the confidential-containers-system namespace contained:
Basically the last steps were:

1. `operator_tests.bats`
2. uninstall of the `ccruntime` by `kubectl delete -k .`
3. waiting for the `cc-operator-daemon-install` and `cc-operator-pre-install-daemon` pods to be gone
4. waiting for `kata` to not be in the runtime classes: `! kubectl get --no-headers runtimeclass 2>/dev/null | grep -q kata`
5. checking for the `"cc-preinstall/done":"true"` label; it was still there and interrupted the further execution

What is odd is that steps 4 as well as 5 were true when I was checking things after the failure, so it looks like the pods got deleted, perhaps even the runtime classes got deleted, but later they got re-created by the daemonset. The question is why the daemonset was not removed, or whether there is another issue going on. Also, why did the test take 622s while it usually takes 210-420s?

**Additional context**
This issue is a reproducer of one real CI issue from https://github.com/confidential-containers/operator/issues/339