dbugit opened this issue 2 years ago
@dbugit I see that you are using the ubi8 toolkit images; please change it to `1.7.2-centos7` and give it a try. Also, if you can try with containerd 1.5+, that would be good.
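For reference, switching the toolkit tag on an existing release might look like the sketch below, assuming the chart exposes it as `toolkit.version` (verify the key with `helm show values nvidia/gpu-operator`):

```bash
# Repoint the container-toolkit image at the centos7 tag on an existing
# Helm release, keeping all other values as-is.
# NOTE: toolkit.version is assumed from the chart's values; confirm it for
# your chart version before running.
helm upgrade gpu-test nvidia/gpu-operator --version 1.9.0 -n gpu-test \
  --reuse-values \
  --set toolkit.version=1.7.2-centos7
```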
Changing to the `1.7.2-centos7` image didn't improve the timing issues at all, but it did seem to clean driver components off the node better. The ubi8 image would often leave behind containers/processes and prevent other things from happening, like cleaning up iptables rules and unmounting volumes.
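As a rough way to spot those leftovers on the node after an uninstall (generic host-side commands, nothing operator-specific):

```bash
# Run on the GPU node after `helm uninstall` has settled
lsmod | grep -i nvidia              # driver kernel modules still loaded?
ps aux | grep -i "[n]vidia"         # driver/toolkit processes still running?
mount | grep -i nvidia              # driver root or toolkit mounts still present?
```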
But this leads me to a bigger question. As far as I can tell, the toolkit is the only image built specifically for CentOS 7. Other component images only support UBI 8 or Ubuntu, and still others are platform agnostic. Why the mix? Shouldn't all the images be built for and tested against the same platform?
1. Quick Debug Checklist

- `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- Did you apply the CRD? (`kubectl describe clusterpolicies --all-namespaces`)
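A quick way to check both items (note that `i2c_core` is built into the kernel on some distros, in which case `lsmod` won't list it):

```bash
# On the GPU node: are the required kernel modules loaded?
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Against the cluster: has the ClusterPolicy CRD/instance been applied?
kubectl describe clusterpolicies --all-namespaces
```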
1. Issue or feature description
The `gpu-operator` deploys and runs in our test cluster just fine, and the canned examples return the expected results. However, when uninstalling the operator, all of its related Pods remain in a `Terminating` state for 25-30 minutes before actually terminating, during which time `containerd` is inaccessible. Is this normal?

2. Steps to reproduce the issue
Given an `override.yaml` file as such:
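Purely as an illustrative sketch (not the actual file used in this report), a values override for the v1.9.0 chart could look like the following; the `operator.defaultRuntime` and `toolkit.version` keys are assumptions based on the chart's published values:

```bash
# Hypothetical example only -- not the override.yaml from this report.
# Key names assumed from the gpu-operator v1.9.x chart values.
cat > override.yaml <<'EOF'
operator:
  defaultRuntime: containerd   # the cluster runs containerd, per the report
toolkit:
  version: 1.7.2-centos7       # the tag suggested above, in place of the default ubi8 image
EOF
```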
deploy via Helm:
helm install gpu-test nvidia/gpu-operator --version 1.9.0 -n gpu-test
(note that the `gpu-test` namespace is created beforehand).

After verifying that the validators finish and all other Pods are in a `Running` state, I let the cluster sit for about 10 minutes and then remove the operator with the command `helm uninstall gpu-test -n gpu-test`. I verify the `Terminating` state with repeated calls to `kubectl get pods -n gpu-test`, sometimes via `watch` if I'm really feeling lazy.
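For instance, a hands-off way to keep an eye on the teardown (the interval is arbitrary):

```bash
# Re-run the pod listing every 5 seconds while the operator's pods terminate
watch -n 5 kubectl get pods -n gpu-test
```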
3. Information to attach (optional if deemed irrelevant)
While the cluster is running and before uninstalling the `gpu-operator`, I observe the following. Note that these logs were recorded during different runs at different times, so not everything is from the same test or in chronological order. Note also that `node007` is the one GPU node in the test cluster -- and yes, it has a license to kill pods on that node.

But then after issuing the `helm uninstall` command, during the `Terminating` state:

Also during the `Terminating` state, `kubelet` is logging a seemingly endless stream of these types of messages:

while `containerd` logs countless iterations of these lines:

Note that, during every iteration of the log messages, `containerd` always times out after loading the `io.containerd.runtime.v2.task` plugin.
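If it helps, the containerd side of this can be captured during that window with plain systemd/containerd tooling (nothing operator-specific assumed):

```bash
# Grab containerd's logs covering the Terminating window
journalctl -u containerd --since "40 min ago" > containerd.logs

# List containerd's plugins and their load status; while containerd is
# unresponsive this call may itself hang or time out
ctr plugins ls
```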
- `kubectl get pods -n gpu-test`
- `kubectl get ds -n gpu-test` (empty list)
- `kubectl describe pod -n NAMESPACE POD_NAME`
- `kubectl logs -n NAMESPACE POD_NAME`
- `ls -la /run/nvidia`
- `ls -la /usr/local/nvidia/toolkit`
- `ls -la /run/nvidia/driver`
- `journalctl -u kubelet > kubelet.logs` (see above)
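A rough helper for gathering all of the above in one pass; the `gpu-test` namespace is taken from this report, the output directory is just a convenience choice, and the host-side commands need to run on the GPU node itself:

```bash
#!/usr/bin/env bash
# Sketch: collect the diagnostics listed above into ./gpu-operator-debug/
set -euo pipefail
NS=gpu-test
OUT=gpu-operator-debug
mkdir -p "$OUT"

kubectl get pods -n "$NS" -o wide > "$OUT/pods.txt"
kubectl get ds -n "$NS"           > "$OUT/daemonsets.txt"

# Describe and dump logs for every pod in the namespace
for pod in $(kubectl get pods -n "$NS" -o name); do
  name=${pod#pod/}
  kubectl describe -n "$NS" "$pod" > "$OUT/describe-$name.txt"
  kubectl logs -n "$NS" "$pod" --all-containers=true > "$OUT/logs-$name.txt" || true
done

# Host-side state (run these on the GPU node)
ls -la /run/nvidia               > "$OUT/run-nvidia.txt"  || true
ls -la /usr/local/nvidia/toolkit > "$OUT/toolkit-dir.txt" || true
ls -la /run/nvidia/driver        > "$OUT/driver-dir.txt"  || true
journalctl -u kubelet            > "$OUT/kubelet.logs"
```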