wjentner opened this issue 2 years ago
@wjentner Please find details about how driver pod restarts/upgrades are handled here. For drain failures, you can tweak these parameters to see if that works for you; otherwise a node reboot is required.
driver:
  manager:
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
Also, the error
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused"
seen soon after driver install is expected, as the "container-toolkit" pod will reload containerd once the driver is ready. Do you see issues starting GPU pods after you see this error?
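As a rough check that the reload went through, you can confirm on the node that containerd is back up and has the nvidia runtime registered; a sketch, assuming the default containerd config path (the operator can be configured to write a different one):
# on the node, after the container-toolkit pod has run
systemctl status containerd --no-pager
grep -A3 'runtimes.nvidia' /etc/containerd/config.toml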
Thank you for the explanation @shivamerla, I adjusted the params.
Regarding the second problem:
We are unsure what the cause is. After some time (e.g. 8 days of uptime on the node), new pods no longer get the drivers injected. CUDA does not work and nvidia-smi is reported as command not found.
I created a test DaemonSet that schedules a pod on our GPU nodes which continuously executes nvidia-smi. In the latest outage I could observe something strange with nvidia-smi. Note that we did not upgrade the drivers nor do anything else on the node. We do not know what caused this, but it has happened before.
To better see any new failures, I also created a CronJob, in addition to the DaemonSet, that starts a Pod on the GPU nodes and executes nvidia-smi.
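For reference, a minimal sketch of such a CronJob; the name, schedule, and CUDA image below are assumptions, not the exact manifest we use:
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nvidia-smi-check
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: nvidia-smi
              image: nvidia/cuda:11.4.2-base-ubuntu20.04  # any CUDA base image works
              command: ["nvidia-smi"]
              resources:
                limits:
                  nvidia.com/gpu: 1  # requesting a GPU keeps the pod on GPU nodes
EOF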
As of now, the uptime of the node is only 2.5 days and everything functions normally.
In our experience, the problems start occurring after 10 to 15 days of uptime.
Thanks @wjentner. When this happens, can you check whether the mount and all files under /run/nvidia/driver are intact on the node? We should also debug from container-toolkit to understand why device/file injection is failing. Can you add the following debug entries in /usr/local/nvidia/toolkit/nvidia-container-runtime/.config.toml as mentioned here?
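For the first check, a sketch of what to run on the affected node, assuming the default /run/nvidia/driver mount used by the driver container:
# verify the driver container rootfs is still mounted and intact
mountpoint /run/nvidia/driver
ls -la /run/nvidia/driver/usr/bin/nvidia-smi
# running nvidia-smi from inside that rootfs should list the GPUs if the install is intact
chroot /run/nvidia/driver nvidia-smi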
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
[ ] Are i2c_core and ipmi_msghandler loaded on the nodes?
[ ] Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
1. Issue or feature description
The driver pod fails to drain the node:
After manually draining the node, the following error appears:
Note that while the driver pod errors, containerd functions normally and new pods (not using any GPU) can be successfully deployed on the node.
2. Steps to reproduce the issue
Unknown. After the node has been running for a longer time, the CUDA drivers can no longer be injected into pods. So far, the only workaround for this problem is to restart the node; afterward, the driver can be installed successfully.
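A sketch of that workaround, with the node name as a placeholder (drain flags vary slightly between kubectl versions):
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# reboot the node (e.g. via ssh or the cloud provider), then bring it back into scheduling:
kubectl uncordon <node-name>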
3. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods --all-namespaces
[ ] kubernetes daemonset status:
kubectl get ds --all-namespaces
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
[ ] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
[ ] Output of running a container on the GPU machine:
docker run -it alpine echo foo
[ ] Docker configuration file:
cat /etc/docker/daemon.json
[ ] Docker runtime configuration:
docker info | grep runtime
[ ] NVIDIA shared directory:
ls -la /run/nvidia
[ ] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
[ ] NVIDIA driver directory:
ls -la /run/nvidia/driver
[ ] kubelet logs
journalctl -u kubelet > kubelet.logs