Open zlianzhuang opened 2 months ago
@zlianzhuang After a node reboot, the GPU stack takes around 3-5 minutes to become ready (driver installation, container-toolkit setup, etc.), so these errors are expected until the stack is ready. Using pre-compiled drivers will minimize this delay, but that feature is not yet GA. Please also make sure that driver images are available for the kernel you are using.
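For illustration only, here is a minimal sketch of a startup check a workload could run before it touches the GPU. It polls PATH for nvidia-smi for up to five minutes (the upper end of the window mentioned above) and then gives up. The function name wait_for_executable and its parameters are hypothetical and are not part of the GPU Operator or any NVIDIA tooling. It is also an assumption that the binary can appear inside an already-created container; in this report the pod had to be replaced, so a real setup may need to exit non-zero on failure and let the container be recreated.

```python
import shutil
import time
from typing import Callable, Optional


def wait_for_executable(
    name: str = "nvidia-smi",
    timeout_s: float = 300.0,   # assumption: upper end of the 3-5 minute window
    interval_s: float = 5.0,    # assumption: arbitrary polling interval
    which: Callable[[str], Optional[str]] = shutil.which,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll PATH until `name` is found; return its path, or None on timeout."""
    deadline = clock() + timeout_s
    while True:
        path = which(name)
        if path is not None:
            return path          # executable is visible on PATH
        if clock() >= deadline:
            return None          # gave up: GPU stack did not become visible
        sleep(interval_s)
```

A caller would typically treat a None result as a failure and exit with a non-zero status, so that the pod's restart policy decides what happens next.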
1. Quick Debug Information
2. Issue or feature description
After the node reboots and the pod starts, nvidia-smi cannot be run. The error is: "nvidia-smi": executable file not found in $PATH: unknown
3. Steps to reproduce the issue
a) Create an NVIDIA pod with a nodeSelector that targets node x. b) Reboot node x. c) The pod starts, but the nvidia-smi executable is not found.
4. Information to [attach]
After replacing the pod with kubectl replace, nvidia-smi can be executed.