That's strange. The only reason I could see this happening is if we somehow set the PATH such that it is referencing the host binary mount, but the container's LD_LIBRARY_PATH. @elezar do you have any thoughts on why this might be happening?
The issue is that we're running the following:
updatePathListEnvvar("PATH", filepath.Dir(nvidiaSMIPath))
which attempts to add the directory containing nvidia-smi to the PATH. In the container this will be /driver-root/usr/bin, and as such when we run:
mountExecutable, err := exec.LookPath("mount")
if err != nil {
	return fmt.Errorf("error finding 'mount' executable: %w", err)
}
we find /driver-root/usr/bin/mount, which is the executable from the host and not the one from the container.
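To make the lookup problem concrete, here is a minimal, self-contained sketch of the behaviour described above (the paths in the comments are illustrative; the Setenv call only mimics what the plugin's PATH update ends up doing, it is not the driver's actual code):

// Hypothetical illustration: once /driver-root/usr/bin is prepended to PATH,
// exec.LookPath resolves "mount" to the host binary mounted under /driver-root.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	before, _ := exec.LookPath("mount") // typically /usr/bin/mount inside the container

	// Roughly what updating PATH with the nvidia-smi directory ends up doing
	// when nvidia-smi lives under the driver-root mount.
	os.Setenv("PATH", "/driver-root/usr/bin:"+os.Getenv("PATH"))

	// On a node with the driver-root mount this now resolves to
	// /driver-root/usr/bin/mount, i.e. the host binary.
	after, _ := exec.LookPath("mount")
	fmt.Println(before, after)
}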
Yeah, that would do it.
Do we need to set these envvars in the plugin itself, or can they be passed to the ENV of the exec.Command call when we invoke nvidia-smi?
We shouldn't need to set it for the plugin and can pass this to exec instead.
Note that for the compute mode we can also use the NVML API directly.
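For reference, a minimal sketch of what scoping the PATH to the exec.Command call could look like; the runNvidiaSMI helper is hypothetical and only illustrates the idea, it is not the driver's actual code:

// Hypothetical helper: prepend the nvidia-smi directory to PATH only for this
// one child process via cmd.Env, leaving the plugin's own environment (and any
// later exec.LookPath calls) untouched.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func runNvidiaSMI(nvidiaSMIPath string, args ...string) ([]byte, error) {
	cmd := exec.Command(nvidiaSMIPath, args...)
	// Duplicate keys in cmd.Env resolve to the last entry, so appending PATH
	// here overrides the inherited value for the child process only.
	cmd.Env = append(os.Environ(),
		"PATH="+filepath.Dir(nvidiaSMIPath)+":"+os.Getenv("PATH"))
	return cmd.CombinedOutput()
}

func main() {
	out, err := runNvidiaSMI("/driver-root/usr/bin/nvidia-smi", "-L")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Printf("%s", out)
}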
Thanks for clarifying 👍
Does the nvidia-smi compute-policy correspond to ComputeMode in NVML?
Yes. We are in the process of getting the NVML team to update things so that we can set a compute mode on a MIG device as well.
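If it helps, a minimal sketch of setting the compute mode through the go-nvml bindings (github.com/NVIDIA/go-nvml); the device index and mode are placeholders and this is illustrative only, not the plugin's actual code:

// Hypothetical sketch: set the compute mode directly via NVML instead of
// shelling out to nvidia-smi.
package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to get device 0: %v", nvml.ErrorString(ret))
	}

	// EXCLUSIVE_PROCESS is just an example value here.
	if ret := device.SetComputeMode(nvml.COMPUTEMODE_EXCLUSIVE_PROCESS); ret != nvml.SUCCESS {
		log.Fatalf("failed to set compute mode: %v", nvml.ErrorString(ret))
	}
}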
Thanks for the fast reply. BTW, for Question 2 (Pod Deletion Stuck in Terminating State), was it resolved by #109? @klueska
I've hit some cases where, after deleting the MPS GPU pod, I always have to restart the kubelet on the GPU worker node (systemctl restart kubelet).
Description
During testing of the MPS-related Quickstart (using the demo script to create a kind cluster), I encountered several issues concerning the deployment of the MPS control daemon and pod deletion processes.
Issues Encountered
The nvidia-k8s-dra-driver-kubelet-plugin daemonset indicated the following errors:

However, attempts to directly execute mount on the host node (with docker exec -it k8s-dra-driver-cluster-worker bash) succeeded. Modifying the Docker image BASE_DIST from Ubuntu 20.04 to Ubuntu 22.04 (thereby updating GLIBC to version 2.37) resolved the issues with libselinux and libc, but not with libmount (the version mismatch continued with MOUNT_2_38 not found). Eventually, manually mounting /lib/x86_64-linux-gnu/libmount.so.1 from the host node, so that the 2_38 version is used, resolved the issue, allowing the MPS daemon and example pods to deploy correctly.

Pods deployed with kubectl apply often remain stuck in a Terminating state when attempting deletion with kubectl delete. Forcing the deletion (--force) seems to resolve this temporarily, but any subsequent kubectl apply results in the MPS daemon deployment failing to deploy correctly.

Questions/Requests
1. Validation of Behavior: Is the described behavior (modifying the Dockerfile and using hostPath volumeMounts) expected, or could there be a misconfiguration or bug causing these issues? If it is an issue, I would appreciate guidance on how to proceed with a fix.
2. Pod Deletion Stuck in Terminating State: Is this a known issue? Are there any recommended solutions to avoid pods getting stuck in the Terminating state without using --force?

Thank you for your attention to these issues. I look forward to your insights and recommendations on these matters.