NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Problems with MPS quickstart #106

Closed anencore94 closed 2 months ago

anencore94 commented 2 months ago

Description

While testing the MPS-related quickstart (using the demo script to create a kind cluster), I encountered several issues with the deployment of the MPS control daemon and with pod deletion.

Issues Encountered

  1. MPS Control Daemon Deployment Failure: The MPS control daemon deployment failed to come up. The logs from the nvidia-k8s-dra-driver-kubelet-plugin daemonset showed the following errors:
    Defaulted container "plugin" out of: plugin, init (init)
    I0503 03:44:08.333148       1 device_state.go:146] using devRoot=/driver-root
    I0503 03:44:08.341885       1 nonblockinggrpcserver.go:105] "GRPC server started" logger="dra"
    I0503 03:44:08.341960       1 nonblockinggrpcserver.go:105] "GRPC server started" logger="registrar"
    I0503 03:44:17.213001       1 driver.go:104] NodePrepareResource is called: number of claims: 1
    I0503 03:44:17.219672       1 sharing.go:183] Starting MPS control daemon for 'af3fbcca-a63a-4a62-8393-bf663267b4dc', with settings: &{DefaultActiveThreadPercentage:0xc0006ae510 DefaultPinnedDeviceMemoryLimit:10Gi DefaultPerDevicePinnedMemoryLimit:map[]}
    E0503 03:44:17.227691       1 mount_linux.go:230] Mount failed: exit status 1
    Mounting command: mount
    Mounting arguments: -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=65536k shm /var/lib/kubelet/plugins/gpu.resource.nvidia.com/mps/af3fbcca-a63a-4a62-8393-bf663267b4dc/shm
    Output: mount: /lib/x86_64-linux-gnu/libselinux.so.1: no version information available (required by mount)
    mount: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by mount)
    mount: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by mount)
    mount: /lib/x86_64-linux-gnu/libmount.so.1: version `MOUNT_2_38'

However, directly executing the same mount command on the host node (via docker exec -it k8s-dra-driver-cluster-worker bash) succeeded.

Modifying the Docker image BASE_DIST from Ubuntu 20.04 to Ubuntu 22.04 (thereby updating GLIBC to version 2.37) resolved the issues with libselinux and libc, but not with libmount (the version mismatch persisted, with MOUNT_2_38 not found). Eventually, manually mounting /lib/x86_64-linux-gnu/libmount.so.1 from the host node, so that the 2_38 version is used, resolved the issue and allowed the MPS control daemon and example pods to deploy correctly.

  2. Pod Deletion Issue: Pods created with kubectl apply often remain stuck in the Terminating state when deleted with kubectl delete. Forcing the deletion (--force) resolves this temporarily, but any subsequent kubectl apply results in the MPS control daemon deployment failing to come up correctly.

Questions/Requests

  1. Validation of Behavior: Is the described behavior (modifying the Dockerfile and using hostPath volume mounts) expected, or could a misconfiguration or bug be causing these issues? If it is a bug, I would appreciate guidance on how to proceed with a fix.

  2. Pod Deletion Stuck in Terminating State: Is this a known issue? Are there any recommended ways to avoid pods getting stuck in the Terminating state without using --force?

Thank you for your attention to these issues. I look forward to your insights and recommendations on these matters.

klueska commented 2 months ago

That's strange. The only reason I could see this happening is if we somehow set the PATH such that it references the host's mount binary while LD_LIBRARY_PATH still points at the container's libraries. @elezar do you have any thoughts on why this might be happening?

elezar commented 2 months ago

The issue is that we're running the following:

    updatePathListEnvvar("PATH", filepath.Dir(nvidiaSMIPath))

which attempts to add nvidia-smi to the PATH. In the container this directory is /driver-root/usr/bin, so when we run:

    mountExecutable, err := exec.LookPath("mount")
    if err != nil {
        return fmt.Errorf("error finding 'mount' executable: %w", err)
    }

we find /driver-root/usr/bin/mount, which is the host's executable rather than the container's.
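
To illustrate (a standalone sketch, not the driver's actual code): exec.LookPath scans PATH from left to right and returns the first executable match, so once the driver-root bin directory ends up ahead of the container's own bin directories, the host's mount is selected:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    func main() {
        // Assumed layout from the logs above: the host driver root is mounted
        // into the container at /driver-root, so its bin directory is
        // /driver-root/usr/bin. Prepending it here mimics the effect described
        // above, where the lookup ends up preferring the driver-root directory.
        os.Setenv("PATH", "/driver-root/usr/bin:"+os.Getenv("PATH"))

        // exec.LookPath walks PATH in order and returns the first executable
        // named "mount" that it finds.
        mountExecutable, err := exec.LookPath("mount")
        if err != nil {
            fmt.Fprintln(os.Stderr, "error finding 'mount' executable:", err)
            os.Exit(1)
        }

        // On an affected node this prints /driver-root/usr/bin/mount, i.e. the
        // host's binary, which then fails against the container's glibc.
        fmt.Println(mountExecutable)
    }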

klueska commented 2 months ago

Yeah, that would do it.

klueska commented 2 months ago

Do we need to set these envvars in the plugin itself, or can they be passed via the environment of the exec.Command call when we invoke nvidia-smi?

elezar commented 2 months ago

We shouldn't need to set it for the plugin and can pass this to exec instead.
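
For example, something along these lines (a rough sketch, not the actual plugin code; the helper name and the /driver-root paths are placeholders) would scope the driver-root paths to the single nvidia-smi invocation via exec.Cmd.Env, so exec.LookPath("mount") elsewhere in the plugin keeps resolving to the container's binary:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "path/filepath"
        "strings"
    )

    // runNvidiaSMI is a hypothetical helper: it copies the current environment,
    // overrides PATH and LD_LIBRARY_PATH only for the child process, and leaves
    // the plugin's own environment untouched.
    func runNvidiaSMI(nvidiaSMIPath string, args ...string) ([]byte, error) {
        cmd := exec.Command(nvidiaSMIPath, args...)

        env := []string{}
        for _, kv := range os.Environ() {
            if strings.HasPrefix(kv, "PATH=") || strings.HasPrefix(kv, "LD_LIBRARY_PATH=") {
                continue
            }
            env = append(env, kv)
        }
        env = append(env,
            "PATH="+filepath.Dir(nvidiaSMIPath)+string(os.PathListSeparator)+os.Getenv("PATH"),
            // Placeholder library path; the real driver-root layout may differ.
            "LD_LIBRARY_PATH=/driver-root/usr/lib/x86_64-linux-gnu",
        )
        cmd.Env = env

        return cmd.CombinedOutput()
    }

    func main() {
        // Placeholder path for illustration; the plugin discovers nvidiaSMIPath itself.
        out, err := runNvidiaSMI("/driver-root/usr/bin/nvidia-smi", "-L")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            return
        }
        fmt.Print(string(out))
    }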

Note that for the compute mode we can also use the NVML api directly.
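
For reference, a minimal sketch (assuming the github.com/NVIDIA/go-nvml bindings; not necessarily how the driver will end up doing it) of setting the compute mode through NVML instead of shelling out to nvidia-smi:

    package main

    import (
        "log"

        "github.com/NVIDIA/go-nvml/pkg/nvml"
    )

    func main() {
        if ret := nvml.Init(); ret != nvml.SUCCESS {
            log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
        }
        defer nvml.Shutdown()

        // Device index 0 is used here purely for illustration.
        device, ret := nvml.DeviceGetHandleByIndex(0)
        if ret != nvml.SUCCESS {
            log.Fatalf("failed to get device handle: %v", nvml.ErrorString(ret))
        }

        // The NVML counterpart of setting the compute mode with nvidia-smi;
        // other values include COMPUTEMODE_DEFAULT and COMPUTEMODE_PROHIBITED.
        if ret := device.SetComputeMode(nvml.COMPUTEMODE_EXCLUSIVE_PROCESS); ret != nvml.SUCCESS {
            log.Fatalf("failed to set compute mode: %v", nvml.ErrorString(ret))
        }
    }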

anencore94 commented 2 months ago

Thanks for clarifying 👍 Does the nvidia-smi compute-policy correspond to ComputeMode in NVML?

klueska commented 2 months ago

Yes. We are in the process of getting the NVML team to update things so that we can set a compute mode on a MIG device as well.

anencore94 commented 2 months ago

Thanks for the fast reply. BTW, regarding Question 2 (Pod Deletion Stuck in Terminating State), is that resolved by #109? @klueska In some cases, after deleting the MPS GPU pod, I always have to restart the kubelet on the GPU worker node (systemctl restart kubelet).