NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

driver-kubelet-plugin : error running nvidia-smi: signal: segmentation fault (core dumped) #7

Closed. CoderTH closed this issue 9 months ago.

CoderTH commented 10 months ago

When I kubectl apply -f demo1.yaml, the pod stays stuck in the creating state and the kubelet-plugin reports the following error. It seems to be an nvidia-smi error?

[screenshot: kubelet-plugin error log]

I tried running nvidia-smi manually inside the container and it worked fine.

[screenshot]

I guessed that this error was related to the following code, so I manually ran the equivalent commands, and they worked normally (a rough sketch of those commands is included after the snippet below).

func (t *TimeSlicingManager) SetTimeSlice(devices *PreparedDevices, config *nascrd.TimeSlicingConfig) error {
    if devices.Mig != nil {
        return fmt.Errorf("setting a TimeSlice duration on MIG devices is unsupported")
    }

    timeSlice := nascrd.DefaultTimeSlice
    if config != nil && config.TimeSlice != nil {
        timeSlice = *config.TimeSlice
    }

    err := t.nvdevlib.setComputeMode(devices.UUIDs(), "DEFAULT")
    if err != nil {
        return fmt.Errorf("error setting compute mode: %w", err)
    }

    err = t.nvdevlib.setTimeSlice(devices.UUIDs(), timeSlice.Int())
    if err != nil {
        return fmt.Errorf("error setting time slice: %w", err)
    }

    return nil
}

[screenshot]
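The relevant nvidia-smi operations boil down to something like the following sketch; the exact flags used by nvdevlib may differ, and the compute-policy subcommand is only present in reasonably recent driver versions:

# set the compute mode back to DEFAULT (0 = DEFAULT); -i also accepts a GPU UUID
nvidia-smi -i 0 -c 0

# set the SM time slice via the compute-policy subcommand
# (0 requests the default time-slice duration)
nvidia-smi compute-policy -i 0 --set-timeslice=0

Both commands typically need to be run as root.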

The operating system I use is CentOS 7.9, and the GPU is a Tesla P4.

elezar commented 10 months ago

@CoderTH with the changes that you are testing the equivalent command is no longer chroot /driver-root nvidia-smi.

Instead, we update the PATH to include the location of nvidia-smi in the driver root and set LD_LIBRARY_PATH to include the path to libnvidia-ml.so.1 which is required by nvidia-smi. It could be that this configuration is causing a different library to be loaded when running nvidia-smi to set the compute mode.

Could you check the behaviour when:

  1. Exec-ing into the container
  2. Setting PATH to include the path (in /driver-root) to nvidia-smi at the start
  3. Setting LD_LIBRARY_PATH to include the path (in /driver-root) at the start

And then running the relevant nvidia-smi commands.

My assumption is that some library in LD_LIBRARY_PATH is conflicting with a library in the container.
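A minimal sketch of those three steps, assuming the driver root is mounted at /driver-root, with nvidia-smi under usr/bin and the driver libraries under usr/lib64 (adjust the paths, namespace, and pod/container names, which are placeholders here, to your deployment):

# 1. exec into the kubelet-plugin container
kubectl exec -it -n <namespace> <kubelet-plugin-pod> -c <plugin-container> -- bash

# 2. prepend the driver-root binary location to PATH
export PATH=/driver-root/usr/bin:$PATH

# 3. prepend the driver-root library location to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/driver-root/usr/lib64:$LD_LIBRARY_PATH

# then run the relevant nvidia-smi commands, e.g.
nvidia-smi -L
nvidia-smi -q -d COMPUTE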

CoderTH commented 10 months ago


Yes, following the method you described, I successfully reproduced the error from the logs:

[screenshot]

elezar commented 10 months ago

Thanks for the confirmation.

Could you install strace in the container and run strace nvidia-smi? This should give us a list of the libraries that are being loaded here.
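If strace is noisy, something like this sketch (assuming an Ubuntu-based plugin image with apt available) narrows the output down to the shared libraries being opened:

apt-get update && apt-get install -y strace
# trace file-open syscalls and keep only shared-library paths
strace -f -e trace=openat nvidia-smi 2>&1 | grep '\.so'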

CoderTH commented 10 months ago

After installing strace and adding the /driver-root paths to PATH and LD_LIBRARY_PATH, strace itself failed with the following error:

[screenshot]

When I removed the /driver-root entries, strace worked, but nvidia-smi could no longer find the libnvidia-ml.so library.

elezar commented 10 months ago

This means that libc.so.6 is being resolved from the host inside the container, which may be causing this problem. Since this plugin runs in an Ubuntu container, it would make sense to prepend the location of the container's libc.so.6 to LD_LIBRARY_PATH to ensure that it is resolved in the container instead.

dirname $(find / | grep libc.so)
/lib/x86_64-linux-gnu

There may be other libraries that are also being aliased. What are the contents of /driver-root/lib64 in your case?
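One way to test that hypothesis inside the container (a sketch; /driver-root/usr/lib64 stands in for wherever libnvidia-ml.so.1 actually lives under the driver root):

# resolve the container's own libc (and friends) ahead of the host driver root
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:/driver-root/usr/lib64
nvidia-smi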

elezar commented 10 months ago

See https://gitlab.com/nvidia/cloud-native/k8s-dra-driver/-/merge_requests/26 with a proposed workaround / solution for this.

elezar commented 10 months ago

@CoderTH just for completeness, could you run ldd nvidia-smi with the incorrect paths set up?
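For example (a sketch), with the problematic PATH and LD_LIBRARY_PATH still exported:

# show which libc and libnvidia-ml the dynamic loader would pick for nvidia-smi
ldd $(which nvidia-smi) | grep -E 'libc\.so|libnvidia-ml'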

elezar commented 10 months ago

@CoderTH this should have been addressed in https://github.com/NVIDIA/k8s-dra-driver/commit/c53420ebee2eee68d56699f0e59391b90d026b3c. Could you confirm that this is the case?

CoderTH commented 10 months ago


I am very sorry for the late reply; some other things have occupied my time recently. I retested today with your latest commit: the segmentation fault no longer occurs, but a new error appears. Does this error mean that my Tesla P4 does not support the time-slicing mode of GPU sharing?

[screenshot]
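If it helps narrow this down, one thing to try (a sketch; it assumes the host nvidia-smi is new enough to have the compute-policy subcommand) is running the underlying command directly on the node, outside any container, to separate a GPU or driver limitation from a container-environment problem:

# as root on the node; -i 0 selects the first GPU, 0 requests the default time-slice duration
nvidia-smi compute-policy -i 0 --set-timeslice=0

If this fails on the host as well, the new error points to a GPU/driver limitation rather than to the plugin environment.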