@CoderTH with the changes that you are testing, the equivalent command is no longer `chroot /driver-root nvidia-smi`. Instead, we update `PATH` to include the location of `nvidia-smi` in the driver root and set `LD_LIBRARY_PATH` to include the path to `libnvidia-ml.so.1`, which is required by `nvidia-smi`. It could be that this configuration is causing a different library to be loaded when running `nvidia-smi` to set the compute mode.

Could you check the behaviour when:

- Exec-ing into the container
- Setting `PATH` to include the path (in `/driver-root`) to `nvidia-smi` at the start
- Setting `LD_LIBRARY_PATH` to include the path (in `/driver-root`) at the start

And then running the relevant `nvidia-smi` commands.

My assumption is that some library in `LD_LIBRARY_PATH` is conflicting with a library in the container.
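For concreteness, a minimal sketch of the check being requested; the pod name and the exact subdirectories under `/driver-root` are placeholders, not taken from this thread:

```console
# Exec into the plugin container (pod name is a placeholder):
kubectl exec -it <kubelet-plugin-pod> -- bash

# Prepend the driver-root locations; the exact paths under /driver-root
# are assumptions and may differ on your system:
export PATH=/driver-root/usr/bin:$PATH
export LD_LIBRARY_PATH=/driver-root/usr/lib64:$LD_LIBRARY_PATH

# Then run the relevant command, e.g. setting the compute mode:
nvidia-smi -c EXCLUSIVE_PROCESS
```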
Yes, with the method you described, I successfully reproduced the error in the logs.
Thanks for the confirmation.

Could you install `strace` in the container and run `strace nvidia-smi`? This should give us a list of the libraries that are being loaded here.
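One possible invocation (the syscall filter is an assumption about what is most useful here) that narrows the trace down to library loads:

```console
# Trace only the open calls so the loaded .so files stand out:
strace -f -e trace=open,openat nvidia-smi 2>&1 | grep '\.so'
```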
When I installed `strace` and added the `/driver-root` paths to `PATH` and `LD_LIBRARY_PATH`, `strace` didn't work, failing with the following error. When I removed `/driver-root`, it worked, but then `nvidia-smi` couldn't find the `libnvidia-ml.so` library.
This means that `libc.so.6` is being resolved from the host in the container, which may be causing this problem. Since this plugin is running in an Ubuntu container, it would make sense to prepend the location of `libc.so.6` to `LD_LIBRARY_PATH` to ensure that this is resolved in the container instead.
```console
$ dirname $(find / | grep libc.so)
/lib/x86_64-linux-gnu
```
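Given that location, the prepend suggested above would look something like the following (whether this path alone is sufficient is an assumption):

```console
# Ensure the container's libc is found before the host copy under /driver-root:
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```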
There may be other libraries that are also being aliased. What are the contents of `/driver-root/lib64` in your case?
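One hedged way to spot other aliased libraries is to intersect the two directory listings (the container library path is assumed from the `libc.so` location found above):

```console
# Libraries present both in the driver root and in the container's own
# library directory are candidates for the same kind of conflict:
comm -12 <(ls /driver-root/lib64 | sort) <(ls /lib/x86_64-linux-gnu | sort)
```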
See https://gitlab.com/nvidia/cloud-native/k8s-dra-driver/-/merge_requests/26 for a proposed workaround / solution for this.
@CoderTH just for completeness, could you run `ldd nvidia-smi` when having an incorrect path set up?
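For reference, `ldd` prints where each dependency would resolve from; something like the following (the `grep` filter is only an assumption about which entries matter here):

```console
# Show where nvidia-smi's dependencies resolve with the broken path in place:
ldd $(which nvidia-smi) | grep -E 'libc\.so|libnvidia-ml'
```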
@CoderTH this should have been addressed in https://github.com/NVIDIA/k8s-dra-driver/commit/c53420ebee2eee68d56699f0e59391b90d026b3c. Could you confirm that this is the case?
I am very sorry for the late reply; some other things have occupied my time recently. I re-tested with your latest commit today and the segmentation fault is no longer reported, but a new error has appeared. Does this error mean that my Tesla P4 does not support time-sliced GPU sharing?
When I `kubectl apply -f demo1.yaml`, the pod stays stuck in the creating state, and the kubelet-plugin reports the following error. It seems to be an `nvidia-smi` error?
I tried running `nvidia-smi` manually inside the container and it was fine.
I guessed that this error was related to the following code, so I manually ran the relevant commands, and they worked normally.
The OS version I'm using is CentOS 7.9, with a Tesla P4 GPU.