Closed: hassanbabaie closed this issue 1 year ago
FYI, when I try to follow the general build steps, I hit:
make CUDA=/usr/local/cuda
make[1]: Entering directory `/usr/bin/gdrcopy/tests'
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -c -o copybw.o copybw.cpp
copybw.cpp:30:10: fatal error: cuda.h: No such file or directory
#include <cuda.h>
^~~~~~~~
compilation terminated.
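The error above means the compiler could not find cuda.h under the include paths passed on the command line. A quick way to confirm this, as a minimal sketch: the helper name has_cuda_header is hypothetical, and the default prefix /usr/local/cuda is simply the value given to make above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if cuda.h exists under the given CUDA prefix.
# $1 is the CUDA install prefix; it defaults to /usr/local/cuda,
# the value passed as CUDA= in the make command above.
has_cuda_header() {
    [ -f "${1:-/usr/local/cuda}/include/cuda.h" ]
}

if has_cuda_header "${CUDA:-/usr/local/cuda}"; then
    echo "cuda.h found"
else
    echo "cuda.h missing: the CUDA headers are not under this prefix"
fi
```

If the header is missing, the host most likely has the NVIDIA driver but not the CUDA toolkit headers, which would explain why nvidia-smi works while the build fails.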
nvidia-smi works fine on the host. Extract:
Tue Sep 26 01:15:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:10:1C.0 Off | 0 |
| N/A 28C P0 62W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:10:1D.0 Off | 0 |
| N/A 27C P0 60W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
......
I just installed the following on the host:
yum install -y https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/rhel8/x64/gdrcopy-kmod-2.4-1dkms.el8.noarch.rpm
yum install -y https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/rhel8/x64/gdrcopy-devel-2.4-1.el8.noarch.rpm
yum install -y https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/rhel8/x64/gdrcopy-2.4-1.el8.x86_64.rpm
and then ran:
# gdrcopy_sanity
Total: 28, Passed: 28, Failed: 0, Waived: 0
Does that mean it looks good?
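One way to answer that mechanically is to read the Failed count out of the summary line. This is only a sketch: the helper name sanity_passed is hypothetical, and it assumes the summary keeps the exact format shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the summary line reports zero failures.
# $1 is a summary line in the format printed by gdrcopy_sanity above.
sanity_passed() {
    # Extract the number that follows "Failed: "; empty if the format differs.
    failed=$(printf '%s\n' "$1" | sed -n 's/.*Failed: \([0-9][0-9]*\).*/\1/p')
    [ -n "$failed" ] && [ "$failed" -eq 0 ]
}

summary='Total: 28, Passed: 28, Failed: 0, Waived: 0'
if sanity_passed "$summary"; then
    echo "no failed tests"
else
    echo "failures reported or summary not recognized"
fi
```

With the line from this thread, the check succeeds because Failed is 0.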
I don't see any files in
# ls -ls /dev/gdrdrv
0 crw-rw-rw- 1 root root 242, 0 Sep 26 01:26 /dev/gdrdrv
Hi @hassanbabaie,
If you want to install gdrdrv only, https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/rhel8/x64/gdrcopy-kmod-2.4-1dkms.el8.noarch.rpm is sufficient.
# ls -ls /dev/gdrdrv
0 crw-rw-rw- 1 root root 242, 0 Sep 26 01:26 /dev/gdrdrv
Doesn't it show /dev/gdrdrv here?
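In the listing above, the leading c in crw-rw-rw- marks /dev/gdrdrv as a character device, so it is a single device node rather than a directory containing files. The same check can be scripted; this is a sketch, and the helper name is_char_device is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given path is a character device node.
# $1 defaults to /dev/gdrdrv, the node shown in the listing above.
is_char_device() {
    [ -c "${1:-/dev/gdrdrv}" ]
}

if is_char_device /dev/gdrdrv; then
    echo "/dev/gdrdrv is present"
else
    echo "/dev/gdrdrv is missing"
fi
```

Running lsmod | grep gdrdrv on the host would additionally show whether the kernel module is loaded.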
Hi @pakmarkthub, thanks for the reply:
Yes, so before any install I get:
# ls -ls /dev/gdrdrv
ls: cannot access /dev/gdrdrv: No such file or directory
Then if I just install:
# yum install -y https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/rhel8/x64/gdrcopy-kmod-2.4-1dkms.el8.noarch.rpm
I get:
# ls -ls /dev/gdrdrv
0 crw-rw-rw- 1 root root 242, 0 Sep 26 13:30 /dev/gdrdrv
So that should mean we're good, and then I just need to mount /dev/gdrdrv as a volume in the pod spec, and the rest can be installed and run from within the container, right?
Thanks in advance
Yes, gdrdrv is now ready on your system. Now, you just need to mount /dev/gdrdrv. If you use docker, pass it with: docker run <other options> --device=/dev/gdrdrv:/dev/gdrdrv
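As a rough sketch of the Docker case, the snippet below builds the --device flag only when the node exists on the host, then prints the resulting command instead of running it. The helper name device_flag and the IMAGE placeholder are hypothetical; substitute your own image and options.

```shell
#!/bin/sh
# Hypothetical helper: print the docker --device flag for a device node,
# or fail if the node is not a character device on the host.
# $1 defaults to /dev/gdrdrv.
device_flag() {
    dev="${1:-/dev/gdrdrv}"
    [ -c "$dev" ] || return 1
    printf '%s\n' "--device=${dev}:${dev}"
}

# IMAGE is a placeholder, not a real image name.
if flag=$(device_flag /dev/gdrdrv); then
    echo "docker run --rm ${flag} IMAGE ls -l /dev/gdrdrv"
else
    echo "/dev/gdrdrv not found on the host; install the gdrcopy-kmod package first" >&2
fi
```

Printing the command keeps the sketch safe to run on a machine without Docker. Once the image name is filled in, the printed line can be executed as is.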
Thanks!
Hi there, it's not clear to me what the minimal steps are to install only the driver on a Kubernetes node that has only the NVIDIA driver installed.
I've been looking through the issues, and it would be really good if this could be broken out in the README. Please also let me know here.
Thanks in advance