Looks like it is related to the cuda-compat package not supporting 550 drivers.
This PR is closely related and tried to solve a similar version-mismatch issue, but ended up adding a note:
## NOTE: if `nccl-tests` or `/opt/gdrcopy/bin/sanity -v` crashes with incompatible version, ensure
## that the cuda-compat-xx-x package is the latest.
The sanity check indeed fails with the new 550 driver on the host and CUDA 12.2 in the container.
I'm not quite sure whether we need the cuda-compat package at all in this case, where the host driver already supports a newer CUDA version than the one in the container.
@mhuguesaws @verdimrc I think the best way forward is to have this PR. I don't know if that's feasible. Any ideas?
Another alternative is to remove /usr/local/cuda/compat from LD_LIBRARY_PATH and LIBRARY_PATH.
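A minimal sketch of that alternative, assuming the compat directory appears as a plain colon-separated entry in both variables (the exact entries in the base image may differ):

# Drop the CUDA forward-compat directory from the library search paths
export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -v '/usr/local/cuda/compat' | paste -sd: -)
export LIBRARY_PATH=$(echo "$LIBRARY_PATH" | tr ':' '\n' | grep -v '/usr/local/cuda/compat' | paste -sd: -)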
Thank you for your time.
Can you provide the command line you use to start the container?
I just found out I need to use docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/gdrdrv -it --rm <IMAGE-NAME> /bin/bash
With the updated CUDA, /opt/gdrcopy/bin/sanity -v passes.
Also, if upgrading CUDA is not possible, installing cuda-compat-12-4 (apt install cuda-compat-12-4) and removing cuda-compat-12-2 (apt purge cuda-compat-12-2) makes /opt/gdrcopy/bin/sanity -v pass with the same CUDA 12.2. That should fix the problem.
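For a quick interactive check before baking this into the image, something like the following inside the running container should work (a sketch; the package names assume the CUDA 12.2 base image discussed here):

# Swap the forward-compat package and re-run the gdrcopy sanity check
apt purge -y cuda-compat-12-2
apt install -y cuda-compat-12-4
/opt/gdrcopy/bin/sanity -v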
@mhuguesaws - I was running the Docker image with docker run -it --gpus=all IMAGE /bin/bash, which works on the previous AMIs. I also tried docker run -it --gpus=all --privileged IMAGE /bin/bash.
Looking into the NCCL test, which I believe we run with driver 550.90.07 and CUDA 12.2 without problems.
@mhuguesaws: the new AMIs have CUDA 12.4. The solution suggested by @jahaniam resolves the issue for us:
RUN apt purge -y cuda-compat-12-2 || true
RUN apt install -y cuda-compat-12-4 && ln -s /usr/local/cuda-12.4/compat /usr/local/cuda/compat
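After rebuilding with those two lines, the fix can be verified with the same sanity check (using <IMAGE-NAME> as a placeholder for the rebuilt image tag):

# Run the gdrcopy sanity check in the rebuilt container
docker run --gpus all --device=/dev/gdrdrv --rm <IMAGE-NAME> /opt/gdrcopy/bin/sanity -v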
@kamal-rahimi what orchestration do you use, Kubernetes or something else?
Looks like we can remove cuda-compat from the base image, since gdrcopy only requires libcuda.so for the driver build. In the container we are only building the libraries.
Also, the cuda-compat module is only for forward compatibility. Since the drivers on AWS are 535 and above, which support CUDA 12.2, that should be sufficient.
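To double-check that forward compatibility is not needed, one can compare the CUDA version the host driver supports against the toolkit inside the container (a sketch; the nvidia-smi query fields are standard, and nvcc assumes a default CUDA toolkit install):

# On the host: driver version and the maximum CUDA version it supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | grep "CUDA Version"
# Inside the container: the CUDA toolkit version
nvcc --version | grep release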
@mhuguesaws I agree. We shouldn't need the cuda-compat module.
@mhuguesaws please review https://github.com/aws-samples/awsome-distributed-training/pull/476.
We are building a Docker image based on the test Dockerfile in https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile and installing PyTorch using:
This Docker image used to work fine and we could train models on DLAMIs, specifically:
Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.0 (Ubuntu 20.04) 20240825
which has the following driver and CUDA versions installed:
However, now that we are switching to the newer AMI:
Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20241027
which has the following driver and CUDA versions:
we get the following error when running PyTorch:
We need to use the latest AMIs for security updates. Could you please let me know how this issue can be resolved?