aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

EFA Does Not Work on New NVIDIA Driver and CUDA Versions: `system has unsupported display driver` #475

Closed kamal-rahimi closed 3 days ago

kamal-rahimi commented 1 week ago

We are building a Docker image based on the test Dockerfile at https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile and installing PyTorch with:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
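
For reference, the image definition is essentially just these two steps; a minimal sketch (the base image tag is a placeholder, not our exact Dockerfile):

# Sketch only: extend the nccl-tests image and add PyTorch wheels built for CUDA 12.1
FROM nccl-tests:latest
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121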

This image used to work fine, and we could train models on DLAMIs, specifically Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.0 (Ubuntu 20.04) 20240825, which has the following driver and CUDA versions installed:

ubuntu@ip-10-67-121-83:~$ nvidia-smi
Thu Oct 31 20:55:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   32C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
ubuntu@ip-10-67-121-83:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

However, we are now switching to a newer AMI, Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20241027, which has the following driver and CUDA versions installed:

ubuntu@ip-10-67-120-15:~$ nvidia-smi
Thu Oct 31 21:06:58 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   33C    P8             10W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
ubuntu@ip-10-67-120-15:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

We get the following error when running PyTorch:

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
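
A minimal way to trigger the same CUDA initialization check inside the container (rather than the full training job) is:

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"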

We need to use the latest AMIs for security updates. Could you please let me know how this issue can be resolved?

jahaniam commented 6 days ago

Looks like it is related to the cuda-compat package not supporting the 550 driver series.
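
A quick way to see the mismatch (host driver version vs. the compat package baked into the image; `<IMAGE-NAME>` is a placeholder):

nvidia-smi --query-gpu=driver_version --format=csv,noheader    # on the host
docker run --rm <IMAGE-NAME> dpkg -l 'cuda-compat-*'           # inside the container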

This PR is closely related and tried to solve a similar version-mismatch issue, but ended up with this note:

## NOTE: if `nccl-tests` or `/opt/gdrcopy/bin/sanity -v` crashes with incompatible version, ensure
## that the cuda-compat-xx-x package is the latest.

The sanity check indeed fails with the new 550 driver on the host and CUDA 12.2 in the container.

I'm not quite sure whether, in this case where the host driver supports a higher CUDA version than the one in the container, we need to use the cuda-compat package at all.

@mhuguesaws @verdimrc I think the best way forward is to go with this PR; I don't know if that's feasible. Any ideas? Another alternative is to remove /usr/local/cuda/compat from LD_LIBRARY_PATH and LIBRARY_PATH.
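
If we go with that second alternative, a rough sketch of filtering the compat directory out of the search paths at container start (assuming it is only referenced through these two variables):

export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -v '^/usr/local/cuda/compat' | paste -sd: -)
export LIBRARY_PATH=$(echo "$LIBRARY_PATH" | tr ':' '\n' | grep -v '^/usr/local/cuda/compat' | paste -sd: -)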

Thank you for your time.

mhuguesaws commented 6 days ago

Can you provide the command line you use to start the container?

jahaniam commented 6 days ago

> Can you provide the command line you use to start the container?

I just found out I need to use `docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/gdrdrv -it --rm <IMAGE-NAME> /bin/bash`.

With the updated CUDA, `/opt/gdrcopy/bin/sanity -v` passes.

jahaniam commented 6 days ago

Also, if upgrading CUDA is not possible, installing the newer compat package (`apt install cuda-compat-12-4`) and removing the old one (`apt purge cuda-compat-12-2`) makes `/opt/gdrcopy/bin/sanity -v` pass with the same CUDA 12.2. That should fix the problem.
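
Roughly, inside the running container that would be (same package names as above; sanity binary from the existing image):

apt purge -y cuda-compat-12-2
apt install -y cuda-compat-12-4
/opt/gdrcopy/bin/sanity -v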

kamal-rahimi commented 6 days ago

@mhuguesaws I was running the container with `docker run -it --gpus=all IMAGE /bin/bash` on the previous AMIs, and that worked. I also tried `docker run -it --gpus=all --privileged IMAGE /bin/bash`.

mhuguesaws commented 3 days ago

Looking into the NCCL tests, which I believe we run with driver 550.90.07 and CUDA 12.2 without problems.

kamal-rahimi commented 3 days ago

@mhuguesaws: the new AMIs have CUDA 12.4. The solution suggested by @jahaniam resolves the issue for us:

RUN apt purge -y cuda-compat-12-2 || true
RUN apt install -y cuda-compat-12-4 && ln -s /usr/local/cuda-12.4/compat /usr/local/cuda/compat
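
After rebuilding, CUDA visibility can be re-checked with something like (image tag is a placeholder):

docker run --gpus all --rm <IMAGE-NAME> python3 -c "import torch; print(torch.cuda.is_available())"
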
mhuguesaws commented 3 days ago

@kamal-rahimi what orchestration do you use, Kubernetes or something else?

mhuguesaws commented 3 days ago

Looks like we can remove cuda-compat from the base image, since gdrcopy only requires libcuda.so for the driver build. In the container we are only building the libraries.

Also, the cuda-compat package is only for forward compatibility. Since the drivers on AWS are 535 and above, which already support CUDA 12.2, that should be sufficient.
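
Concretely, dropping it from the image could look roughly like this (assuming the base Dockerfile currently installs cuda-compat-12-2; the container then relies on the libcuda injected by the NVIDIA container runtime):

RUN apt purge -y cuda-compat-12-2 || true \
 && rm -rf /usr/local/cuda/compat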

jahaniam commented 3 days ago

@mhuguesaws I agree. We shouldn't need the cuda-compat module.

jahaniam commented 3 days ago

@mhuguesaws please review https://github.com/aws-samples/awsome-distributed-training/pull/476.