NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

How the NCCL support both TensorFlow1 and TensorFlow2 environments? #889

Open yjiangling opened 1 year ago

yjiangling commented 1 year ago

Hi, I'm using TensorFlow with NCCL and MPI for multi-GPU training, but I've run into a problem. I have two Anaconda environments, TensorFlow 1.x and TensorFlow 2.x, which use different CUDA versions (CUDA 10.0 and CUDA 11.0). When I run multi-GPU training in the TensorFlow 1.x environment everything seems OK, but in TensorFlow 2.x (a Keras model trained with Horovod) it fails. The NCCL information is below:

Sorting... Done
Full Text Search... Done

libnccl-dev/unknown,now 2.4.8-1+cuda10.0 amd64 [installed]
  NVIDIA Collectives Communication Library (NCCL) Development Files

libnccl2/unknown,now 2.4.8-1+cuda10.0 amd64 [installed]
  NVIDIA Collectives Communication Library (NCCL) Runtime

nccl-repo-ubuntu1804-2.4.8-ga-cuda10.0/now 1-1 amd64 [installed,local]
  nccl repository configuration files

Here are my questions:

  1. The README says NCCL is installed under /usr/local/cuda, which means the installed NCCL build is for CUDA 10.0 (the cuda symlink points to CUDA 10.0), as shown above, and it can be used normally in the TensorFlow 1.x environment. But can the CUDA 11.0 environment (TensorFlow 2.x) use this same NCCL, or should I install a separate NCCL for each CUDA version on one machine?

  2. Does TensorFlow 2.x require NCCL 2.7 or later? When I built the NCCL tests with make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/, it failed with: identifier "ncclSend" is undefined, as below: (screenshot)

But on another machine, where nccl-repo-ubuntu1804-2.8.3-ga-cuda11.1/now 1-1 amd64 is installed, everything works: it has two CUDA versions (CUDA 11.1 and CUDA 11.6) for the TensorFlow 1.x and TensorFlow 2.x environments, NCCL works fine in both, and nccl-tests also passes. What's wrong?

  3. If the NCCL version is the reason, how can I use a new enough NCCL with an older CUDA version such as CUDA 10.0? The newest NCCL for CUDA 10.0 is 2.6.4; there is no 2.7+ build for that CUDA (see https://developer.nvidia.com/nccl/nccl-legacy-downloads).

I'm quite confused by this problem. Can anyone give me some suggestions? Thanks a lot in advance; I would greatly appreciate any help.
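As an editorial aside, one way to see what the installed headers declare is to read the version macros in nccl.h. The sketch below parses a sample header carrying the 2.4.8 values from the apt listing above (the sample path is made up; on a real machine you would point it at your installed nccl.h):

```shell
# Sketch: read the NCCL version from nccl.h macros. /tmp/sample_nccl.h is a
# stand-in for the real header, pre-filled with the 2.4.8 values shown above.
hdr=/tmp/sample_nccl.h
cat > "$hdr" <<'EOF'
#define NCCL_MAJOR 2
#define NCCL_MINOR 4
#define NCCL_PATCH 8
EOF
eval "$(awk '$1 == "#define" && $2 ~ /^NCCL_(MAJOR|MINOR|PATCH)$/ {print $2"="$3}' "$hdr")"
echo "installed: $NCCL_MAJOR.$NCCL_MINOR.$NCCL_PATCH"
# ncclSend/ncclRecv were added in NCCL 2.7
if [ "$NCCL_MAJOR" -gt 2 ] || [ "$NCCL_MINOR" -ge 7 ]; then
  echo "point-to-point supported"
else
  echo "too old for ncclSend/ncclRecv"
fi
```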

yjiangling commented 1 year ago

By the way, the machine where NCCL cannot be used in the TensorFlow 2.x environment has NVIDIA GeForce 1080 Ti GPUs; the other machine has NVIDIA GeForce 3080 Ti GPUs.

sjeaugey commented 1 year ago

The NCCL perf tests may need support for ncclSend/ncclRecv indeed, which was added in NCCL 2.7. The current version is NCCL 2.18 so 2.7 is old and 2.4 is very old.

You may be able to use a NCCL library compiled for CUDA 10.2 on CUDA 10.0 ... not sure but maybe worth a try. You can get up to NCCL 2.15.5 for CUDA 10.2.

Another option is simply to recompile NCCL for CUDA 10.0 yourself. It's easy, just run:

git clone https://github.com/nvidia/nccl
cd nccl
make -j

And the library should be in build/lib. If you want to create a proper debian package (instead of setting LD_LIBRARY_PATH) you can run make pkg.debian.build then dpkg -i build/pkg/deb/*.deb.

yjiangling commented 1 year ago

> The NCCL perf tests may need support for ncclSend/ncclRecv indeed, which was added in NCCL 2.7. The current version is NCCL 2.18 so 2.7 is old and 2.4 is very old.
>
> You may be able to use a NCCL library compiled for CUDA 10.2 on CUDA 10.0 ... not sure but maybe worth a try. You can get up to NCCL 2.15.5 for CUDA 10.2.
>
> Another option is simply to recompile NCCL for CUDA 10.0 yourself. It's easy, just run:
>
> git clone https://github.com/nvidia/nccl
> cd nccl
> make -j
>
> And the library should be in build/lib. If you want to create a proper debian package (instead of setting LD_LIBRARY_PATH) you can run make pkg.debian.build then dpkg -i build/pkg/deb/*.deb.

Yes, thanks a lot for the help. I tried to compile NCCL 2.8.3 for CUDA 10.0 with the following steps (only the last step differs from your suggestion; I will try again later with dpkg -i build/pkg/deb/*.deb):

# Build NCCL
git clone https://github.com/NVIDIA/nccl.git
cd nccl && git checkout v2.8.3-1
make -j src.build

# On a first install, the packaging dependencies are needed
# Install tools to create debian packages
sudo apt install build-essential devscripts debhelper fakeroot
# Build NCCL deb package
make pkg.debian.build
ls build/pkg/deb/

# install
sudo make install

But when I run NCCL_DEBUG=INFO mpirun -n 2 python3 train.py, it gives the following errors: (screenshot)

Is the NCCL version still the reason? Should I try NCCL 2.15.5 for CUDA 10.2? (My machine has CUDA 10.0 and CUDA 11.0, and the cuda path points to CUDA 10.0.) By the way, is it OK for CUDA 10.0 and CUDA 11.0 to share the same NCCL? TensorFlow 2.x actually uses CUDA 11.0, so should I install a different NCCL for each CUDA version?

sjeaugey commented 1 year ago

That's a bit weird. It looks like NCCL was not compiled with your GPU architecture.

On which GPU did that error happen? The 1080 or the 3080?

According to https://developer.nvidia.com/cuda-gpus, the 1080 should need CUDA arch 6.1 and the 3080 would need 8.6. Maybe you can recompile NCCL with the following options to make sure you have the right archs:

make clean
make -j NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86"
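As a tiny illustrative sketch (arch_for_gpu is a made-up helper, not an NCCL or CUDA tool), the two gencode flags above line up with the GPU models in this thread like so:

```shell
# Hypothetical helper: map the GPU models discussed here to the NVCC
# gencode flags suggested above.
arch_for_gpu() {
  case "$1" in
    *1080*) echo "-gencode=arch=compute_61,code=sm_61" ;;  # Pascal, e.g. GTX 1080 Ti
    *3080*) echo "-gencode=arch=compute_86,code=sm_86" ;;  # Ampere, e.g. RTX 3080 Ti
    *)      echo "unknown GPU: $1" ;;
  esac
}
arch_for_gpu "GeForce GTX 1080 Ti"
arch_for_gpu "GeForce RTX 3080 Ti"
```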

sjeaugey commented 1 year ago

As for CUDA versions, it is advised to use a NCCL library compiled with the same CUDA version as the framework.

You can recompile NCCL for each CUDA version with:

make <make options> CUDA_HOME=/usr/local/cuda-10.0 BUILDDIR=build-10.0
make <make options> CUDA_HOME=/usr/local/cuda-11.0 BUILDDIR=build-11.0

Then set LD_LIBRARY_PATH=$NCCL_HOME/build-10.0/lib:$LD_LIBRARY_PATH when you launch with TF1+CUDA 10.0 and LD_LIBRARY_PATH=$NCCL_HOME/build-11.0/lib:$LD_LIBRARY_PATH when you use TF2+CUDA 11.0.
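A possible way to wrap that per-environment switch in a small launcher script (NCCL_HOME and the build-&lt;version&gt; layout follow the BUILDDIR names above and are assumptions, not fixed NCCL paths):

```shell
#!/bin/sh
# Sketch: prepend the NCCL build matching the CUDA toolkit in use, then
# launch training. Set CUDA_VER=11.0 for the TF2 environment.
NCCL_HOME=${NCCL_HOME:-/tmp/nccl}
CUDA_VER=${CUDA_VER:-10.0}
export LD_LIBRARY_PATH="$NCCL_HOME/build-$CUDA_VER/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
# then e.g.: mpirun -n 2 python3 train.py   (needs GPUs, left commented out)
```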

yjiangling commented 1 year ago

> That's a bit weird. It looks like NCCL was not compiled with your GPU architecture.
>
> On which GPU did that error happen? The 1080 or the 3080?
>
> According to https://developer.nvidia.com/cuda-gpus, the 1080 should need CUDA arch 6.1 and the 3080 would need 8.6. Maybe you can recompile NCCL with the following options to make sure you have the right archs:
>
> make clean
> make -j NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86"

Thanks. The 1080 Ti GPUs give this error in the TensorFlow 2.x environment; the 3080 is OK in both TensorFlow 1.x and TensorFlow 2.x. Maybe it really isn't calling the newly installed NCCL, because the log shows this:

(screenshot of NCCL log)

It still uses NCCL 2.4.8? But I uninstalled it; what's wrong? I am recompiling NCCL with NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86" now.
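One common cause of this symptom: the dynamic linker loads the first libnccl.so.2 it finds on the search path, so a stale copy earlier in the path shadows a fresh build. A pure-shell simulation of that lookup (the /tmp directories are made up for illustration):

```shell
# Simulate the linker's first-match search over LD_LIBRARY_PATH-style entries.
mkdir -p /tmp/sys_lib /tmp/new_nccl/lib
touch /tmp/sys_lib/libnccl.so.2 /tmp/new_nccl/lib/libnccl.so.2
search="/tmp/sys_lib:/tmp/new_nccl/lib"   # stale directory listed first
IFS=:
hit=""
for d in $search; do
  if [ -e "$d/libnccl.so.2" ]; then hit="$d/libnccl.so.2"; break; fi
done
unset IFS
echo "would load: $hit"   # the stale copy wins despite the new build
```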

yjiangling commented 1 year ago

> As for CUDA versions, it is advised to use a NCCL library compiled with the same CUDA version as the framework.
>
> You can recompile NCCL for each CUDA version with:
>
> make <make options> CUDA_HOME=/usr/local/cuda-10.0 BUILDDIR=build-10.0
> make <make options> CUDA_HOME=/usr/local/cuda-11.0 BUILDDIR=build-11.0
>
> Then set LD_LIBRARY_PATH=$NCCL_HOME/build-10.0/lib:$LD_LIBRARY_PATH when you launch with TF1+CUDA 10.0 and LD_LIBRARY_PATH=$NCCL_HOME/build-11.0/lib:$LD_LIBRARY_PATH when you use TF2+CUDA 11.0.

OK, thanks again. That is to say, install a separate NCCL build for each CUDA version, right? I will try it right now.