Open yjiangling opened 1 year ago
By the way, the machine on which NCCL cannot be used in the TensorFlow 2.x environment has NVIDIA GeForce 1080Ti GPUs; the other machine has NVIDIA GeForce 3080Ti GPUs.
The NCCL perf tests may need support for ncclSend/ncclRecv indeed, which was added in NCCL 2.7. The current version is NCCL 2.18 so 2.7 is old and 2.4 is very old.
You may be able to use a NCCL library compiled for CUDA 10.2 on CUDA 10.0 ... not sure but maybe worth a try. You can get up to NCCL 2.15.5 for CUDA 10.2.
Another option is simply to recompile NCCL for CUDA 10.0 yourself. It's easy, just run:
git clone https://github.com/nvidia/nccl
cd nccl
make -j
And the library should be in build/lib. If you want to create a proper Debian package (instead of setting LD_LIBRARY_PATH), you can run make pkg.debian.build, then dpkg -i build/pkg/deb/*.deb.
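To confirm which NCCL version a build tree actually produced, one option is to read the version macros from the generated header. The sketch below assumes the header is at build/include/nccl.h and contains lines of the form "#define NCCL_MAJOR 2" (likewise NCCL_MINOR and NCCL_PATCH); the function name nccl_header_version is made up for illustration.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCH read from an nccl.h header.
# Assumes the header defines NCCL_MAJOR, NCCL_MINOR and NCCL_PATCH, one per line.
nccl_header_version() {
  awk '
    $1 == "#define" && $2 == "NCCL_MAJOR" { major = $3 }
    $1 == "#define" && $2 == "NCCL_MINOR" { minor = $3 }
    $1 == "#define" && $2 == "NCCL_PATCH" { patch = $3 }
    END {
      # Fail if any of the three macros was not found.
      if (major == "" || minor == "" || patch == "") exit 1
      printf "%s.%s.%s\n", major, minor, patch
    }
  ' "$1"
}
```

From the nccl checkout, it could be called as nccl_header_version build/include/nccl.h.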
Yes, thanks a lot for the help. I tried to compile NCCL 2.8.3 for CUDA 10.0 with the following steps (only the last step differs from your suggestion; I will try dpkg -i build/pkg/deb/*.deb later):
# Build NCCL
git clone https://github.com/NVIDIA/nccl.git
cd nccl && git checkout v2.8.3-1
make -j src.build
# On a first install, the dependencies need to be installed
# Install tools to create debian packages
sudo apt install build-essential devscripts debhelper fakeroot
# Build NCCL deb package
make pkg.debian.build
ls build/pkg/deb/
# install
sudo make install
But when I run NCCL_DEBUG=INFO mpirun -n 2 python3 train.py, it gives the following errors:
Is the NCCL version still the cause? Should I try NCCL 2.15.5 for CUDA 10.2? (My machine has both CUDA 10.0 and CUDA 11.0, and the CUDA path is set to point to CUDA 10.0.) By the way, is it OK for CUDA 10.0 and CUDA 11.0 to share the same NCCL? TensorFlow 2.x actually uses CUDA 11.0, so should I install two different NCCL builds, one for each CUDA version?
That's a bit weird. It looks like NCCL was not compiled with your GPU architecture.
On which GPU did that error happen? The 1080 or the 3080?
According to https://developer.nvidia.com/cuda-gpus, the 1080 should need CUDA arch 6.1 and the 3080 would need 8.6. Maybe you can recompile NCCL with the following options to make sure you have the right archs:
make clean
make -j NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86"
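If the list of target GPUs changes, the NVCC_GENCODE value can be generated rather than typed by hand. This is a minimal sketch; gencode_flags is a hypothetical helper name, and it assumes compute capabilities are passed without the dot (61 for 6.1, 86 for 8.6).

```shell
# Hypothetical helper: build an NVCC_GENCODE value from compute capabilities.
# The function body runs in a subshell, so its variables do not leak out.
gencode_flags() (
  flags=""
  for cc in "$@"; do
    # Separate entries with a single space, with no leading space on the first.
    flags="${flags:+$flags }-gencode=arch=compute_${cc},code=sm_${cc}"
  done
  printf '%s\n' "$flags"
)
```

It could then be used as make -j NVCC_GENCODE="$(gencode_flags 61 86)". Note that nvcc rejects architectures newer than its CUDA release knows about; as far as I know, compute_86 is only accepted from CUDA 11.1 onward, so a CUDA 10.0 build would likely need to pass only 61.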
As for CUDA versions, it is advised to use a NCCL library compiled with the same CUDA version as the framework.
You can recompile NCCL for each CUDA version with:
make <make options> CUDA_HOME=/usr/local/cuda-10.0 BUILDDIR=build-10.0
make <make options> CUDA_HOME=/usr/local/cuda-11.0 BUILDDIR=build-11.0
Then set LD_LIBRARY_PATH=$NCCL_HOME/build-10.0/lib:$LD_LIBRARY_PATH when you launch with TF1+CUDA 10.0, and LD_LIBRARY_PATH=$NCCL_HOME/build-11.0/lib:$LD_LIBRARY_PATH when you use TF2+CUDA 11.0.
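To avoid editing LD_LIBRARY_PATH by hand for each environment, the per-CUDA selection can be wrapped in a small function. This is only a sketch: with_nccl_for_cuda is a hypothetical name, and it assumes NCCL_HOME points at the nccl checkout and that the build-<version>/lib directories exist as produced by the BUILDDIR settings above.

```shell
# Hypothetical wrapper: run a command with the NCCL build for one CUDA version.
# The function body runs in a subshell, so failures never exit the caller's shell.
with_nccl_for_cuda() (
  cuda_version="$1"
  shift
  libdir="${NCCL_HOME:?NCCL_HOME is not set}/build-${cuda_version}/lib"
  if [ ! -d "$libdir" ]; then
    echo "no NCCL build found at $libdir" >&2
    exit 1
  fi
  # Prepend so this build is searched before any other libnccl on the path.
  LD_LIBRARY_PATH="${libdir}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
)
```

For example, with_nccl_for_cuda 11.0 mpirun -n 2 python3 train.py would launch the TF2 job against build-11.0/lib, and the same call with 10.0 would cover the TF1 environment.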
Thanks. The error occurs on the 1080Ti GPUs in the TensorFlow 2.x environment; the 3080 is OK in both TensorFlow 1.x and TensorFlow 2.x. Maybe it is not actually loading the newly reinstalled NCCL, because the log shows the following:
Is it still using NCCL 2.4.8? But I uninstalled it, so what's wrong? I am now recompiling NCCL with NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86".
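One way to look for a leftover copy is to list every libnccl reachable through LD_LIBRARY_PATH, in search order. The sketch below is an assumption-laden aid, not a full picture of the dynamic loader: list_nccl_candidates is a hypothetical name, and it only inspects LD_LIBRARY_PATH, so it will not see ldconfig cache entries, RPATHs, or a copy bundled inside a Python environment.

```shell
# Hypothetical helper: print libnccl.so* files found in each directory of a
# colon-separated search path (defaults to LD_LIBRARY_PATH), in path order.
list_nccl_candidates() (
  search_path="${1:-${LD_LIBRARY_PATH:-}}"
  # Split on ':' with globbing disabled, then restore both settings.
  old_ifs=$IFS
  IFS=':'
  set -f
  set -- $search_path
  set +f
  IFS=$old_ifs
  for dir in "$@"; do
    [ -n "$dir" ] || continue
    for lib in "$dir"/libnccl.so*; do
      # Skip the unexpanded pattern when a directory has no match.
      [ -e "$lib" ] && printf '%s\n' "$lib"
    done
  done
  exit 0
)
```

Running list_nccl_candidates with no argument uses the current LD_LIBRARY_PATH; the first line printed is the copy found earliest on that path. For system-wide installs, ldconfig -p | grep libnccl shows what is registered in the loader cache.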
OK, thanks again. So that means installing two different NCCL builds, one for each CUDA version, right? I will try it right now.
Hi, I'm using TensorFlow with NCCL and MPI to train a model on multiple GPUs, but I've run into some problems. I have two environments built with Anaconda, TensorFlow 1.x and TensorFlow 2.x, which use different CUDA versions (CUDA 10.0 and CUDA 11.0). When I use TensorFlow 1.x to conduct multi-GPU training, everything seems OK, but when I use TensorFlow 2.x (code for Keras model training with Horovod), it goes wrong. The NCCL information is below:
Here are some questions:
The README says NCCL is installed under the /usr/local/cuda path, which means the installed NCCL version is for CUDA 10.0 (the cuda symlink points to CUDA 10.0), as displayed above, and it works normally in the TensorFlow 1.x environment. But can the machine use this NCCL in the CUDA 11.0 environment (TensorFlow 2.x)? Or should I install two different NCCL builds, one per CUDA version, on the same machine?
Does the NCCL version need to be 2.7 or newer for TensorFlow 2.x? When I test NCCL with make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/, it gives the error: identifier "ncclSend" is undefined, as below:
But on another machine, where nccl-repo-ubuntu1804-2.8.3-ga-cuda11.1/now 1-1 amd64 is installed, everything is OK. It has two CUDA versions, CUDA 11.1 and CUDA 11.6, for the TensorFlow 1.x and TensorFlow 2.x environments; NCCL works fine in both environments, and nccl-tests is also OK. What's wrong?
I'm quite confused by this problem. Can anyone give me some suggestions? Thanks a lot in advance! I would greatly appreciate it if someone could help me answer these questions.