eth-cscs / COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
BSD 3-Clause "New" or "Revised" License

build failure with nccl #121

Open loveshack opened 1 year ago

loveshack commented 1 year ago

This came up while trying to update the spack package to 2.6.2 and add NCCL/RCCL support, but it doesn't look as if it's related to spack. Building fails when I enable NCCL, but works without it; I'm puzzled why, as it must usually work.

The cmake args which fail (with openmpi-4.1.4, cuda-11.4.1, nccl-2.14.3-1) are

-DCOSMA_WITH_TESTS:STRING=OFF -DCOSMA_WITH_APPS:STRING=OFF -DCOSMA_WITH_PROFILING:STRING=OFF -DCOSMA_WITH_BENCHMARKS:STRING=OFF -DCOSMA_BLAS:STRING=CUDA -DCOSMA_SCALAPACK:STRING=CUSTOM -DBUILD_SHARED_LIBS=ON -DCOSMA_WITH_GPU_AWARE_MPI:STRING=ON -DCOSMA_WITH_NCCL=ON

It succeeds when -DCOSMA_WITH_NCCL=ON is removed.

There are two different failures, depending on whether openmpi is built with C++ support.

With openmpi+cxx, the failure is

[ 83%] Linking CXX shared library libcosma.so
cd /tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/src/cosma && /usr/bin/cmake -E cmake_link_script CMakeFiles/cosma.dir/link.txt --verbose=1
/nobackup/projects/bdman01/mdehsdl3/spack.clean/lib/spack/env/gcc/g++ -fPIC -O2 -g -DNDEBUG -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -L/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -pthread -shared -Wl,-soname,libcosma.so -o libcosma.so CMakeFiles/cosma.dir/blas.cpp.o CMakeFiles/cosma.dir/buffer.cpp.o CMakeFiles/cosma.dir/communicator.cpp.o CMakeFiles/cosma.dir/context.cpp.o CMakeFiles/cosma.dir/interval.cpp.o CMakeFiles/cosma.dir/layout.cpp.o CMakeFiles/cosma.dir/local_multiply.cpp.o CMakeFiles/cosma.dir/mapper.cpp.o CMakeFiles/cosma.dir/math_utils.cpp.o CMakeFiles/cosma.dir/matrix.cpp.o CMakeFiles/cosma.dir/memory_pool.cpp.o CMakeFiles/cosma.dir/multiply.cpp.o CMakeFiles/cosma.dir/one_sided_communicator.cpp.o CMakeFiles/cosma.dir/strategy.cpp.o CMakeFiles/cosma.dir/two_sided_communicator.cpp.o CMakeFiles/cosma.dir/cinterface.cpp.o CMakeFiles/cosma.dir/environment_variables.cpp.o CMakeFiles/cosma.dir/pinned_buffers.cpp.o CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o  -Wl,-rpath,/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/libs/COSTA/src/costa:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/libs/Tiled-MM/src/Tiled-MM:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib: ../../libs/COSTA/src/costa/libcosta.so 
../../libs/Tiled-MM/src/Tiled-MM/libTiled-MM.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib/libnccl.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib/libmpi_cxx.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib/libmpi.so /usr/lib/gcc/ppc64le-redhat-linux/8/libgomp.so /usr/lib64/libpthread.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcublas.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcudart.so 
CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o: In function `cosma::gpu::check_runtime_status(cudaError)':
/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/include/openmpi/ompi/mpi/cxx/intracomm_inln.h:102: multiple definition of `cosma::gpu::check_runtime_status(cudaError)'
CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-src/src/cosma/gpu/utils.hpp:7: first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [src/cosma/CMakeFiles/cosma.dir/build.make:413: src/cosma/libcosma.so] Error 1
make[2]: Leaving directory '/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so'

and without cxx it's

[ 83%] Linking CXX shared library libcosma.so
cd /tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/src/cosma && /usr/bin/cmake -E cmake_link_script CMakeFiles/cosma.dir/link.txt --verbose=1
/nobackup/projects/bdman01/mdehsdl3/spack.clean/lib/spack/env/gcc/g++ -fPIC -O2 -g -DNDEBUG -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -L/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -pthread -shared -Wl,-soname,libcosma.so -o libcosma.so CMakeFiles/cosma.dir/blas.cpp.o CMakeFiles/cosma.dir/buffer.cpp.o CMakeFiles/cosma.dir/communicator.cpp.o CMakeFiles/cosma.dir/context.cpp.o CMakeFiles/cosma.dir/interval.cpp.o CMakeFiles/cosma.dir/layout.cpp.o CMakeFiles/cosma.dir/local_multiply.cpp.o CMakeFiles/cosma.dir/mapper.cpp.o CMakeFiles/cosma.dir/math_utils.cpp.o CMakeFiles/cosma.dir/matrix.cpp.o CMakeFiles/cosma.dir/memory_pool.cpp.o CMakeFiles/cosma.dir/multiply.cpp.o CMakeFiles/cosma.dir/one_sided_communicator.cpp.o CMakeFiles/cosma.dir/strategy.cpp.o CMakeFiles/cosma.dir/two_sided_communicator.cpp.o CMakeFiles/cosma.dir/cinterface.cpp.o CMakeFiles/cosma.dir/environment_variables.cpp.o CMakeFiles/cosma.dir/pinned_buffers.cpp.o CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o  -Wl,-rpath,/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/libs/COSTA/src/costa:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/libs/Tiled-MM/src/Tiled-MM:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib: ../../libs/COSTA/src/costa/libcosta.so 
../../libs/Tiled-MM/src/Tiled-MM/libTiled-MM.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib/libnccl.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib/libmpi.so /usr/lib/gcc/ppc64le-redhat-linux/8/libgomp.so /usr/lib64/libpthread.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcublas.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcudart.so 
CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o: In function `cosma::gpu::check_runtime_status(cudaError)':
/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-src/src/cosma/gpu/utils.hpp:7: multiple definition of `cosma::gpu::check_runtime_status(cudaError)'
CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-src/src/cosma/gpu/utils.hpp:7: first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [src/cosma/CMakeFiles/cosma.dir/build.make:412: src/cosma/libcosma.so] Error 1
make[2]: Leaving directory '/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey'

By the way, as a separate question: what exactly does COSMA_WITH_GPU_AWARE_MPI mean? In the case of openmpi, it could mean configuring --with-cuda and/or using a UCX built with cuda and/or gdrcopy.

simonpintarelli commented 1 year ago

This should be fixed in https://github.com/eth-cscs/COSMA/pull/130
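
For readers hitting the same error: both logs report `cosma::gpu::check_runtime_status(cudaError)` as defined at `utils.hpp:7` in two object files (`nccl_utils.cpp.o` and `gpu_aware_mpi_utils.cpp.o`). That pattern usually means a function is defined in a header without `inline` and the header is included from more than one translation unit. The sketch below illustrates that situation and the usual remedy. It is an assumption-laden illustration, not the COSMA source and not necessarily what the PR above changes: the `sketch` namespace, the `status` enum (a stand-in for `cudaError_t` so it compiles without CUDA), and the helpers `throws_on` and `self_check` are all hypothetical.

```cpp
// Hypothetical stand-in for a header like src/cosma/gpu/utils.hpp.
// Names and bodies are illustrative, not copied from COSMA.
#ifndef RUNTIME_STATUS_SKETCH_HPP
#define RUNTIME_STATUS_SKETCH_HPP

#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

// Stand-in for cudaError_t so the sketch compiles without CUDA.
enum class status { success = 0, failure = 1 };

// Problematic form (left commented out): a non-inline definition in a
// header is emitted into every translation unit that includes it, so
// linking two such object files reports "multiple definition".
//
// void check_runtime_status(status s) { /* ... */ }

// Usual remedy: `inline` allows the same definition to appear in several
// translation units without violating the one-definition rule.
inline void check_runtime_status(status s) {
    if (s != status::success) {
        throw std::runtime_error("runtime call failed with status " +
                                 std::to_string(static_cast<int>(s)));
    }
}

// Hypothetical helper: reports whether check_runtime_status threw.
inline bool throws_on(status s) {
    try {
        check_runtime_status(s);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Hypothetical usage: callable from any file that includes this header.
inline void self_check() {
    assert(!throws_on(status::success));
    assert(throws_on(status::failure));
}

} // namespace sketch

#endif // RUNTIME_STATUS_SKETCH_HPP
```

An alternative with the same effect is to keep only a declaration in the header and move the definition into a single `.cpp` file. That would also be consistent with the build succeeding without NCCL, if only one of the two including sources is compiled in that configuration.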