This is from trying to update the spack package to 2.6.2 and provide NCCL/RCCL support, but it doesn't look as if it's related to spack. Building fails when I enable NCCL, but works without it; I'm puzzled why, as it must usually work.
The cmake args which fail (with openmpi-4.1.4, cuda-11.4.1, nccl-2.14.3-1) are
There are two different failures, depending on whether openmpi is built with C++ support.
With openmpi+cxx, the failure is
[ 83%] Linking CXX shared library libcosma.so
cd /tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/src/cosma && /usr/bin/cmake -E cmake_link_script CMakeFiles/cosma.dir/link.txt --verbose=1
/nobackup/projects/bdman01/mdehsdl3/spack.clean/lib/spack/env/gcc/g++ -fPIC -O2 -g -DNDEBUG -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -L/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -pthread -shared -Wl,-soname,libcosma.so -o libcosma.so CMakeFiles/cosma.dir/blas.cpp.o CMakeFiles/cosma.dir/buffer.cpp.o CMakeFiles/cosma.dir/communicator.cpp.o CMakeFiles/cosma.dir/context.cpp.o CMakeFiles/cosma.dir/interval.cpp.o CMakeFiles/cosma.dir/layout.cpp.o CMakeFiles/cosma.dir/local_multiply.cpp.o CMakeFiles/cosma.dir/mapper.cpp.o CMakeFiles/cosma.dir/math_utils.cpp.o CMakeFiles/cosma.dir/matrix.cpp.o CMakeFiles/cosma.dir/memory_pool.cpp.o CMakeFiles/cosma.dir/multiply.cpp.o CMakeFiles/cosma.dir/one_sided_communicator.cpp.o CMakeFiles/cosma.dir/strategy.cpp.o CMakeFiles/cosma.dir/two_sided_communicator.cpp.o CMakeFiles/cosma.dir/cinterface.cpp.o CMakeFiles/cosma.dir/environment_variables.cpp.o CMakeFiles/cosma.dir/pinned_buffers.cpp.o CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o -Wl,-rpath,/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/libs/COSTA/src/costa:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/libs/Tiled-MM/src/Tiled-MM:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib: ../../libs/COSTA/src/costa/libcosta.so 
../../libs/Tiled-MM/src/Tiled-MM/libTiled-MM.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib/libnccl.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib/libmpi_cxx.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib/libmpi.so /usr/lib/gcc/ppc64le-redhat-linux/8/libgomp.so /usr/lib64/libpthread.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcublas.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcudart.so
CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o: In function `cosma::gpu::check_runtime_status(cudaError)':
/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/include/openmpi/ompi/mpi/cxx/intracomm_inln.h:102: multiple definition of `cosma::gpu::check_runtime_status(cudaError)'
CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-src/src/cosma/gpu/utils.hpp:7: first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [src/cosma/CMakeFiles/cosma.dir/build.make:413: src/cosma/libcosma.so] Error 1
make[2]: Leaving directory '/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so'
and without cxx it's
[ 83%] Linking CXX shared library libcosma.so
cd /tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/src/cosma && /usr/bin/cmake -E cmake_link_script CMakeFiles/cosma.dir/link.txt --verbose=1
/nobackup/projects/bdman01/mdehsdl3/spack.clean/lib/spack/env/gcc/g++ -fPIC -O2 -g -DNDEBUG -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -L/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -pthread -shared -Wl,-soname,libcosma.so -o libcosma.so CMakeFiles/cosma.dir/blas.cpp.o CMakeFiles/cosma.dir/buffer.cpp.o CMakeFiles/cosma.dir/communicator.cpp.o CMakeFiles/cosma.dir/context.cpp.o CMakeFiles/cosma.dir/interval.cpp.o CMakeFiles/cosma.dir/layout.cpp.o CMakeFiles/cosma.dir/local_multiply.cpp.o CMakeFiles/cosma.dir/mapper.cpp.o CMakeFiles/cosma.dir/math_utils.cpp.o CMakeFiles/cosma.dir/matrix.cpp.o CMakeFiles/cosma.dir/memory_pool.cpp.o CMakeFiles/cosma.dir/multiply.cpp.o CMakeFiles/cosma.dir/one_sided_communicator.cpp.o CMakeFiles/cosma.dir/strategy.cpp.o CMakeFiles/cosma.dir/two_sided_communicator.cpp.o CMakeFiles/cosma.dir/cinterface.cpp.o CMakeFiles/cosma.dir/environment_variables.cpp.o CMakeFiles/cosma.dir/pinned_buffers.cpp.o CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o -Wl,-rpath,/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/libs/COSTA/src/costa:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/libs/Tiled-MM/src/Tiled-MM:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib: ../../libs/COSTA/src/costa/libcosta.so 
../../libs/Tiled-MM/src/Tiled-MM/libTiled-MM.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib/libnccl.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib/libmpi.so /usr/lib/gcc/ppc64le-redhat-linux/8/libgomp.so /usr/lib64/libpthread.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcublas.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcudart.so
CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o: In function `cosma::gpu::check_runtime_status(cudaError)':
/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-src/src/cosma/gpu/utils.hpp:7: multiple definition of `cosma::gpu::check_runtime_status(cudaError)'
CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-src/src/cosma/gpu/utils.hpp:7: first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [src/cosma/CMakeFiles/cosma.dir/build.make:412: src/cosma/libcosma.so] Error 1
make[2]: Leaving directory '/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey'
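Both link errors name the same symbol, cosma::gpu::check_runtime_status(cudaError), and both say it was first defined at gpu/utils.hpp:7, once via nccl_utils.cpp.o and once via gpu_aware_mpi_utils.cpp.o. That pattern is what the linker reports when a function is defined in a header without `inline` and the header is included from two source files. I haven't checked utils.hpp itself, so treat this as a guess. The sketch below is not COSMA's code: the `sketch` namespace, `fake_cuda_error` and `check_throws` are hypothetical stand-ins so that it builds without CUDA. It only shows the general shape of a header-defined function that can be included from several translation units.

```cpp
// Sketch only: a stand-in for a header such as gpu/utils.hpp.
// The enum and function bodies are hypothetical, not COSMA's actual code.
#include <stdexcept>
#include <string>

namespace sketch {

// Stand-in for cudaError so the example builds without the CUDA toolkit.
enum class fake_cuda_error { success = 0, failure = 1 };

// Without `inline`, every .cpp file that includes this header emits its own
// external definition, and linking two such objects fails with
// "multiple definition of ...". With `inline`, the definitions are merged.
inline void check_runtime_status(fake_cuda_error status) {
    if (status != fake_cuda_error::success) {
        throw std::runtime_error("runtime error code " +
                                 std::to_string(static_cast<int>(status)));
    }
}

// Small helper for trying the sketch: reports whether the check threw.
inline bool check_throws(fake_cuda_error status) {
    try {
        check_runtime_status(status);
        return false;
    } catch (const std::runtime_error&) {
        return true;
    }
}

} // namespace sketch
```

If that is indeed the cause, it would also explain why the build works without NCCL: presumably only one of the two source files is compiled then, so only one definition reaches the linker.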
By the way, as something else to add, what exactly does COSMA_WITH_GPU_AWARE_MPI mean? In the case of openmpi, it could be configuring --with-cuda and/or using a UCX built with cuda and/or gdrcopy.
The build succeeds when -DCOSMA_WITH_NCCL=ON is removed.