Palace v0.13.0 fails to compile with CUDA

alextusnin commented 1 month ago

Hello, I tried to use the recent v0.13.0 release with CUDA support. I built it from source with almost default options

cd build
cmake -DCMAKE_CXX_FLAGS='-fPIC -O3' -DCMAKE_C_FLAGS='-fPIC -O3' -DPALACE_WITH_CUDA=ON  ..
make

and it worked flawlessly. However, when I try to build palace with the flag -DPALACE_WITH_CUDA=ON, the installation crashes reaching the magma library: I get the following error

Cloning into 'magma'...
HEAD is now at e20a6748 Merged in dgeev-docs (pull request #51)
[ 14%] No update step for 'magma'
[ 15%] Performing patch step for 'magma'
make.inc:1: *** missing separator.  Stop.
make[2]: *** [extern/CMakeFiles/magma.dir/build.make:115: extern/magma-cmake/src/magma-stamp/magma-patch] Error 2
make[1]: *** [CMakeFiles/Makefile2:383: extern/CMakeFiles/magma.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Is it an issue with CMake or with my environment/cuda drivers? Can you please suggest a solution for this issue? Thanks!

sebastiangrimberg commented 1 month ago

Hi @alextusnin, I'm unable to reproduce this on a Linux system (Amazon Linux 2) with CMake v3.26.6. Here is the output I get from make magma after the same cmake configuration you used:

[  0%] Creating directories for 'magma'
[ 16%] Performing download step (git clone) for 'magma'
Cloning into 'magma'...
HEAD is now at e20a6748 Merged in dgeev-docs (pull request #51)
[ 16%] No update step for 'magma'
[ 33%] Performing patch step for 'magma'
[ 50%] Performing configure step for 'magma'
-- The C compiler identification is GNU 12.3.0
-- The CXX compiler identification is GNU 12.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/gcc/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/gcc/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test COMPILER_SUPPORTS_CXX11
-- Performing Test COMPILER_SUPPORTS_CXX11 - Success
-- Performing Test COMPILER_SUPPORTS_CXX0X
-- Performing Test COMPILER_SUPPORTS_CXX0X - Success
-- Performing Test COMPILER_SUPPORTS_FPIC
-- Performing Test COMPILER_SUPPORTS_FPIC - Success
-- Performing Test COMPILER_SUPPORTS_C99
-- Performing Test COMPILER_SUPPORTS_C99 - Success
-- Building without Fortran compiler
--     Using -DADD_ for Fortran calling convention
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Found OpenMP
--     OpenMP_C_FLAGS   -fopenmp
--     OpenMP_CXX_FLAGS -fopenmp
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.4.131")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDA
--     CUDA_CUDART_LIBRARY: CUDA::cudart
--     compile for CUDA arch 5.2 (Maxwell)
-- Define -DMAGMA_HAVE_CUDA -DMAGMA_CUDA_ARCH_MIN=520
-- User set LAPACK_LIBRARIES. To change, edit LAPACK_LIBRARIES using ccmake (set to empty to enable search).
-- MKLROOT set to /apps/intel/oneapi/mkl/2024.0. To change, edit MKLROOT using ccmake.
-- Define -DMAGMA_WITH_MKL
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of void*
-- Check size of void* - done
CMake Deprecation Warning at CMakeLists.txt:752 (cmake_policy):
  The OLD behavior for policy CMP0037 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

-- pkgconfig lib/pkgconfig/magma.pc
-- Flags
--     CMAKE_INSTALL_PREFIX:  /data/home/sjg/software/palace/build-cuda-x86_64
--     CFLAGS:                -fPIC -O3 -std=c99   -DADD_ -fopenmp -Wall -Wno-unused-function
--     CXXFLAGS:              -fPIC -O3 -std=c++11 -DADD_ -fopenmp -Wall -Wno-unused-function
--     NVCCFLAGS:
--     FFLAGS:
--     LIBS:                   /apps/intel/oneapi/mkl/2024.0/lib/libmkl_intel_lp64.so /apps/intel/oneapi/mkl/2024.0/lib/libmkl_sequential.so /apps/intel/oneapi/mkl/2024.0/lib/libmkl_core.so -lpthread -lm -ldl -L/usr/local/cuda/lib64 -lcudart -lcublas -lcusparse
--     blas_fix:                (MacOS Accelerate only)
--     LAPACK_LIBRARIES:      /apps/intel/oneapi/mkl/2024.0/lib/libmkl_intel_lp64.so;/apps/intel/oneapi/mkl/2024.0/lib/libmkl_sequential.so;/apps/intel/oneapi/mkl/2024.0/lib/libmkl_core.so;-lpthread;-lm;-ldl
--     INCLUDE_DIRECTORIES:   -I/usr/local/cuda/include -I/apps/intel/oneapi/mkl/2024.0/include -I/data/home/sjg/software/palace/build-cuda-x86_64/extern/magma-build/include -I/data/home/sjg/software/palace/build-cuda-x86_64/extern/magma/include -I/data/home/sjg/software/palace/build-cuda-x86_64/extern/magma/control -I/data/home/sjg/software/palace/build-cuda-x86_64/extern/magma/magmablas -I/data/home/sjg/software/palace/build-cuda-x86_64/extern/magma/sparse/include -I/data/home/sjg/software/palace/build-cuda-x86_64/extern/magma/sparse/control -I/data/home/sjg/software/palace/build-cuda-x86_64/extern/magma/testing
--     CUDA_CUDART_LIBRARY:   CUDA::cudart
--     CUDA_CUBLAS_LIBRARIES: CUDA::cublas
--     CUDA_cusparse_LIBRARY: CUDA::cusparse
--     Fortran modules:
-- Configuring done (18.6s)
...

Could you perhaps share the output of make magma VERBOSE=1 in a clean build directory following the cmake configuration step, to see what the issue leading to your missing separator error is?

sebastiangrimberg commented 1 month ago

I think I may have found the issue and it could be due to cross-platform compatibility depending on what OS you are using. Can you try using the patch in the branch sjg/magma-patch-fix.

alextusnin commented 1 month ago

Hi @sebastiangrimberg, Thank you very much, the patch works! By the way, I am using ubuntu 22.04.4 with cmake 3.29.3, cuda 12.5, and gcc 11.4

Unfortunately, Now the installation crashes during the mfem compilation (with the same cmake flags as above). Please see the log file pure mfem build mfem.log. I saw the compilation error is related to the inline specifiers, so I decided to explicitly pass the std=c++17 flag to cmake. It did not work either and the hypre does not compile in this case... (log mfem_c++17.log)

sebastiangrimberg commented 1 month ago

Thanks for the notes. I looked into it and found a bug in MFEM for old CUDA architectures which has not been caught yet. In MFEM's backends.hpp, I think the following diff is needed:

diff --git a/general/backends.hpp b/general/backends.hpp
index 0c2f3a531..c9ab7379b 100644
--- a/general/backends.hpp
+++ b/general/backends.hpp
@@ -65,7 +65,7 @@

 // 'double' atomicAdd implementation for previous versions of CUDA
 #if defined(MFEM_USE_CUDA) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
-MFEM_DEVICE inline real_t atomicAdd(real_t *add, real_t val)
+MFEM_DEVICE inline mfem::real_t atomicAdd(mfem::real_t *add, mfem::real_t val)
 {
    unsigned long long int *ptr = (unsigned long long int *) add;
    unsigned long long int old = *ptr, reg;
(END)

I've submitted a PR to MFEM to fix this: https://github.com/mfem/mfem/pull/4332.

In the mean time, I have pushed the required patch and changes to the same sjg/magma-patch-fix branch. Could you clear your build directory and try again? Thanks for your help debugging this. This resolved the issue when building with CMAKE_CUDA_ARCHITECTURES=52 for me.

As another note, recent versions of CMake now set CMAKE_CUDA_ARCHITECTURES to a default value which happens to be quite old. This prevents Palace from setting the default of 70 (set here, which is applicable for V100 GPUs and newer). I would recommend always providing this value for the GPUs you are building for. This would have avoided the above MFEM bug as well.

sebastiangrimberg commented 4 weeks ago

Hi @alextusnin, did you get a chance to try out #258 and see if it resolves your build issues? If so I'd like to merge it ASAP in case other users are experiencing similar problems. Thanks!

alextusnin commented 4 weeks ago

Hi @sebastiangrimberg, Thanks for your help! I pulled the #258 and tried to build it. The process passed the previous problem but failed towards the end... Please see the log log.log

sebastiangrimberg commented 4 weeks ago

Thanks for continuing to test for me. The error during linking I have not seen before:

/tmp/ccMdsxcH.s: Assembler messages:
/tmp/ccMdsxcH.s:46: Error: symbol `fatbinData' is already defined

It seems related to link-time optimization (LTO) which is usually specified as a -flto flag to the compiler but maybe on Ubuntu it is being added by default and incompatible with nvcc from CUDA (I suspect this from https://github.com/lammps/lammps/issues/3808 and https://forums.developer.nvidia.com/t/link-time-optimization-with-cuda-on-linux-flto/55530/5).

Perhaps you can bypass this by specifying the CUDA flags manually during CMake configuration, adding to your cmake command:

-DCMAKE_CUDA_FLAGS="-Xcompiler \"-fPIC -O3\""

I wonder if this will fix it by forcing CMake to not provide -flto to nvcc.

alextusnin commented 3 weeks ago

Hi @sebastiangrimberg I tried to do the build with the flags you proposed, but I get basically the same error

lto-wrapper: warning: using serial compilation of 128 LTRANS jobs
/tmp/ccoChGXQ.s: Assembler messages:
/tmp/ccoChGXQ.s:46: Error: symbol `fatbinData' is already defined

sebastiangrimberg commented 3 weeks ago

Alright, another idea (from here or here) is to explicitly add the -fno-lto flag to disable LTO. Can you try adding that to your CMAKE_CXX_FLAGS and CMAKE_C_FLAGS? You may also need to add it to the -Xcompiler arguments in your CMAKE_CUDA_FLAGS as well, I'm not sure.

This doesn't seem to be a Palace problem, though. So if this doesn't work maybe working backwards and trying to just build MFEM separately from source with CUDA support using CMake on your system may be a better way to debug. I suspect somewhere in your compiler and system setup there is a default which is making MFEM + CMake break. A final option would be to just try and experiment with using Spack to build the whole stack as well. Thanks for your help in debugging this as I am still unable to reproduce with CUDA builds of Palace on a RHEL-based OS.

sebastiangrimberg commented 3 weeks ago

Just another minor update here @alextusnin, I have tried to reproduce the issue on a clean Ubuntu 24.04 installation with CUDA 12.5 and Open MPI but everything works just fine. I configured with:

cmake ../palace -DPALACE_WITH_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++

alextusnin commented 3 weeks ago

Thanks a lot for your help @sebastiangrimberg I tried what you suggested and it still didn't work. I'll see later what I can do. Maybe I'll upgrade to 24.04 later and try once again

sebastiangrimberg commented 3 weeks ago

Shoot, that's unfortunate to hear. I wish I could be of more help in debugging but it's hard to know how to fix it without the ability to reproduce. I have gone ahead and merged https://github.com/awslabs/palace/pull/258 at least, which resolves the few issues we did solve so thank you for your help there! Hopefully this will resolve itself in time, or someone else might be able to chime in if they experience the same issue.

awslabs / palace

Palace v0.13.0 fails to compile with CUDA #256