BlueBrain / nmodl

Code Generation Framework For NEURON MODeling Language
https://bluebrain.github.io/nmodl/
Apache License 2.0

Eigen compatibility with OpenACC #311

Closed iomaganaris closed 3 years ago

iomaganaris commented 4 years ago

This issue was raised when running nrnivmodl-core on the following ModelDB model https://senselab.med.yale.edu/ModelDB/ShowModel?model=19176&file=%2fHCN2k%2fhcn2.mod#tabs-2 using PGI 19.4 and PGI 19.10 with the OpenACC backend generated by NMODL. The generated C++ file for hcn2.mod contains a call to the following Eigen solver, produced by the translation of the DERIVATIVE block:

```cpp
X = Eigen::PartialPivLU<Eigen::Ref<Eigen::Matrix<double, 4, 4>>>(Jm).solve(F);
```

Compiling this file with pgc++ (with -Minfo=acc added to the compilation flags) yields the following issue:

PGCC-W-0155-External and Static variables are not supported in acc routine - _ZTVSt9exception (x86_64/core/mod2c/hcn2.cpp: 63)
std::exception::exception():
     22, include "multicore.hpp"
          32, include "membfunc.hpp"
               31, include "vector"
                    61, include "allocator.h"
                         46, include "c++allocator.h"
                              33, include "new_allocator.h"
                                   33, include "new"
                                        40, include "exception"
                                             63, Generating implicit acc routine seq
PGCC-W-0155-External and Static variables are not supported in acc routine - _ZTVSt9exception (x86_64/core/mod2c/hcn2.cpp: 63)
std::exception::exception() [subobject]:
      0, Generating implicit acc routine seq
PGCC-W-0155-External and Static variables are not supported in acc routine - _ZTVSt9exception (x86_64/core/mod2c/hcn2.cpp: 57)
std::bad_alloc::bad_alloc():
     22, include "multicore.hpp"
          32, include "membfunc.hpp"
               31, include "vector"
                    61, include "allocator.h"
                         46, include "c++allocator.h"
                              33, include "new_allocator.h"
                                   33, include "new"
                                        57, Generating implicit acc routine seq

This was caused by an exception thrown in https://gitlab.com/libeigen/eigen/-/blob/master/Eigen/src/Core/util/Memory.h#L70. After working around it by commenting out the problematic line, there was a further issue with `std::atomic` coming from https://gitlab.com/libeigen/eigen/-/blob/master/Eigen/src/Core/products/Parallelizer.h#L14, which was fixed by adding the -DEIGEN_HAS_CXX11_ATOMIC=0 compiler flag to pgc++:

PGCC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load/store type (x86_64/core/mod2c/hcn2.cpp: 1)
std::__atomic_base<int>::store(int, std::memory_order):
     32, include "LU"
          11, include "Core"
              297, include "Parallelizer.h"
                    14, include "atomic"
                         41, include "atomic_base.h"
                             468, Generating implicit acc routine seq
                                  Generating acc routine seq
                                  Generating Tesla code
PGCC-F-0704-Compilation aborted due to previous errors. (x86_64/core/mod2c/hcn2.cpp
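
The `<atomic>` workaround could look like the following when compiling a generated file by hand; this is a hedged sketch, with the define taken from the errors above and the remaining flags and file names being placeholders:

```shell
# Sketch only: EIGEN_HAS_CXX11_ATOMIC=0 keeps Eigen's Parallelizer.h from
# pulling in <atomic>, whose std::__atomic_base methods pgc++ cannot turn
# into implicit acc routines.
EIGEN_ACC_FLAGS="-DEIGEN_HAS_CXX11_ATOMIC=0 -DEIGEN_DONT_VECTORIZE=1"
# Placeholder invocation; real builds go through nrnivmodl-core:
# pgc++ -O2 -acc -Minfo=acc --c++14 ${EIGEN_ACC_FLAGS} -c hcn2.cpp -o hcn2.o
echo "${EIGEN_ACC_FLAGS}"
```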

Following those, there was an issue coming from the llvm-based pgc++ compiler, so we then tried the nollvm backend:

Eigen::EigenBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<double, double>, const Eigen::Transpose<const Eigen::Block<const Eigen::Block<Eigen::Block<Eigen::Ref<Eigen::Matrix<double, (int)4, (int)4, (int)0, (int)4, (int)4>, (int)0, Eigen::OuterStride<(int)-1>>, (int)4, (int)1, (bool)1>, (int)-1, (int)1, (bool)0>, (int)1, (int)1, (bool)0>>, const Eigen::Block<const Eigen::Block<Eigen::Block<Eigen::Ref<Eigen::Matrix<double, (int)4, (int)4, (int)0, (int)4, (int)4>, (int)0, Eigen::OuterStride<(int)-1>>, (int)1, (int)4, (bool)0>, (int)1, (int)-1, (bool)0>, (int)1, (int)1, (bool)0>>>::cols() const:
     32, include "LU"
          11, include "Core"
              240, include "EigenBase.h"
                    62, Generating implicit acc routine seq
                        Generating acc routine seq
                        Generating Tesla code
              259, include "NoAlias.h"
              261, include "Matrix.h"
              265, include "CwiseUnaryOp.h"
              271, include "Stride.h"
              273, include "Map.h"
              275, include "Block.h"
              279, include "Transpose.h"
              283, include "Redux.h"
              289, include "Solve.h"
              291, include "SolverBase.h"
              293, include "Transpositions.h"
pgc++-Fatal-/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/compilers/2020-02-01/linux-rhel7-x86_64/gcc-4.8.5/pgi-19.4-hdirysdrvd/linux86-64-llvm/19.4/bin/pggpp2-llvm TERMINATED by signal 11
Arguments to /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/compilers/2020-02-01/linux-rhel7-x86_64/gcc-4.8.5/pgi-19.4-hdirysdrvd/linux86-64-llvm/19.4/bin/pggpp2-llvm
/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/compilers/2020-02-01/linux-rhel7-x86_64/gcc-4.8.5/pgi-19.4-hdirysdrvd/linux86-64-llvm/19.4/bin/pggpp2-llvm x86_64/core/mod2c/hcn2.cpp -opt 2 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 25952256 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 39 0x80 -x 59 4 -x 129 2 -tp skylake -x 120 0x1000 -astype 0 -x 121 1 -fn x86_64/core/mod2c/hcn2.cpp -il /gpfs/bbp.cscs.ch/ssd/slurmTmpFS/magkanar/773510/pgc++uoOV8vHgaqXY.il/hcn2.il -inlib /gpfs/bbp.cscs.ch/ssd/slurmTmpFS/magkanar/773510/pgc++0oOVCczrYGzB.ext -insize 200 -x 221 25000 -x 222 5 -x 115 10 -x 14 32 -x 117 0x200 -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=80300 -x 70 0x40000000 -x 183 4 -x 121 0x800 -x 6 0x20000 -autoinl 10 -x 168 400 -x 174 128000 -x 14 0x200000 -x 14 0x400000 -x 249 70 -x 120 0x200000 -x 70 0x40000000 -x 164 0x800000 -x 39 4 -x 68 0x1 -accel tesla -accel host -x 186 0x80000 -x 180 0x4000400 -x 163 0x1 -cudaver 10010 -x 176 0x100 -cudacap 70 -x 121 0xc00 -x 194 0x40000 -x 186 0x80 -x 189 0x8000 -y 163 0xc0000000 -x 189 0x10 -y 189 0x4000000 -cudaroot /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/compilers/2020-02-01/linux-rhel7-x86_64/gcc-4.8.5/pgi-19.4-hdirysdrvd/linux86-64-llvm/2019/cuda/10.1 -x 9 1 -x 72 0x1 -x 136 0x11 -x 37 0x481000 -mp -x 69 0x200 -x 69 0x400 -x 69 2 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -x 62 8 -gnuvsn 80300 -x 69 0x200 -x 123 0x400 -cmdline '+pgc++ x86_64/core/mod2c/hcn2.cpp -O2 -Mvect=simd -acc --diag_suppress 177 -mp --c++14 -Minline=size:200,levels:10 -Minfo=acc -DEIGEN_HAS_CXX11_ATOMIC=0 -DPG_ACC_BUGS -DCUDA_PROFILING -DCORENEURON_BUILD -DHAVE_MALLOC_H -DSWAP_ENDIAN_DISABLE_ASM -DEIGEN_DONT_VECTORIZE=1 -DNRNMPI=1 -DLAYOUT=0 -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING 
-I/gpfs/bbp.cscs.ch/project/proj16/magkanar/GPU_EIGEN/CoreNeuron/build_gpu_sympy/install/include -I/gpfs/bbp.cscs.ch/project/proj16/magkanar/GPU_EIGEN/CoreNeuron/build_gpu_sympy/install/include/coreneuron/utils/randoms -I/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/include -fPIC -c -o x86_64/core/build/hcn2.o' -asm /gpfs/bbp.cscs.ch/ssd/slurmTmpFS/magkanar/773510/pgc++uoOV8WmKrzY7.ll
make: *** [x86_64/core/build/hcn2.o] Error 127

The final issue we came across was the following:

PGCC-W-0155-External and Static variables are not supported in acc routine - _ZTVSt9exception (x86_64/core/mod2c/hcn2.cpp: 336)
void Eigen::internal::gemv_dense_selector<(int)2, (int)1, (bool)1>::run<Eigen::Transpose<const Eigen::Ref<Eigen::Matrix<double, (int)-1, (int)-1, (int)0, (int)-1, (int)-1>, (int)0, Eigen::OuterStride<(int)-1>>>, Eigen::Transpose<const Eigen::Block<const Eigen::Ref<Eigen::Matrix<double, (int)-1, (int)-1, (int)0, (int)-1, (int)-1>, (int)0, Eigen::OuterStride<(int)-1>>, (int)1, (int)-1, (bool)0>>, Eigen::Transpose<Eigen::Block<Eigen::Ref<Eigen::Matrix<double, (int)-1, (int)-1, (int)0, (int)-1, (int)-1>, (int)0, Eigen::OuterStride<(int)-1>>, (int)1, (int)-1, (bool)0>>>(const T1 &, const T2 &, T3 &, const T3::Scalar &):
     32, include "LU"
          11, include "Core"
              288, include "GeneralProduct.h"
                   309, Generating implicit acc routine seq
              292, include "PermutationMatrix.h"
              294, include "TriangularMatrix.h"
              317, include "VectorwiseOp.h"
Unimplemented opcode: 0
PGCC-F-0000-Internal compiler error. Unimplemented opcode.       4  (x86_64/core/mod2c/hcn2.cpp: 46)
PGCC/x86 Linux 19.10-0: compilation aborted
make: *** [x86_64/core/build/hcn2.o] Error 2

No solution has been found for this one yet. To reproduce all of the issues on a GPU node:

```shell
module load git
git clone https://github.com/BlueBrain/CoreNeuron.git
cd CoreNeuron
mkdir build_gpu_sympy
cd build_gpu_sympy
module load nvhpc cuda hpe-mpi boost python-dev flex bison cmake
cmake .. -DCMAKE_INSTALL_PREFIX=./install -DCORENRN_ENABLE_NMODL=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_NMODL_FLAGS="sympy --analytic" -DNRN_ENABLE_CORENEURON=ON -DCMAKE_BUILD_TYPE=Release
make -j36
cd HCN2k
mkdir mod_eigen
cp hcn2.mod mod_eigen
<install-dir>/bin/nrnivmodl -coreneuron mod_eigen
```

The CoreNeuron and NMODL master branches were used.

iomaganaris commented 4 years ago

Following the incompatibilities of Eigen with OpenACC, I started investigating whether Eigen can be called from CUDA kernels. For this I am using a simple example of the Eigen::PartialPivLU solver created by @cattabiani, on top of which I added a CUDA kernel that runs the same solver. My tries are currently WIP here. During development I faced 4 issues:

  1. Compilation issues when adding the Eigen::PartialPivLU solver to the CUDA kernel. To get the code to compile I needed to make the following changes to the Eigen source code:
    
```diff
diff --git a/Eigen/src/Core/SolverBase.h b/Eigen/src/Core/SolverBase.h
index 501461042..e7d5ca5a3 100644
--- a/Eigen/src/Core/SolverBase.h
+++ b/Eigen/src/Core/SolverBase.h
@@ -94,7 +94,7 @@ class SolverBase : public EigenBase<Derived>
     SolverBase()
     {}

@@ -593,7 +593,7 @@ struct Assignment<DstXprType, Inverse<PartialPivLU >, internal::assi
```

```
1 1 0 0
0 1 1 0
0 0 1 1
0 0 0 1

Random vector:

1 3 5 3

Solution (x) of M*x = v:

0 1 2 3

Device Solution (x) of M*x = v:

0x7fff7a000400
```

I tried to debug this with `ddt`, `cuda-gdb` and `cuda-memcheck` and I get the following with `cuda-gdb`:

```
Starting program: /gpfs/bbp.cscs.ch/project/proj16/magkanar/GPU_EIGEN/testEigen/build/testEigenGPU
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffcffff700 (LWP 244304)]
Size of the matrix? 4
v_device data: 1 3 5 3
[New Thread 0x7fffbdb06700 (LWP 244306)]
v in device 0.000000 0.000000 0.000000 0.000000

Thread 1 "testEigenGPU" received signal SIGTRAP, Trace/breakpoint trap.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000000000db4900 in runPartialPivLuGPU(double, double, double*, int)<<<(1,1,1),(1,1,1)>>> ()
```

and `cuda-memcheck`:

```
bash-4.2$ cuda-memcheck ./testEigenGPU
========= CUDA-MEMCHECK
Size of the matrix? 4
v_device data: 1 3 5 3
========= Unknown Error

v in device 0.000000 0.000000 0.000000 0.000000
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/lib64/libcuda.so.1 [0x3b9803]
========= Host Frame:./testEigenGPU [0x54d16]
========= Host Frame:./testEigenGPU [0x4caf]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22495]
========= Host Frame:./testEigenGPU [0x4489]

Error with cudaDeviceSync: unspecified launch failure
```


By googling the errors I found that they probably come from a segmentation fault inside the `Eigen::PartialPivLU` solver.
4. A number of `warning: calling a __host__ function from a __host__ __device__ function is not allowed` messages during compilation; I don't know whether the offending functions are actually used by the solver and are the root cause of all the errors.

`CUDA` used: `10.1.243`
`GCC` used: `9.3.0`

TODO:
~~1. Check why the matrices are not copied correctly into the device~~
2. Try to debug the errors

cc: @pramodk @ohm314