ECP-copa / CabanaMD

Molecular dynamics proxy application based on Cabana

Unable to run NNP example #77

Closed singraber closed 3 years ago

singraber commented 3 years ago

I am trying to run the NNP example in input/in.nnp but after the symmetry function setup is completed I get the following error in the SETUP: SYMMETRY FUNCTION GROUPS section:

terminate called after throwing an instance of 'std::runtime_error'
  what():  View bounds error of view AngularCounter ( 1 < 1 )
Traceback functionality not available
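For readers unfamiliar with this kind of message: with bounds checking enabled, it most likely reads as "index < extent", so `( 1 < 1 )` would mean element 1 was requested from a view that holds only one element. The following is a minimal host-only sketch of that check. It is a toy stand-in, not the Kokkos implementation; `CheckedView`, `accessMessage` and `selfCheck` are hypothetical names, and the message layout is copied from the output above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a bounds-checked 1D view. NOT the Kokkos implementation;
// the class name and the message layout are assumptions for illustration.
class CheckedView
{
  public:
    CheckedView( std::string label, std::size_t extent )
        : _label( std::move( label ) )
        , _data( extent, 0 )
    {
    }

    // Throws if the index is not strictly smaller than the extent.
    int &operator()( std::size_t i )
    {
        if ( i >= _data.size() )
        {
            std::ostringstream msg;
            msg << "View bounds error of view " << _label << " ( " << i
                << " < " << _data.size() << " )";
            throw std::runtime_error( msg.str() );
        }
        return _data[i];
    }

  private:
    std::string _label;
    std::vector<int> _data;
};

// Hypothetical helper: returns the error text for an access, or "" if valid.
std::string accessMessage( std::size_t extent, std::size_t index )
{
    CheckedView counter( "AngularCounter", extent );
    try
    {
        counter( index ) = 1;
    }
    catch ( const std::runtime_error &e )
    {
        return e.what();
    }
    return "";
}

// Index 0 is valid for extent 1; index 1 reproduces the reported text.
void selfCheck()
{
    assert( accessMessage( 1, 0 ).empty() );
    assert( accessMessage( 1, 1 ) ==
            "View bounds error of view AngularCounter ( 1 < 1 )" );
}
```

Without such a check (for example in a Release build without the debug flags), the same access would silently read or write past the end of the allocation, which may explain why a bug like this can go unnoticed.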

I am starting CabanaMD with the following command:

 ~/local/src/openmpi/4.0.4/build/bin/mpiexec -n 1  build/bin/cbnMD -il input/in.nnp --device-type SERIAL

The error occurs with all three device types: SERIAL, OPENMP, and CUDA.

When I run with gdb and look at the backtrace I find:

#6  0x0000555557159719 in Kokkos::Impl::throw_runtime_exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#7  0x0000555556d48c19 in Kokkos::Impl::view_verify_operator_bounds<Kokkos::HostSpace, Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::LayoutRight, Kokkos::HostSpace>, void>, int> (tracker=..., map=...)
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/impl/Kokkos_ViewMapping.hpp:3813
#8  0x0000555556bd7b53 in Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::HostSpace>::operator()<int> (i0=<optimized out>, this=<optimized out>)
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/Kokkos_View.hpp:1241
#9  nnpCbn::Element::setupSymmetryFunctionGroups<Kokkos::View<double** [15], Kokkos::LayoutRight, Kokkos::HostSpace>, Kokkos::View<int***, Kokkos::LayoutRight, Kokkos::HostSpace>, Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::HostSpace> > (this=0x55555a3a4cf0, SF=..., SFGmemberlist=..., attype=0, 
    h_numSFperElem=..., h_numSFGperElem=..., maxSFperElem=27)
    at /home/andi/local/src/CabanaMD/master/src/force_types/nnp_element_impl.h:375
#10 0x00005555563ec769 in nnpCbn::Mode<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::setupSymmetryFunctionGroups (this=0x55555b920de0)
    at /home/andi/local/src/CabanaMD/master/src/force_types/nnp_mode_impl.h:615
#11 0x000055555633314b in ForceNNP<System<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, AoSoA6>, System_NNP<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, AoSoA3>, NeighborVerlet<System<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D>, Cabana::SerialOpTag, Cabana::SerialOpTag>::init_coeff (this=0x55555c21f950, 
    args=std::vector of length 1, capacity 1 = {...})
    at /home/andi/local/src/CabanaMD/master/src/force_types/force_nnp_cabana_neigh_impl.h:59
#12 0x0000555555f0c0aa in CbnMD<System<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, AoSoA6>, NeighborVerlet<System<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D> >::init (
    this=0x55555b9bb680, commandline=...)
    at /home/andi/local/src/CabanaMD/master/src/cabanamd_impl.h:178

which brings me here:

https://github.com/ECP-copa/CabanaMD/blob/562600e9cbd2c8ee2ecfb34fe70630cebfda5e97/src/force_types/nnp_element_impl.h#L375

and then descends into Kokkos. Do you have any idea why this error happens and how I can resolve it?

I used the following setup to compile Kokkos, Cabana and CabanaMD:

My system:

Kokkos (version 3.1.01) build flags:

In the nvcc_wrapper script I set default_arch="sm_61".

 -DCMAKE_CXX_COMPILER=${KOKKOS_SRC_DIR}/bin/nvcc_wrapper \
 -DCMAKE_INSTALL_PREFIX=${KOKKOS_SRC_DIR}/build/install \
 -DKokkos_CUDA_DIR=/usr/local/cuda-11.0/ \
 -DKokkos_ENABLE_SERIAL=On \
 -DKokkos_ENABLE_OPENMP=On \
 -DKokkos_ENABLE_CUDA=On \
 -DKokkos_ENABLE_CUDA_LAMBDA=On \
 -DKokkos_ENABLE_CUDA_UVM=On \
 -DKokkos_ARCH_PASCAL61=On \
 -DKokkos_ENABLE_HWLOC=On \
 -DKokkos_ENABLE_TESTS=On \
 -DKokkos_ENABLE_DEBUG=On \
 -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=On \

Cabana (66c94f6) build flags:

 -DCMAKE_BUILD_TYPE="Debug" \
 -DCMAKE_PREFIX_PATH="${KOKKOS_INSTALL_DIR};${HOME}/local/src/openmpi/4.0.4/build/" \
 -DCMAKE_INSTALL_PREFIX=${CABANA_INSTALL_DIR} \
 -DCMAKE_CXX_COMPILER=${KOKKOS_SRC_DIR}/bin/nvcc_wrapper \
 -DMPI_CXX_COMPILER=${HOME}/local/src/openmpi/4.0.4/build/bin/mpic++ \
 -DCabana_REQUIRE_CUDA=On \
 -DCabana_ENABLE_MPI=On \
 -DCabana_ENABLE_EXAMPLES=On \
 -DCabana_ENABLE_TESTING=On \

CabanaMD (562600e) build flags:

 -DCMAKE_BUILD_TYPE="Debug" \
 -DCMAKE_CXX_COMPILER=${KOKKOS_DIR}/bin/nvcc_wrapper \
 -DCMAKE_PREFIX_PATH="${CABANA_DIR};${HOME}/local/src/openmpi/4.0.4/build/" \
 -DCMAKE_INSTALL_PREFIX=${CABANAMD_INSTALL_DIR} \
 -DMPI_CXX_COMPILER=${HOME}/local/src/openmpi/4.0.4/build/bin/mpic++ \
 -DCabana_ENABLE_MPI=On \
 -DCabanaMD_VECTORLENGTH=32 \
 -DN2P2_DIR=${HOME}/local/src/n2p2-singraber/ \
 -DCabanaMD_ENABLE_NNP=On \
 -DCabanaMD_MAXSYMMFUNC_NNP=30 \
 -DCabanaMD_VECTORLENGTH_NNP=1 \
 -DCabanaMD_ENABLE_TESTING=ON \

There is also an additional issue with the CabanaMD tests, which may or may not be related:

The Kokkos and Cabana tests pass without any errors, but when I run ctest -VV in the CabanaMD build directory I get the same error for both CUDA-related tests (Integrator_test_CUDA and Neighbor_test_CUDA):

[ RUN      ] cuda.reversibility_test
Kokkos::View ERROR: attempt to access inaccessible memory space
Thread 1 "Integrator_test" received signal SIGABRT, Aborted.

Running the tests manually and backtracing with gdb shows:

#3  0x000055555556caa0 in Kokkos::abort (
    message=0x55555567fe30 "Kokkos::View ERROR: attempt to access inaccessible memory space")
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/impl/Kokkos_Error.hpp:175
#4  0x0000555555576ee7 in Kokkos::View<double*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::verify_space<Kokkos::HostSpace, false>::check ()
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/Kokkos_View.hpp:882
#5  Kokkos::View<double*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::operator()<int> (i0=<optimized out>, this=0x7fffffffc730)
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/Kokkos_View.hpp:1241
#6  Test::createParticles<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, AoSoA6> > (num_particle=1000, num_ghost=200, box_min=-12.295999999999999, 
    box_max=10.904)
    at /home/andi/local/src/CabanaMD/master/unit_test/tstIntegrator.hpp:38
#7  0x000055555556e6d2 in Test::testIntegratorReversibility<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, AoSoA6> > (steps=100)
    at /home/andi/local/src/CabanaMD/master/unit_test/tstIntegrator.hpp:91

and

#3  0x000055555557190b in Kokkos::abort (
    message=0x5555556c2ca8 "Kokkos::View ERROR: attempt to access inaccessible memory space")
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/impl/Kokkos_Error.hpp:175
#4  0x000055555557bc05 in Kokkos::View<double*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::verify_space<Kokkos::HostSpace, false>::check ()
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/Kokkos_View.hpp:882
#5  Kokkos::View<double*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::operator()<int> (i0=<optimized out>, this=0x7fffffffc730)
    at /home/andi/local/src/kokkos/3.1.01/build/install/include/Kokkos_View.hpp:1241
#6  Test::createAtoms<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, AoSoA6> > (num_atom=1000, num_ghost=200, box_min=-12.295999999999999, 
    box_max=10.904)
    at /home/andi/local/src/CabanaMD/master/unit_test/tstNeighbor.hpp:255
#7  0x0000555555573790 in Test::testNeighborListPartialRange<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, AoSoA6>, NeighborVerlet<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D> > (half_neigh=false) at /home/andi/local/src/CabanaMD/master/unit_test/tstNeighbor.hpp:303

for Integrator_test_CUDA and Neighbor_test_CUDA, respectively.

Sorry for this overly long post. I am out of ideas for now, so any help is greatly appreciated!

Thank you!!

streeve commented 3 years ago

Thanks for posting this! And sorry that even running the default case for NNP fails - I'll take a look at this right now.

streeve commented 3 years ago

I can reproduce the error with Debug and will link to the PR to fix when I get it figured out.

A Release build (without the additional Kokkos debug flags) runs, but this is another reminder to get GPU CI running sooner rather than later.

singraber commented 3 years ago

Thanks for looking at this so quickly! I just rebuilt everything with the Release target and with the following flags removed:

Kokkos:

# -DKokkos_ENABLE_TESTS=On \
# -DKokkos_ENABLE_DEBUG=On \
# -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=On \

Cabana:

# -DCabana_ENABLE_EXAMPLES=On \
# -DCabana_ENABLE_TESTING=On \

CabanaMD:

# -DCabanaMD_ENABLE_TESTING=ON \

The result is that now the NNP example works fine for the device types SERIAL and OPENMP. Unfortunately, CUDA still does not work, giving this error:

  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/andi/local/src/kokkos/3.1.01/core/src/Cuda/Kokkos_Cuda_Instance.cpp:143

gdb no longer gives line numbers, only:

#7  0x00005555569390c4 in Kokkos::Impl::cuda_internal_error_throw(cudaError, char const*, char const*, int) ()
#8  0x00005555569380d0 in Kokkos::Impl::cuda_internal_safe_call(cudaError, char const*, char const*, int) ()
#9  0x0000555556938f8c in Kokkos::Impl::cuda_device_synchronize() ()
#10 0x000055555693b23f in Kokkos::Cuda::impl_static_fence() ()
#11 0x000055555692aa3a in Kokkos::Impl::(anonymous namespace)::fence_internal()
    ()
#12 0x000055555692d17c in Kokkos::fence() ()
#13 0x0000555555c5c621 in void nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag) ()
#14 0x0000555556135f13 in ForceNNP<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, System_NNP<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA3>, NeighborVerlet<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D>, Cabana::SerialOpTag, Cabana::SerialOpTag>::compute(System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>*, NeighborVerlet<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D>*) ()
#15 0x00005555560cad86 in CbnMD<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, NeighborVerlet<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D> >::init(InputCL) ()

Maybe this is connected to the CUDA-related errors I saw when running the CabanaMD tests. Any more ideas on how to get the CUDA device working?

singraber commented 3 years ago

Just a clarification:

The cudaDeviceSynchronize() error occurs after the SETUP: NEURAL NETWORK WEIGHTS section is complete.

streeve commented 3 years ago

The new PR should fix at least the first issue, but I was not able to reproduce the cudaDeviceSynchronize error you mentioned next. I'm looking for another cluster to test on.

singraber commented 3 years ago

Thanks for the PRs. I have just tried a combination of #79 and #80 and can report that the problems with the tests Integrator_test_CUDA and Neighbor_test_CUDA have vanished. All CabanaMD tests now pass on my system.

Also, the SERIAL and OPENMP devices work for the NNP example, even with the Debug target and all extra debugging flags turned on.

Unfortunately, the NNP example still fails for CUDA (again with the cudaDeviceSynchronize() error). Luckily, with the debugging flags on I could now investigate a bit further by running with cuda-gdb, which shows this error message:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x55555ac13c18 (nnp_mode.h:481)

Thread 1 "cbnMD" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 227, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
void Kokkos::Impl::cuda_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<Cabana::neighbor_parallel_for<nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag)::{lambda(int, int)#1}, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Kokkos::Cuda>(Kokkos::RangePolicy<Kokkos::Cuda> const&, nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag)::{lambda(int, int)#1} const&, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag> const&, Cabana::FirstNeighborsTag, Cabana::SerialOpTag, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(unsigned int)#1}, nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, 
Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag)::{lambda(int, int)#1}<Kokkos::Cuda>, Kokkos::Cuda> >()<<<(4,1,1),(1,32,1)>>> () at /home/andi/local/src/CabanaMD/master/src/force_types/nnp_mode.h:481
481     double rci = rc * cutoffAlpha;

which points here:

https://github.com/ECP-copa/CabanaMD/blob/b7db71f8d3fb4b31b0e27b6717165920fdfb9608/src/force_types/nnp_mode.h#L481

That looks harmless to me, but I have little experience in CUDA programming. Is there something suspicious in the code?

Thanks for all your help so far!!

streeve commented 3 years ago

Thanks for going back and forth on this!

I was pointed to the clang-tidy in https://github.com/kokkos/llvm-project, which very helpfully finds all cases of implicit class member variable capture in the parallel kernels. These include the case you pointed out with cutoffAlpha, as well as a handful of others.
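For readers unfamiliar with this pitfall, here is a minimal host-only sketch of implicit member capture. It is not CabanaMD code: the `Mode` struct, its two methods and `demo` are hypothetical, and only the member name `cutoffAlpha` is taken from the trace above. Inside a member function, a `[=]` lambda that uses a member copies the `this` pointer rather than the member's value. On the host that is merely surprising; in a CUDA kernel the copied `this` would typically point to host memory, which is one plausible route to an illegal address.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a class that launches kernels; not CabanaMD code.
struct Mode
{
    double cutoffAlpha = 0.5;

    // [=] captures `this` implicitly, so the lambda reads the member through
    // the object pointer every time it is called.
    std::function<double( double )> implicitCapture()
    {
        return [=]( double rc ) { return rc * cutoffAlpha; };
    }

    // Copying the member into a local first makes the lambda hold the value
    // itself, independent of the object.
    std::function<double( double )> localCopyCapture()
    {
        const double alpha = cutoffAlpha;
        return [=]( double rc ) { return rc * alpha; };
    }
};

// Shows that the first lambda follows later changes to the member (it goes
// through `this`), while the second keeps the value it copied.
void demo()
{
    Mode m;
    auto viaThis = m.implicitCapture();
    auto viaCopy = m.localCopyCapture();
    m.cutoffAlpha = 2.0;
    assert( viaThis( 3.0 ) == 6.0 );
    assert( viaCopy( 3.0 ) == 1.5 );
}
```

The local-copy pattern is one common way to avoid the problem; whether the actual fix uses it for every kernel is not stated in this thread.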

Let me know if you hit anything else, and I will keep working to get more testing in place.

singraber commented 3 years ago

That is great, thanks for investigating this. I have now tested the latest changes from #80 and can happily report that everything works. I can run the NNP example on all three devices (SERIAL, OPENMP, and CUDA), each with either the Release or the Debug build.

Thanks for all your efforts!

streeve commented 3 years ago

Great! I will merge here and then push to CompPhysVienna/n2p2#49 as well.