Closed singraber closed 3 years ago
Thanks for posting this! And sorry that even running the default case for NNP fails - I'll take a look at this right now.
I can reproduce the error with Debug
and will link to the PR to fix when I get it figured out.
Using Release
(and without the additional Kokkos debug flags) runs, but this is another reminder to get GPU CI running sooner rather than later.
Thanks for looking so quickly at this! I just rebuilt everything with Release
target and with removed flags:
Kokkos:
# -DKokkos_ENABLE_TESTS=On \
# -DKokkos_ENABLE_DEBUG=On \
# -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=On \
Cabana:
# -DCabana_ENABLE_EXAMPLES=On \
# -DCabana_ENABLE_TESTING=On \
CabanaMD:
# -DCabanaMD_ENABLE_TESTING=ON \
The result is that now the NNP example works fine for the device types SERIAL
and OPENMP
. Unfortunately, CUDA
still does not work, giving this error:
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/andi/local/src/kokkos/3.1.01/core/src/Cuda/Kokkos_Cuda_Instance.cpp:143
gdb
gives now no more line numbers but only:
#7 0x00005555569390c4 in Kokkos::Impl::cuda_internal_error_throw(cudaError, char const*, char const*, int) ()
#8 0x00005555569380d0 in Kokkos::Impl::cuda_internal_safe_call(cudaError, char const*, char const*, int) ()
#9 0x0000555556938f8c in Kokkos::Impl::cuda_device_synchronize() ()
#10 0x000055555693b23f in Kokkos::Cuda::impl_static_fence() ()
#11 0x000055555692aa3a in Kokkos::Impl::(anonymous namespace)::fence_internal()
()
#12 0x000055555692d17c in Kokkos::fence() ()
#13 0x0000555555c5c621 in void nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag) ()
#14 0x0000555556135f13 in ForceNNP<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, System_NNP<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA3>, NeighborVerlet<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D>, Cabana::SerialOpTag, Cabana::SerialOpTag>::compute(System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>*, NeighborVerlet<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D>*) ()
#15 0x00005555560cad86 in CbnMD<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, NeighborVerlet<System<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, AoSoA6>, Cabana::FullNeighborTag, Cabana::VerletLayout2D> >::init(InputCL) ()
Maybe this is now connected to the CUDA-related errors I saw when running the CabanaMD tests. Any more ideas how to get the CUDA
device working?
Just a clarification:
The cudaDeviceSynchronize() error
occurs after the SETUP: NEURAL NETWORK WEIGHTS
section is complete.
The new PR should fix at least the first issue, but I was not able to recreate the cudaDeviceSynchronize
error you mentioned next. I'm looking for another cluster to test on
Thanks for the PRs, I have just tried a combination of #79 and #80 and can report that the problems with the tests Integrator_test_CUDA
and Neighbor_test_CUDA
vanish. All CabanaMD tests now pass on my system.
Also, the SERIAL
and OPENMP
devices work for the NNP example, even with the Debug
target and all extra debugging flags turned on.
Unfortunately, the NNP example still fails for CUDA
(again the cudaDeviceSynchronize()
error) but luckily with the debugging flags on I could now investigate a bit further by running with cuda-gdb
which shows this error message:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x55555ac13c18 (nnp_mode.h:481)
Thread 1 "cbnMD" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 227, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
void Kokkos::Impl::cuda_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<Cabana::neighbor_parallel_for<nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Caban--Type <RET> for more, q to quit, c to continue without paging--c
a::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag)::{lambda(int, int)#1}, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Kokkos::Cuda>(Kokkos::RangePolicy<Kokkos::Cuda> const&, nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag)::{lambda(int, int)#1} const&, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag> const&, Cabana::FirstNeighborsTag, Cabana::SerialOpTag, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(unsigned int)#1}, nnpCbn::Mode<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::calculateSymmetryFunctionGroups<Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, Cabana::SerialOpTag, Cabana::SerialOpTag>(Cabana::Slice<double [3], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 96>, Cabana::Slice<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::DefaultAccessMemory, 32, 32>, Cabana::Slice<double [30], Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Cabana::AtomicAccessMemory, 1, 30>, Cabana::VerletList<Kokkos::CudaUVMSpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D, Cabana::TeamVectorOpTag>, int, Cabana::SerialOpTag, Cabana::SerialOpTag)::{lambda(int, int)#1}<Kokkos::Cuda>, Kokkos::Cuda> >()<<<(4,1,1),(1,32,1)>>> () at /home/andi/local/src/CabanaMD/master/src/force_types/nnp_mode.h:481
481 double rci = rc * cutoffAlpha;
which points here:
That does look very harmless to me.. but I have little experience in CUDA programming. Is there something suspicious in the code?
Thanks for all your help so far!!
Thanks for going back and forth on this!
I was pointed to the https://github.com/kokkos/llvm-project clang-tidy
to very helpfully find all the cases of implicit class member variable capture in the parallel kernels. This includes the case you pointed out with cutoffAlpha
, as well as a handful of others.
Let me know if you hit anything else and I will keep working to get more testing.
That is great, thanks for investigating this.. I have tested now the latest changes from #80 and can happily report that everything works now. I can run the NNP example on all three devices SERIAL
, OPENMP
and CUDA
, each in combination with the Release
or the Debug
compilation route.
Thanks for all your efforts!
Great! I will merge here and then push to CompPhysVienna/n2p2#49 as well
I am trying to run the NNP example in
input/in.nnp
but after the symmetry function setup is completed I get the following error in theSETUP: SYMMETRY FUNCTION GROUPS
section:I am starting CabanaMD with the following command:
The error occurs with any of the three device targets:
SERIAL
,OPENMP
andCUDA
When I run with
gdb
and look at the backtrace I find:which brings me here:
https://github.com/ECP-copa/CabanaMD/blob/562600e9cbd2c8ee2ecfb34fe70630cebfda5e97/src/force_types/nnp_element_impl.h#L375
and then descends into Kokkos... do you have any idea why this error happens and how I can resolve it?
I used the following setup to compile Kokkos, Cabana and CabanaMD:
My system:
Kokkos (version 3.1.01) build flags:
In the
nvcc_wrapper
script I setdefault_arch="sm_61"
.Cabana (66c94f6) build flags:
CabanaMD (562600e) build flags:
There is also an additional issue with the tests of CabanaMD which may be unrelated but who knows...:
The tests of Kokkos and Cabana pass without any errors but when I run
ctest -VV
in the CabanaMD build directory I get the same error for both CUDA-related tests (Integrator_test_CUDA
andNeighbor_test_CUDA
):Running the tests manually and backtracing with
gdb
shows:and
for
Integrator_test_CUDA
andNeighbor_test_CUDA
, respectively.Sorry for this overly long post... I am out of ideas for now, any help is greatly appreciated!
Thank you!!