dealii / dealii

The development repository for the deal.II finite element library
https://www.dealii.org

Runtime error in step-64 #17869

Open YiminJin opened 1 day ago

YiminJin commented 1 day ago

I tried to run the latest version of step-64 on a GPU, and it failed with the following error:

Cycle 0
Number of active cells: 8
Number of degrees of freedom: 343
:0: : block: [0,0,0], thread: [0,108,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [0,0,0], thread: [0,111,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,103,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,112,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,115,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,83,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,86,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,11,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,20,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,23,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,43,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,56,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [2,0,0], thread: [0,59,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [0,0,0], thread: [0,95,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [0,0,0], thread: [0,31,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,11,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,28,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,31,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,64,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,67,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,87,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [1,0,0], thread: [0,95,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [0,0,0], thread: [0,39,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [0,0,0], thread: [0,47,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [0,0,0], thread: [0,60,0] Assertion raw_diagonal[i] > 0. failed.
:0: : block: [0,0,0], thread: [0,63,0] Assertion raw_diagonal[i] > 0. failed.
cudaStreamSynchronize(stream) error( cudaErrorAssert): device-side assert triggered /home/jinym/build/trilinos/14.2.0/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:144
Backtrace:
Kokkos::Impl::save_stacktrace() [0x7f285d6400c5]
Kokkos::Impl::traceback_callstack(std::ostream&) [0x7f285d63787a]
Kokkos::Impl::host_abort(char const) [0x7f285d6378ab]
Kokkos::Impl::cuda_internal_error_abort(cudaError, char const, char const, int) [0x7f285d647474]
Kokkos::Impl::cuda_stream_synchronize(CUstream_st, Kokkos::Impl::CudaInternal const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) [0x7f285d64803e]
Kokkos::View<double, Kokkos::CudaUVMSpace>::View<std::cxx11::basic_string<char, std::char_traits, std::allocator > >(Kokkos::Impl::ViewCtorProp<std::cxx11::basic_string<char, std::char_traits, std::allocator > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::cxx11::basic_string<char, std::char_traits, std::allocator > >::has_pointer, Kokkos::LayoutLeft>::type const&) [0x7f287089e5e8]
Kokkos::View<double*, Kokkos::CudaUVMSpace>::View<char [17]>(char const (&) [17], std::enable_if<Kokkos::Impl::is_view_label<char [17]>::value, unsigned long const>::type, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) [0x7f287087ca74]
dealii::MemorySpace::MemorySpaceData<double, dealii::MemorySpace::Default>::MemorySpaceData() [0x7f2870867808]
dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default>::Vector() [0x7f287084e6af]
dealii::PreconditionChebyshev<Step64::HelmholtzOperator<3, 3>, dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default>, dealii::DiagonalMatrix<dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default> > >::PreconditionChebyshev() [0x673ceb]
Step64::HelmholtzProblem<3, 3>::solve() [0x6698d4]
Step64::HelmholtzProblem<3, 3>::run() [0x65e782]
main [0x64491f]
[0x7f2846b8ae08]
libc_start_main [0x7f2846b8aecc]
_start [0x6446d5]
[GEOIST:22119] Process received signal
[GEOIST:22119] Signal: Aborted (6)
[GEOIST:22119] Signal code: (-6)
[GEOIST:22119] [ 0] /usr/lib/libc.so.6(+0x3d1d0) [0x7f2846ba21d0]
[GEOIST:22119] [ 1] /usr/lib/libc.so.6(+0x963f4) [0x7f2846bfb3f4]
[GEOIST:22119] [ 2] /usr/lib/libc.so.6(gsignal+0x20) [0x7f2846ba2120]
[GEOIST:22119] [ 3] /usr/lib/libc.so.6(abort+0xdf) [0x7f2846b894c3]
[GEOIST:22119] [ 4] /opt/trilinos/14.2.0-nvcc/lib/libkokkoscore.so.14(_ZN6Kokkos4Impl17human_memory_sizeB5cxx11Em+0x0) [0x7f285d6378b0]
[GEOIST:22119] [ 5] /opt/trilinos/14.2.0-nvcc/lib/libkokkoscore.so.14(_ZN6Kokkos4Impl25cuda_internal_error_abortE9cudaErrorPKcS3_i+0xe4) [0x7f285d647474]
[GEOIST:22119] [ 6] /opt/trilinos/14.2.0-nvcc/lib/libkokkoscore.so.14(_ZN6Kokkos4Impl23cuda_stream_synchronizeEP11CUstream_stPKNS0_12CudaInternalERKNSt7cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x17e) [0x7f285d64803e]
[GEOIST:22119] [ 7] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6Kokkos4ViewIPdJNS_12CudaUVMSpaceEEEC2IJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEERKNS_4Impl12ViewCtorPropIJDpT_EEERKNSt9enable_ifIXntsrSF_11has_pointerENS_10LayoutLeftEE4typeE+0x23e) [0x7f287089e5e8]
[GEOIST:22119] [ 8] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6Kokkos4ViewIPdJNS_12CudaUVMSpaceEEEC2IA17_cEERKT_NSt9enable_ifIXsrNS_4Impl13is_view_labelIS6_EE5valueEKmE4typeEmmmmmmm+0x9a) [0x7f287087ca74]
[GEOIST:22119] [ 9] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6dealii11MemorySpace15MemorySpaceDataIdNS0_7DefaultEEC1Ev+0x84) [0x7f2870867808]
[GEOIST:22119] [10] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6dealii13LinearAlgebra11distributed6VectorIdNS_11MemorySpace7DefaultEEC1Ev+0x75) [0x7f287084e6af]
[GEOIST:22119] [11] ./step-64(_ZN6dealii21PreconditionChebyshevIN6Step6417HelmholtzOperatorILi3ELi3EEENS_13LinearAlgebra11distributed6VectorIdNS_11MemorySpace7DefaultEEENS_14DiagonalMatrixIS9_EEEC1Ev+0x47) [0x673ceb]
[GEOIST:22119] [12] ./step-64(_ZN6Step6416HelmholtzProblemILi3ELi3EE5solveEv+0xfa) [0x6698d4]
[GEOIST:22119] [13] ./step-64(_ZN6Step6416HelmholtzProblemILi3ELi3EE3runEv+0x146) [0x65e782]
[GEOIST:22119] [14] ./step-64(main+0x74) [0x64491f]
[GEOIST:22119] [15] /usr/lib/libc.so.6(+0x25e08) [0x7f2846b8ae08]
[GEOIST:22119] [16] /usr/lib/libc.so.6(libc_start_main+0x8c) [0x7f2846b8aecc]
[GEOIST:22119] [17] ./step-64(_start+0x25) [0x6446d5]
[GEOIST:22119] End of error message
zsh: IOT instruction (core dumped) ./step-64

The error message shows that the error occurs in the constructor of PreconditionChebyshev. When the preconditioner is changed to PreconditionIdentity, the code works well.

I tried to trace the bug with gdb, and it led me to this place (Kokkos_View.hpp in the Kokkos directory):

// If allocating in CudaUVMSpace must fence before and after
// the allocation to protect against possible concurrent access
// on the CPU and the GPU.
// Fence using the trait's execution space (which will be Kokkos::Cuda)
// to avoid incomplete type errors from using Kokkos::Cuda directly.
if (std::is_same<Kokkos::CudaUVMSpace,
                 typename traits::device_type::memory_space>::value) {
/*the line that triggers the error*/  typename traits::device_type::memory_space::execution_space().fence(
      "Kokkos::View<...>::View: fence before allocating UVM");
}

The lines inside the if() are executed, which means that the code tries to allocate memory in CudaUVMSpace. I do not understand this, because the Kokkos version in my Trilinos library is 4.0.1 and the CMake option KOKKOS_ENABLE_CUDA_UVM is OFF (I have checked that in CMakeCache.txt). I also do not understand why the GPU memory allocation in PreconditionChebyshev fails, while GPU memory allocations elsewhere (such as in SolverCG) do not. The configurations of Trilinos and deal.II on my machine are as follows:

#!/bin/bash

cmake \
  -D CMAKE_CXX_COMPILER=mpicxx \
  -D CMAKE_C_COMPILER=mpicc \
  -D CMAKE_Fortran_COMPILER=mpifort \
  -D CMAKE_CXX_FLAGS="-g -lineinfo -Xcudafe \
  --diag_suppress=conversion_function_not_usable -Xcudafe \
  --diag_suppress=cc_clobber_ignored -Xcudafe \
  --diag_suppress=code_is_unreachable" \
  -D TPL_ENABLE_Boost=OFF \
  -D TPL_ENABLE_MPI=ON \
  -D TPL_ENABLE_CUDA=ON \
  -D Kokkos_ENABLE_CUDA=ON \
  -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
  -D Kokkos_ENABLE_CUDA_CONSTEXPR=ON \
  -D Kokkos_ENABLE_CUDA_UVM=OFF \
  -D Trilinos_ENABLE_Amesos=ON \
  -D Trilinos_ENABLE_Epetra=ON \
  -D Trilinos_ENABLE_EpetraExt=ON \
  -D Trilinos_ENABLE_Ifpack=ON \
  -D Trilinos_ENABLE_AztecOO=ON \
  -D Trilinos_ENABLE_Sacado=ON \
  -D Trilinos_ENABLE_Teuchos=ON \
  -D Trilinos_ENABLE_MueLu=ON \
  -D Trilinos_ENABLE_ML=ON \
  -D Trilinos_ENABLE_ROL=ON \
  -D Trilinos_ENABLE_Tpetra=ON \
  -D Trilinos_ENABLE_Zoltan=ON \
  -D Trilinos_ENABLE_Fortran=OFF \
  -D Trilinos_VERBOSE_CONFIGURE=OFF \
  -D BUILD_SHARED_LIBS=ON \
  -D CMAKE_VERBOSE_MAKEFILE=OFF \
  -D CMAKE_BUILD_TYPE=RELEASE \
  -D CMAKE_INSTALL_PREFIX=/opt/trilinos/14.2.0-nvcc \
  ..

cmake \
  -D CMAKE_CXX_COMPILER=/opt/trilinos/14.2.0-nvcc/bin/nvcc_wrapper \
  -D DEAL_II_WITH_MPI=ON \
  -D DEAL_II_MPI_WITH_DEVICE_SUPPORT=ON \
  -D DEAL_II_WITH_TBB=OFF \
  -D DEAL_II_WITH_LAPACK=ON \
  -D DEAL_II_WITH_P4EST=ON \
  -D P4EST_DIR=/opt/p4est/2.8 \
  -D DEAL_II_WITH_METIS=OFF \
  -D DEAL_II_WITH_KOKKOS=ON \
  -D DEAL_II_WITH_TRILINOS=ON \
  -D TRILINOS_DIR=/opt/trilinos/14.2.0-nvcc \
  -D DEAL_II_WITH_SUNDIALS=OFF \
  -D DEAL_II_WITH_HDF5=OFF \
  -D DEAL_II_WITH_GMSH=OFF \
  -D DEAL_II_WITH_VTK=OFF \
  -D DEAL_II_COMPONENT_EXAMPLES=OFF \
  -D CMAKE_INSTALL_PREFIX=/opt/dealii/9.7.0-pre-test \
  ..

Could you please help me? @masterleinad @tjhei

tjhei commented 23 hours ago

@kronbichler You saw the same error two days ago, right? This must have been a recent change.

Rombur commented 13 hours ago

> This must have been a recent change.

We've had this error in the nightly tests for a while, both for HIP and CUDA. The error comes from compute_diagonal() in step-64. That's the only place where we assert on raw_diagonal.

kronbichler commented 6 hours ago

Just to confirm: yes, this was the error I saw. I agree on where the assertion comes from; the question is where the wrong computation comes from, as there should not be a zero there. Investigating this is on my to-do list for the coming days, but if someone has an idea before then, that would be great.