Open YiminJin opened 1 day ago
@kronbichler You saw the same error two days ago, right? This must have been a recent change.
We've had this error in the nightly for a while both for HIP and CUDA. The error is from compute_diagonal() in Step-64. That's the only place we assert raw_diagonal.
Just to confirm: Yes, this was the error I saw. I agree on the place where this comes from, the question is where the wrong computation comes from, as there should not be a zero there. I have on my todolist to investigate this in the coming days, but if someone has an idea before that would be great.
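One way to narrow down where the zero comes from might be to copy the computed diagonal to the host and list every offending index, instead of aborting on the first device-side assertion. The following is only a minimal sketch: the helper name `find_nonpositive_entries` is hypothetical (not part of deal.II or Kokkos), and it assumes the diagonal has already been copied into a plain `std::vector<double>`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a deal.II function): return the indices of all
// entries that would violate a check of the form `raw_diagonal[i] > 0.`.
std::vector<std::size_t>
find_nonpositive_entries(const std::vector<double> &raw_diagonal)
{
  std::vector<std::size_t> bad_indices;
  for (std::size_t i = 0; i < raw_diagonal.size(); ++i)
    {
      // Written as a negated comparison so that NaN entries are reported too.
      if (!(raw_diagonal[i] > 0.))
        bad_indices.push_back(i);
    }
  return bad_indices;
}
```

Comparing the reported indices against the degrees of freedom on the boundary or on constrained entries could then show whether the zeros follow a pattern.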
I tried to run the latest version of step-64 on a GPU, and it ran into the following error:
Cycle 0
Number of active cells: 8
Number of degrees of freedom: 343
:0: : block: [0,0,0], thread: [0,108,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,111,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,103,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,112,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,115,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,83,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,86,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,11,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,20,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,23,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,43,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,56,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,59,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,95,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,31,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,11,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,28,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,31,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,64,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,67,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,87,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,95,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,39,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,47,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,60,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,63,0] Assertion `raw_diagonal[i] > 0.` failed.
cudaStreamSynchronize(stream) error( cudaErrorAssert): device-side assert triggered /home/jinym/build/trilinos/14.2.0/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:144
Backtrace:
Kokkos::Impl::save_stacktrace() [0x7f285d6400c5]
Kokkos::Impl::traceback_callstack(std::ostream&) [0x7f285d63787a]
Kokkos::Impl::host_abort(char const*) [0x7f285d6378ab]
Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int) [0x7f285d647474]
Kokkos::Impl::cuda_stream_synchronize(CUstream_st*, Kokkos::Impl::CudaInternal const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [0x7f285d64803e]
Kokkos::View<double*, Kokkos::CudaUVMSpace>::View<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::has_pointer, Kokkos::LayoutLeft>::type const&) [0x7f287089e5e8]
Kokkos::View<double*, Kokkos::CudaUVMSpace>::View<char [17]>(char const (&) [17], std::enable_if<Kokkos::Impl::is_view_label<char [17]>::value, unsigned long const>::type, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) [0x7f287087ca74]
dealii::MemorySpace::MemorySpaceData<double, dealii::MemorySpace::Default>::MemorySpaceData() [0x7f2870867808]
dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default>::Vector() [0x7f287084e6af]
dealii::PreconditionChebyshev<Step64::HelmholtzOperator<3, 3>, dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default>, dealii::DiagonalMatrix<dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default> > >::PreconditionChebyshev() [0x673ceb]
Step64::HelmholtzProblem<3, 3>::solve() [0x6698d4]
Step64::HelmholtzProblem<3, 3>::run() [0x65e782]
main [0x64491f]
[0x7f2846b8ae08]
__libc_start_main [0x7f2846b8aecc]
_start [0x6446d5]
[GEOIST:22119] *** Process received signal ***
[GEOIST:22119] Signal: Aborted (6)
[GEOIST:22119] Signal code: (-6)
[GEOIST:22119] [ 0] /usr/lib/libc.so.6(+0x3d1d0) [0x7f2846ba21d0]
[GEOIST:22119] [ 1] /usr/lib/libc.so.6(+0x963f4) [0x7f2846bfb3f4]
[GEOIST:22119] [ 2] /usr/lib/libc.so.6(gsignal+0x20) [0x7f2846ba2120]
[GEOIST:22119] [ 3] /usr/lib/libc.so.6(abort+0xdf) [0x7f2846b894c3]
[GEOIST:22119] [ 4] /opt/trilinos/14.2.0-nvcc/lib/libkokkoscore.so.14(_ZN6Kokkos4Impl17human_memory_sizeB5cxx11Em+0x0) [0x7f285d6378b0]
[GEOIST:22119] [ 5] /opt/trilinos/14.2.0-nvcc/lib/libkokkoscore.so.14(_ZN6Kokkos4Impl25cuda_internal_error_abortE9cudaErrorPKcS3_i+0xe4) [0x7f285d647474]
[GEOIST:22119] [ 6] /opt/trilinos/14.2.0-nvcc/lib/libkokkoscore.so.14(_ZN6Kokkos4Impl23cuda_stream_synchronizeEP11CUstream_stPKNS0_12CudaInternalERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x17e) [0x7f285d64803e]
[GEOIST:22119] [ 7] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6Kokkos4ViewIPdJNS_12CudaUVMSpaceEEEC2IJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEERKNS_4Impl12ViewCtorPropIJDpT_EEERKNSt9enable_ifIXntsrSF_11has_pointerENS_10LayoutLeftEE4typeE+0x23e) [0x7f287089e5e8]
[GEOIST:22119] [ 8] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6Kokkos4ViewIPdJNS_12CudaUVMSpaceEEEC2IA17_cEERKT_NSt9enable_ifIXsrNS_4Impl13is_view_labelIS6_EE5valueEKmE4typeEmmmmmmm+0x9a) [0x7f287087ca74]
[GEOIST:22119] [ 9] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6dealii11MemorySpace15MemorySpaceDataIdNS0_7DefaultEEC1Ev+0x84) [0x7f2870867808]
[GEOIST:22119] [10] /home/jinym/projects/github/dealii/build-test/lib/libdeal_II.g.so.9.7.0-pre(_ZN6dealii13LinearAlgebra11distributed6VectorIdNS_11MemorySpace7DefaultEEC1Ev+0x75) [0x7f287084e6af]
[GEOIST:22119] [11] ./step-64(_ZN6dealii21PreconditionChebyshevIN6Step6417HelmholtzOperatorILi3ELi3EEENS_13LinearAlgebra11distributed6VectorIdNS_11MemorySpace7DefaultEEENS_14DiagonalMatrixIS9_EEEC1Ev+0x47) [0x673ceb]
[GEOIST:22119] [12] ./step-64(_ZN6Step6416HelmholtzProblemILi3ELi3EE5solveEv+0xfa) [0x6698d4]
[GEOIST:22119] [13] ./step-64(_ZN6Step6416HelmholtzProblemILi3ELi3EE3runEv+0x146) [0x65e782]
[GEOIST:22119] [14] ./step-64(main+0x74) [0x64491f]
[GEOIST:22119] [15] /usr/lib/libc.so.6(+0x25e08) [0x7f2846b8ae08]
[GEOIST:22119] [16] /usr/lib/libc.so.6(__libc_start_main+0x8c) [0x7f2846b8aecc]
[GEOIST:22119] [17] ./step-64(_start+0x25) [0x6446d5]
[GEOIST:22119] *** End of error message ***
zsh: IOT instruction (core dumped) ./step-64

The error message shows that the error occurs in the constructor of
PreconditionChebyshev. When the preconditioner is changed to PreconditionIdentity, the code works well.

I tried to trace the bug with gdb, and it led me to this place (Kokkos_View.hpp in the Kokkos directory):
The lines inside the if() are executed, which means that the code tries to allocate memory from CudaUVMSpace. I do not understand this, because the Kokkos version in my Trilinos library is 4.0.1 and the CMake option KOKKOS_ENABLE_CUDA_UVM is OFF (I have checked that in CMakeCache.txt). I also do not understand why the GPU memory allocation in PreconditionChebyshev runs into an error, while the GPU memory allocation in other places (such as SolverCG) does not.

The configurations of Trilinos and deal.II on my machine are as follows:

#!/bin/bash
cmake \
  -D CMAKE_CXX_COMPILER=mpicxx \
  -D CMAKE_C_COMPILER=mpicc \
  -D CMAKE_Fortran_COMPILER=mpifort \
  -D CMAKE_CXX_FLAGS="-g -lineinfo -Xcudafe \
  --diag_suppress=conversion_function_not_usable -Xcudafe \
  --diag_suppress=cc_clobber_ignored -Xcudafe \
  --diag_suppress=code_is_unreachable" \
  -D TPL_ENABLE_Boost=OFF \
  -D TPL_ENABLE_MPI=ON \
  -D TPL_ENABLE_CUDA=ON \
  -D Kokkos_ENABLE_CUDA=ON \
  -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
  -D Kokkos_ENABLE_CUDA_CONSTEXPR=ON \
  -D Kokkos_ENABLE_CUDA_UVM=OFF \
  -D Trilinos_ENABLE_Amesos=ON \
  -D Trilinos_ENABLE_Epetra=ON \
  -D Trilinos_ENABLE_EpetraExt=ON \
  -D Trilinos_ENABLE_Ifpack=ON \
  -D Trilinos_ENABLE_AztecOO=ON \
  -D Trilinos_ENABLE_Sacado=ON \
  -D Trilinos_ENABLE_Teuchos=ON \
  -D Trilinos_ENABLE_MueLu=ON \
  -D Trilinos_ENABLE_ML=ON \
  -D Trilinos_ENABLE_ROL=ON \
  -D Trilinos_ENABLE_Tpetra=ON \
  -D Trilinos_ENABLE_Zoltan=ON \
  -D Trilinos_ENABLE_Fortran=OFF \
  -D Trilinos_VERBOSE_CONFIGURE=OFF \
  -D BUILD_SHARED_LIBS=ON \
  -D CMAKE_VERBOSE_MAKEFILE=OFF \
  -D CMAKE_BUILD_TYPE=RELEASE \
  -D CMAKE_INSTALL_PREFIX=/opt/trilinos/14.2.0-nvcc \
  ..
cmake \
  -D CMAKE_CXX_COMPILER=/opt/trilinos/14.2.0-nvcc/bin/nvcc_wrapper \
  -D DEAL_II_WITH_MPI=ON \
  -D DEAL_II_MPI_WITH_DEVICE_SUPPORT=ON \
  -D DEAL_II_WITH_TBB=OFF \
  -D DEAL_II_WITH_LAPACK=ON \
  -D DEAL_II_WITH_P4EST=ON \
  -D P4EST_DIR=/opt/p4est/2.8 \
  -D DEAL_II_WITH_METIS=OFF \
  -D DEAL_II_WITH_KOKKOS=ON \
  -D DEAL_II_WITH_TRILINOS=ON \
  -D TRILINOS_DIR=/opt/trilinos/14.2.0-nvcc \
  -D DEAL_II_WITH_SUNDIALS=OFF \
  -D DEAL_II_WITH_HDF5=OFF \
  -D DEAL_II_WITH_GMSH=OFF \
  -D DEAL_II_WITH_VTK=OFF \
  -D DEAL_II_COMPONENT_EXAMPLES=OFF \
  -D CMAKE_INSTALL_PREFIX=/opt/dealii/9.7.0-pre-test \
  ..
Could you please help me? @masterleinad @tjhei