Open YiminJin opened 1 day ago
@kronbichler You saw the same error two days ago, right? This must have been a recent change.
We've had this error in the nightly for a while, both for HIP and CUDA. The error is from compute_diagonal() in step-64. That's the only place we assert raw_diagonal[i] > 0.
Just to confirm: Yes, this was the error I saw. I agree on the place where this comes from; the question is where the wrong computation comes from, as there should not be a zero there. I have it on my to-do list to investigate this in the coming days, but if someone has an idea before then, that would be great.
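To see which entries trip the assertion, rather than aborting on the first device-side assert, one could copy the diagonal back to the host and list the offending indices. Below is a minimal sketch, assuming the diagonal is available as a plain std::vector<double>. The helper find_nonpositive_entries is hypothetical and not part of deal.II.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical debugging helper (not a deal.II function): return every index i
// for which "diagonal[i] > 0." does not hold, i.e. the entries that would make
// the assertion fail.
std::vector<std::size_t>
find_nonpositive_entries(const std::vector<double> &diagonal)
{
  std::vector<std::size_t> bad_indices;
  for (std::size_t i = 0; i < diagonal.size(); ++i)
    {
      // Written as a negated comparison so that NaN entries are reported too,
      // since "NaN > 0." is false just like "0. > 0." is.
      if (!(diagonal[i] > 0.))
        bad_indices.push_back(i);
    }
  return bad_indices;
}
```

Comparing the reported indices against the degrees of freedom that are constrained or on the boundary might hint at where the zeros come from.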
I tried to run the latest version of step-64 on a GPU and ran into the following error:
Cycle 0
Number of active cells: 8
Number of degrees of freedom: 343
:0: : block: [0,0,0], thread: [0,108,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,111,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,103,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,112,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,115,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,83,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,86,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,11,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,20,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,23,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,43,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,56,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [2,0,0], thread: [0,59,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,95,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,31,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,11,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,28,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,31,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,64,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,67,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,87,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [1,0,0], thread: [0,95,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,39,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,47,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,60,0] Assertion `raw_diagonal[i] > 0.` failed.
:0: : block: [0,0,0], thread: [0,63,0] Assertion `raw_diagonal[i] > 0.` failed.
cudaStreamSynchronize(stream) error( cudaErrorAssert): device-side assert triggered /home/jinym/build/trilinos/14.2.0/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:144
Backtrace:
Kokkos::Impl::save_stacktrace() [0x7f285d6400c5]
Kokkos::Impl::traceback_callstack(std::ostream&) [0x7f285d63787a]
Kokkos::Impl::host_abort(char const*) [0x7f285d6378ab]
Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int) [0x7f285d647474]
Kokkos::Impl::cuda_stream_synchronize(CUstream_st*, Kokkos::Impl::CudaInternal const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [0x7f285d64803e]
Kokkos::View<double*, Kokkos::CudaUVMSpace>::View<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::has_pointer, Kokkos::LayoutLeft>::type const&) [0x7f287089e5e8]
Kokkos::View<double*, Kokkos::CudaUVMSpace>::View<char [17]>(char const (&) [17], std::enable_if<Kokkos::Impl::is_view_label<char [17]>::value, unsigned long const>::type, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) [0x7f287087ca74]
dealii::MemorySpace::MemorySpaceData<double, dealii::MemorySpace::Default>::MemorySpaceData() [0x7f2870867808]
dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default>::Vector() [0x7f287084e6af]
dealii::PreconditionChebyshev<Step64::HelmholtzOperator<3, 3>, dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default>, dealii::DiagonalMatrix<dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Default> > >::PreconditionChebyshev() [0x673ceb]
Step64::HelmholtzProblem<3, 3>::solve() [0x6698d4]
Step64::HelmholtzProblem<3, 3>::run() [0x65e782]
main [0x64491f]
__libc_start_main [0x7f2846b8aecc]
_start [0x6446d5]
[GEOIST:22119] *** Process received signal ***
[GEOIST:22119] Signal: Aborted (6)
[GEOIST:22119] Signal code: (-6)
[GEOIST:22119] [ 0] /usr/lib/ [0x7f2846ba21d0]
[GEOIST:22119] [ 1] /usr/lib/ [0x7f2846bfb3f4]
[GEOIST:22119] [ 2] /usr/lib/ [0x7f2846ba2120]
[GEOIST:22119] [ 3] /usr/lib/ [0x7f2846b894c3]
[GEOIST:22119] [ 4] /opt/trilinos/14.2.0-nvcc/lib/ [0x7f285d6378b0]
[GEOIST:22119] [ 5] /opt/trilinos/14.2.0-nvcc/lib/ [0x7f285d647474]
[GEOIST:22119] [ 6] /opt/trilinos/14.2.0-nvcc/lib/ [0x7f285d64803e]
[GEOIST:22119] [ 7] /home/jinym/projects/github/dealii/build-test/lib/ [0x7f287089e5e8]
[GEOIST:22119] [ 8] /home/jinym/projects/github/dealii/build-test/lib/ [0x7f287087ca74]
[GEOIST:22119] [ 9] /home/jinym/projects/github/dealii/build-test/lib/ [0x7f2870867808]
[GEOIST:22119] [10] /home/jinym/projects/github/dealii/build-test/lib/ [0x7f287084e6af]
[GEOIST:22119] [11] ./step-64(_ZN6dealii21PreconditionChebyshevIN6Step6417HelmholtzOperatorILi3ELi3EEENS_13LinearAlgebra11distributed6VectorIdNS_11MemorySpace7DefaultEEENS_14DiagonalMatrixIS9_EEEC1Ev+0x47) [0x673ceb]
[GEOIST:22119] [12] ./step-64(_ZN6Step6416HelmholtzProblemILi3ELi3EE5solveEv+0xfa) [0x6698d4]
[GEOIST:22119] [13] ./step-64(_ZN6Step6416HelmholtzProblemILi3ELi3EE3runEv+0x146) [0x65e782]
[GEOIST:22119] [14] ./step-64(main+0x74) [0x64491f]
[GEOIST:22119] [15] /usr/lib/ [0x7f2846b8ae08]
[GEOIST:22119] [16] /usr/lib/ [0x7f2846b8aecc]
[GEOIST:22119] [17] ./step-64(_start+0x25) [0x6446d5]
[GEOIST:22119] *** End of error message ***
zsh: IOT instruction (core dumped) ./step-64

The error message shows that the error occurs in the constructor of PreconditionChebyshev. When the preconditioner is changed to PreconditionIdentity, the code works well.

I tried to trace the bug with gdb, and it led me to a place in Kokkos_View.hpp (in the Kokkos directory). The lines inside the block at that location are executed, which means that the code tries to allocate memory from CudaUVMSpace. I cannot understand this, because the Kokkos version in my Trilinos library is 4.0.1 and the CMake option KOKKOS_ENABLE_CUDA_UVM is OFF (I have checked that in CMakeCache.txt). I also cannot understand why the GPU memory allocation in PreconditionChebyshev runs into an error, while the GPU memory allocation in other places (such as SolverCG) does not.

The configurations of trilinos and dealii on my machine are as follows:

#!/bin/bash
cmake \
  -D CMAKE_CXX_COMPILER=mpicxx \
  -D CMAKE_C_COMPILER=mpicc \
  -D CMAKE_Fortran_COMPILER=mpifort \
  -D CMAKE_CXX_FLAGS="-g -lineinfo -Xcudafe \
  --diag_suppress=conversion_function_not_usable -Xcudafe \
  --diag_suppress=cc_clobber_ignored -Xcudafe \
  --diag_suppress=code_is_unreachable" \
  -D TPL_ENABLE_Boost=OFF \
  -D TPL_ENABLE_MPI=ON \
  -D TPL_ENABLE_CUDA=ON \
  -D Kokkos_ENABLE_CUDA=ON \
  -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
  -D Kokkos_ENABLE_CUDA_CONSTEXPR=ON \
  -D Kokkos_ENABLE_CUDA_UVM=OFF \
  -D Trilinos_ENABLE_Amesos=ON \
  -D Trilinos_ENABLE_Epetra=ON \
  -D Trilinos_ENABLE_EpetraExt=ON \
  -D Trilinos_ENABLE_Ifpack=ON \
  -D Trilinos_ENABLE_AztecOO=ON \
  -D Trilinos_ENABLE_Sacado=ON \
  -D Trilinos_ENABLE_Teuchos=ON \
  -D Trilinos_ENABLE_MueLu=ON \
  -D Trilinos_ENABLE_ML=ON \
  -D Trilinos_ENABLE_ROL=ON \
  -D Trilinos_ENABLE_Tpetra=ON \
  -D Trilinos_ENABLE_Zoltan=ON \
  -D Trilinos_ENABLE_Fortran=OFF \
  -D Trilinos_VERBOSE_CONFIGURE=OFF \
  -D BUILD_SHARED_LIBS=ON \
  -D CMAKE_VERBOSE_MAKEFILE=OFF \
  -D CMAKE_BUILD_TYPE=RELEASE \
  -D CMAKE_INSTALL_PREFIX=/opt/trilinos/14.2.0-nvcc \
  ..
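The UVM option appears in two spellings in this report (KOKKOS_ENABLE_CUDA_UVM and Kokkos_ENABLE_CUDA_UVM), so a CMakeCache.txt check that ignores case might be worth doing. Below is a rough sketch of such a lookup, assuming the usual KEY:TYPE=VALUE layout of cache entries. The function lookup_cache_entry is a hypothetical helper written for illustration and belongs to neither CMake nor Kokkos.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Lower-case a copy of the string so that keys can be compared ignoring case.
std::string to_lower(std::string s)
{
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Hypothetical helper: scan text in CMakeCache.txt format for an entry
// "KEY:TYPE=VALUE" whose key matches `key` case-insensitively. On success,
// store VALUE in `value` and return true; otherwise return false.
bool lookup_cache_entry(const std::string &cache_text,
                        const std::string &key,
                        std::string       &value)
{
  std::istringstream in(cache_text);
  std::string        line;
  while (std::getline(in, line))
    {
      // Skip blank lines and the two comment styles used in the cache file.
      if (line.empty() || line[0] == '#' || line.compare(0, 2, "//") == 0)
        continue;
      const std::size_t colon = line.find(':');
      if (colon == std::string::npos)
        continue;
      const std::size_t equals = line.find('=', colon);
      if (equals == std::string::npos)
        continue;
      if (to_lower(line.substr(0, colon)) == to_lower(key))
        {
          value = line.substr(equals + 1);
          return true;
        }
    }
  return false;
}
```

Reading the whole of CMakeCache.txt into a string and calling lookup_cache_entry(text, "KOKKOS_ENABLE_CUDA_UVM", value) would then report the stored value regardless of which spelling the cache uses.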
Could you please help me? @masterleinad @tjhei