GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1
211 stars 85 forks source link

Crash in TableFunction on GPU #1394

Closed francoishamon closed 3 years ago

francoishamon commented 3 years ago

Describe the bug On GPU platforms other than Lassen, the integrated tests that use constitutive models based on TableFunction (dead-oil, relperms) fail at initialization with the following error:

** StackTrace of 24 frames **
Frame 0: LvArray::system::stackTraceHandler(int, bool) 
Frame 1: /home/jfranc/code/cxx/GEOSX/build-gcc8-cuda-debug/lib/libgeosx_core.so 
Frame 2: /home/jfranc/code/cxx/GEOSX/build-gcc8-cuda-debug/lib/libgeosx_core.so 
Frame 3: /lib/x86_64-linux-gnu/libc.so.6 
Frame 4: gsignal 
Frame 5: abort 
Frame 6: /lib/x86_64-linux-gnu/libstdc++.so.6 
Frame 7: /lib/x86_64-linux-gnu/libstdc++.so.6 
Frame 8: /lib/x86_64-linux-gnu/libstdc++.so.6 
Frame 9: __gxx_personality_v0 
Frame 10: /lib/x86_64-linux-gnu/libgcc_s.so.1 
Frame 11: _Unwind_Resume 
Frame 12: LvArray::ChaiBuffer<geosx::TableFunction::KernelWrapper>::free() 
Frame 13: void LvArray::bufferManipulation::free<LvArray::ChaiBuffer<geosx::TableFunction::KernelWrapper> >(LvArray::ChaiBuffer<geosx::TableFunction::KernelWrapper>&, long) 
Frame 14: LvArray::Array<geosx::TableFunction::KernelWrapper, 1, camp::int_seq<long, 0l>, long, LvArray::ChaiBuffer>::~Array() 
Frame 15: geosx::constitutive::DeadOilFluid::~DeadOilFluid() 
Frame 16: geosx::constitutive::DeadOilFluid::~DeadOilFluid() 
Frame 17: std::enable_if<std::is_same<geosx::dataRepository::Group*, geosx::dataRepository::Group*>::value, void>::type geosx::MappedVector<geosx::dataRepository::Group, geosx::dataRepository::Group*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long>::deleteValue<geosx::dataRepository::Group*>(long) 
Frame 18: geosx::MappedVector<geosx::dataRepository::Group, geosx::dataRepository::Group*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long>::clear() 
Frame 19: geosx::MappedVector<geosx::dataRepository::Group, geosx::dataRepository::Group*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long>::~MappedVector() 
Frame 20: geosx::dataRepository::Group::~Group() 
Frame 21: geosx::ElementSubRegionBase::~ElementSubRegionBase() 
Frame 22: geosx::CellBlock::~CellBlock() 
Frame 23: geosx::CellElementSubRegion::~CellElementSubRegion() 
Frame 24: geosx::CellElementSubRegion::~CellElementSubRegion()

so I must have made a mistake somewhere, but did not notice it on Lassen (everything works there).

To Reproduce Steps to reproduce the behavior:

  1. Compile GEOSX (gcc) on Stanford GPU's cluster, or Pecan (Total), or Pangea 3 (Total)
  2. Run an integrated test that uses TableFunction I did not manage to reproduce the crash on Lassen.

Expected behavior Should work fine on every platform.

Platform (please complete the following information):

klevzoff commented 3 years ago

This stack trace is actually just the aftermath of geosx trying to quit, the error first happens earlier in the FluidUpdateKernel.

From cuda-gdb:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x55557849c4b0 (ArrayOfArraysView.hpp:409)

Thread 1 "geosx" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 37, block (0,0,0), thread (32,0,0), device 0, sm 0, warp 1, lane 0]
0x000055557849c4c0 in LvArray::ArrayOfArraysView<double const, int const, true, LvArray::ChaiBuffer>::operator[] (this=0x7fff3e615c08, i=0)
    at /home/klevtsov/GEOSX/src/coreComponents/LvArray/src/ArrayOfArraysView.hpp:409
409     return ArraySlice< T, 1, 0, INDEX_TYPE_NC >( m_values.data() + m_offsets[ i ], &m_sizes[ i ], nullptr );
(cuda-gdb) bt
#0  0x000055557849c4c0 in LvArray::ArrayOfArraysView<double const, int const, true, LvArray::ChaiBuffer>::operator[] (this=0x7fff3e615c08, i=0)
    at /home/klevtsov/GEOSX/src/coreComponents/LvArray/src/ArrayOfArraysView.hpp:409
#1  0x00005555787cb440 in geosx::TableFunction::KernelWrapper::compute<double*, double*>(double* const&, double&, double*&&) const (this=0x7fff3e615c00, input=<optimized out>, value=<optimized out>, 
    derivatives=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/functions/TableFunction.hpp:338
#2  0x0000555578c48bf0 in geosx::constitutive::DeadOilFluidUpdate::computeViscosities (this=0x7fffe2fff518, pressure=5000000, phaseVisc=0x7fffe2ffea70, dPhaseVisc_dPres=0x7fffe2ffeaa0, 
    dPhaseVisc_dGlobalCompFrac=0x7fffe2ffeb00) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:571
#3  geosx::constitutive::DeadOilFluidUpdate::compute (this=0x7fffe2fff518, pressure=<optimized out>, temperature=<optimized out>, composition=0x7fffe2ffec68, phaseFraction=0x7fffe2ffe830, 
    dPhaseFraction_dPressure=0x7fffe2ffe860, dPhaseFraction_dTemperature=0x7fffe2ffe890, dPhaseFraction_dGlobalCompFraction=0x7fffe2ffe8c0, phaseDensity=0x7fffe2ffe8f0, 
    dPhaseDensity_dPressure=0x7fffe2ffe920, dPhaseDensity_dTemperature=0x7fffe2ffe950, dPhaseDensity_dGlobalCompFraction=0x7fffe2ffe980, phaseMassDensity=0x7fffe2ffe9b0, 
    dPhaseMassDensity_dPressure=0x7fffe2ffe9e0, dPhaseMassDensity_dTemperature=0x7fffe2ffea10, dPhaseMassDensity_dGlobalCompFraction=0x7fffe2ffea40, phaseViscosity=0x7fffe2ffea70, 
    dPhaseViscosity_dPressure=0x7fffe2ffeaa0, dPhaseViscosity_dTemperature=0x7fffe2ffead0, dPhaseViscosity_dGlobalCompFraction=0x7fffe2ffeb00, phaseCompFraction=0x7fffe2ffeb30, 
    dPhaseCompFraction_dPressure=0x7fffe2ffeb60, dPhaseCompFraction_dTemperature=0x7fffe2ffeb90, dPhaseCompFraction_dGlobalCompFraction=0x7fffe2ffebc0, totalDensity=<optimized out>, 
    dTotalDensity_dPressure=<optimized out>, dTotalDensity_dTemperature=<optimized out>, dTotalDensity_dGlobalCompFraction=0x7fffe2ffec38, pressure=<optimized out>, temperature=<optimized out>, 
    totalDensity=<optimized out>, dTotalDensity_dPressure=<optimized out>, dTotalDensity_dTemperature=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:679
#4  geosx::constitutive::DeadOilFluidUpdate::update (this=<optimized out>, k=<optimized out>, q=<optimized out>, pressure=<optimized out>, temperature=<optimized out>, composition=0x7fffe2ffec68, 
    this=<optimized out>, k=<optimized out>, q=<optimized out>, pressure=<optimized out>, temperature=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:175
#5  void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}::operator()(int) const (this=<optimized out>, k=<optimized out>)
    at /home/klevtsov/GEOSX/src/coreComponents/physicsSolvers/fluidFlow/CompositionalMultiphaseBaseKernels.hpp:159
#6  0x000055557896f120 in RAJA::policy::cuda::impl::forall_cuda_kernel<256ul, RAJA::Iterators::numeric_iterator<int, int, int*>, void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}, int>(void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}, RAJA::Iterators::numeric_iterator<int, int, int*>, int)<<<(1,1,1),(256,1,1)>>> (loop_body=..., idx=..., length=75)
    at /home/geosx/thirdPartyLibs/install-gcc8-cuda-release/raja/include/RAJA/policy/cuda/forall.hpp:146

Basically, the TableFunction::KernelWrapper object is trying to get a slice of m_coordinates which is an ArrayOfArraysView. However, its buffers have not been properly moved onto device, for example trying to print m_coordinates.m_offsets.m_pointer[0] in the debugger throws a memory access error. Same for m_values.m_pointer[0].

All in all, it looks as if the TableFunction::KernelWrapper object has been bitwise-copied onto device, instead of properly copy-constructed. It just used to work on Lassen because of its magic access to host memory.

klevzoff commented 3 years ago

The root of the problem is the same issue @corbett5 had to work around for nested arrays. Whenever ChaiBuffer moves its data between memory spaces, it requests chai::ArrayManager to do the job, which in turn calls to umpire::ResourceManager, which does the memory transfer, but without invoking copy constructors on the objects being copied (it's a void * API, so no type information is available). ChaiBuffer then tries to call move( space, touch ) on its contained objects to deal with nested arrays correctly. I plugged the whole by adding a similar move( space, touch ) function to TableFunction::KernelWrapper (see https://github.com/GEOSX/GEOSX/pull/1397), but I don't know if that's a viable strategy going forward. We should probably still consider bringing something like chai::managed_ptr over.

corbett5 commented 3 years ago

@francoishamon when you get errors like this in the future try grabbing an allocation with lalloc --atsdisable which turns off the magic access to host memory that @klevzoff mentioned.

francoishamon commented 3 years ago

Ok I see. Thanks a lot for fixing the problem @klevzoff !