Closed francoishamon closed 3 years ago
This stack trace is actually just the aftermath of geosx trying to quit, the error first happens earlier in the FluidUpdateKernel
.
From cuda-gdb:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x55557849c4b0 (ArrayOfArraysView.hpp:409)
Thread 1 "geosx" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 37, block (0,0,0), thread (32,0,0), device 0, sm 0, warp 1, lane 0]
0x000055557849c4c0 in LvArray::ArrayOfArraysView<double const, int const, true, LvArray::ChaiBuffer>::operator[] (this=0x7fff3e615c08, i=0)
at /home/klevtsov/GEOSX/src/coreComponents/LvArray/src/ArrayOfArraysView.hpp:409
409 return ArraySlice< T, 1, 0, INDEX_TYPE_NC >( m_values.data() + m_offsets[ i ], &m_sizes[ i ], nullptr );
(cuda-gdb) bt
#0 0x000055557849c4c0 in LvArray::ArrayOfArraysView<double const, int const, true, LvArray::ChaiBuffer>::operator[] (this=0x7fff3e615c08, i=0)
at /home/klevtsov/GEOSX/src/coreComponents/LvArray/src/ArrayOfArraysView.hpp:409
#1 0x00005555787cb440 in geosx::TableFunction::KernelWrapper::compute<double*, double*>(double* const&, double&, double*&&) const (this=0x7fff3e615c00, input=<optimized out>, value=<optimized out>,
derivatives=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/functions/TableFunction.hpp:338
#2 0x0000555578c48bf0 in geosx::constitutive::DeadOilFluidUpdate::computeViscosities (this=0x7fffe2fff518, pressure=5000000, phaseVisc=0x7fffe2ffea70, dPhaseVisc_dPres=0x7fffe2ffeaa0,
dPhaseVisc_dGlobalCompFrac=0x7fffe2ffeb00) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:571
#3 geosx::constitutive::DeadOilFluidUpdate::compute (this=0x7fffe2fff518, pressure=<optimized out>, temperature=<optimized out>, composition=0x7fffe2ffec68, phaseFraction=0x7fffe2ffe830,
dPhaseFraction_dPressure=0x7fffe2ffe860, dPhaseFraction_dTemperature=0x7fffe2ffe890, dPhaseFraction_dGlobalCompFraction=0x7fffe2ffe8c0, phaseDensity=0x7fffe2ffe8f0,
dPhaseDensity_dPressure=0x7fffe2ffe920, dPhaseDensity_dTemperature=0x7fffe2ffe950, dPhaseDensity_dGlobalCompFraction=0x7fffe2ffe980, phaseMassDensity=0x7fffe2ffe9b0,
dPhaseMassDensity_dPressure=0x7fffe2ffe9e0, dPhaseMassDensity_dTemperature=0x7fffe2ffea10, dPhaseMassDensity_dGlobalCompFraction=0x7fffe2ffea40, phaseViscosity=0x7fffe2ffea70,
dPhaseViscosity_dPressure=0x7fffe2ffeaa0, dPhaseViscosity_dTemperature=0x7fffe2ffead0, dPhaseViscosity_dGlobalCompFraction=0x7fffe2ffeb00, phaseCompFraction=0x7fffe2ffeb30,
dPhaseCompFraction_dPressure=0x7fffe2ffeb60, dPhaseCompFraction_dTemperature=0x7fffe2ffeb90, dPhaseCompFraction_dGlobalCompFraction=0x7fffe2ffebc0, totalDensity=<optimized out>,
dTotalDensity_dPressure=<optimized out>, dTotalDensity_dTemperature=<optimized out>, dTotalDensity_dGlobalCompFraction=0x7fffe2ffec38, pressure=<optimized out>, temperature=<optimized out>,
totalDensity=<optimized out>, dTotalDensity_dPressure=<optimized out>, dTotalDensity_dTemperature=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:679
#4 geosx::constitutive::DeadOilFluidUpdate::update (this=<optimized out>, k=<optimized out>, q=<optimized out>, pressure=<optimized out>, temperature=<optimized out>, composition=0x7fffe2ffec68,
this=<optimized out>, k=<optimized out>, q=<optimized out>, pressure=<optimized out>, temperature=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:175
#5 void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}::operator()(int) const (this=<optimized out>, k=<optimized out>)
at /home/klevtsov/GEOSX/src/coreComponents/physicsSolvers/fluidFlow/CompositionalMultiphaseBaseKernels.hpp:159
#6 0x000055557896f120 in RAJA::policy::cuda::impl::forall_cuda_kernel<256ul, RAJA::Iterators::numeric_iterator<int, int, int*>, void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}, int>(void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}, RAJA::Iterators::numeric_iterator<int, int, int*>, int)<<<(1,1,1),(256,1,1)>>> (loop_body=..., idx=..., length=75)
at /home/geosx/thirdPartyLibs/install-gcc8-cuda-release/raja/include/RAJA/policy/cuda/forall.hpp:146
Basically, the TableFunction::KernelWrapper
object is trying to get a slice of m_coordinates
which is an ArrayOfArraysView
. However, its buffers have not been properly moved onto device, for example trying to print m_coordinates.m_offsets.m_pointer[0]
in the debugger throws a memory access error. Same for m_values.m_pointer[0]
.
All in all, it looks as if the TableFunction::KernelWrapper
object has been bitwise-copied onto device, instead of properly copy-constructed. It just used to work on Lassen because of its magic access to host memory.
The root of the problem is the same issue @corbett5 had to work around for nested arrays. Whenever ChaiBuffer
moves its data between memory spaces, it requests chai::ArrayManager
to do the job, which in turn calls to umpire::ResourceManager
, which does the memory transfer, but without invoking copy constructors on the objects being copied (it's a void *
API, so no type information is available). ChaiBuffer
then tries to call move( space, touch )
on its contained objects to deal with nested arrays correctly. I plugged the whole by adding a similar move( space, touch )
function to TableFunction::KernelWrapper
(see https://github.com/GEOSX/GEOSX/pull/1397), but I don't know if that's a viable strategy going forward. We should probably still consider bringing something like chai::managed_ptr
over.
@francoishamon when you get errors like this in the future try grabbing an allocation with lalloc --atsdisable
which turns off the magic access to host memory that @klevzoff mentioned.
Ok I see. Thanks a lot for fixing the problem @klevzoff !
Describe the bug On GPU platforms other than Lassen, the integrated tests that use constitutive models based on
TableFunction
(dead-oil, relperms) fail at initialization with the following error:so I must have made a mistake somewhere, but did not notice it on Lassen (everything works there).
To Reproduce Steps to reproduce the behavior:
TableFunction
I did not manage to reproduce the crash on Lassen.Expected behavior Should work fine on every platform.
Platform (please complete the following information):
develop