Closed BradWhitlock closed 1 month ago
I'm trying a fix to set m_executeOnGPU based on the memory space, inside of Array::initialize, Array::initialize_from_other, and one of the constructors that calls neither of those methods.
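A minimal sketch of that idea, not Axom's actual code: `ToyArray`, `ToyMemorySpace`, and `toyExecuteOnGPU` are hypothetical stand-ins, and the rule "only device-only memory executes on the GPU" is an assumption made for illustration. The point is that every construction path sets the flag, and a default member initializer covers the constructor that calls neither helper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a memory-space tag; not Axom's actual enum.
enum class ToyMemorySpace { Host, Unified, Device };

// Assumption for this sketch: only device-only memory needs GPU execution.
constexpr bool toyExecuteOnGPU(ToyMemorySpace space)
{
  return space == ToyMemorySpace::Device;
}

class ToyArray
{
public:
  // Constructor that calls neither initialize() nor initializeFromOther():
  // the default member initializers below still give the flag a value.
  ToyArray() = default;

  ToyArray(std::size_t n, ToyMemorySpace space) { initialize(n, space); }

  // Sets the flag from the memory space every time the array is (re)initialized.
  void initialize(std::size_t n, ToyMemorySpace space)
  {
    m_size = n;
    m_space = space;
    m_executeOnGPU = toyExecuteOnGPU(space);
  }

  // Copies the size and space of another array and derives the flag the same way.
  void initializeFromOther(const ToyArray& other)
  {
    initialize(other.m_size, other.m_space);
  }

  bool executeOnGPU() const { return m_executeOnGPU; }
  std::size_t size() const { return m_size; }

private:
  std::size_t m_size = 0;
  ToyMemorySpace m_space = ToyMemorySpace::Host;
  bool m_executeOnGPU = false;  // never left indeterminate
};
```

With this layout, a default-constructed `ToyArray` reports `executeOnGPU() == false`, and one built with `ToyMemorySpace::Device` reports `true`.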
As mentioned previously:
Link to documentation on setting/disabling the Address Translation Services (ATS), and checking if it is enabled/disabled (Point 19): https://lc.llnl.gov/confluence/display/SIERRA/Quickstart+Guide
Thanks @BradWhitlock. We can have @publixsubfan look into this to make sure other issues don't occur.
Update. I've had some trouble reproducing the crash on develop. The Array::m_executeOnGPU member is uninitialized but it does not seem to matter much. When it fails in my branch, it seems like some bad optimization might be at work. I was getting the allocatorID to pass from execution_space<ExecSpace>::allocatorID() and it seemed (in Totalview) that the allocatorID was getting optimized out. If I make it "volatile" to prevent inlining then I can see it returns 3 and it works normally. The code resembles:
void buildShapeMap(axom::Array<axom::IndexType> &values, axom::Array<axom::IndexType> &ids, int allocatorID)
{
  const axom::IndexType n = /* get the size */;
  values = axom::Array<axom::IndexType>(n, n, allocatorID);
  ids = axom::Array<axom::IndexType>(n, n, allocatorID);
  // Fill values, ids here
}
...
/*volatile*/ int allocatorID = axom::execution_space<ExecSpace>::allocatorID();
axom::Array<axom::IndexType> values, ids;
buildShapeMap(values, ids, allocatorID);
Yes, I believe we need to initialize m_executeOnGPU to an appropriate default value. Good catch @BradWhitlock.
But is this happening with CUDA device-only memory? The value of that variable should be immaterial -- we should be passing through to special logic for that case.
Code like the following resulted in Array::Array trying to initialize elements of a device-allocated array using placement new on the host. The code SEGV'd.
This method calls initialize() with 2 arguments, so the 3rd argument takes its default value of true, which means the elements are default-constructed. https://github.com/LLNL/axom/blob/70b360815ebed6a25e6ae369bc5efeaa58cacdbc/src/axom/core/Array.hpp#L1084
https://github.com/LLNL/axom/blob/70b360815ebed6a25e6ae369bc5efeaa58cacdbc/src/axom/core/Array.hpp#L1591
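To illustrate the mechanism, here is a rough sketch with a hypothetical `ToyStorage` type. It is not the code at the links above; the `deviceOnly` flag and the two bookkeeping booleans are assumptions made so the behavior can be observed without a GPU. It shows how a two-argument call silently picks up the defaulted third argument, and how a guard on the memory space keeps host-side construction away from device-only memory.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy container state; records which construction path was chosen
// instead of actually touching memory.
struct ToyStorage
{
  bool deviceOnly = false;         // assumption: true for device-only allocations
  std::size_t size = 0;
  std::size_t capacity = 0;
  bool hostConstructed = false;    // placement new on the host would have run
  bool deviceConstructed = false;  // a device-side fill would have run

  // The third parameter defaults to true, so initialize(n, cap) requests
  // default construction even though the caller never mentions it.
  void initialize(std::size_t n, std::size_t cap, bool defaultConstruct = true)
  {
    size = n;
    capacity = (cap < n) ? n : cap;
    hostConstructed = false;
    deviceConstructed = false;

    if(!defaultConstruct)
    {
      return;  // caller promised to fill the elements later
    }

    if(deviceOnly)
    {
      // A real implementation would fill the elements on the device here.
      deviceConstructed = true;
    }
    else
    {
      // Host-accessible memory: constructing elements on the host is safe.
      hostConstructed = true;
    }
  }
};
```

Without the `deviceOnly` branch, the two-argument call would fall through to the host path, which matches the crash described above.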
I think the root of the problem could be that Array::m_executeOnGPU is not initialized anywhere. Valgrind was logging uninitialized memory in this area and m_executeOnGPU is probably the culprit.
Calling axom::Array(n, n, allocatorID) where allocatorID is a CUDA allocator should not cause a SEGV and it should initialize the data as needed on device.
I was told that ATS might have some bearing here too.