LLNL / axom

CS infrastructure components for HPC applications
BSD 3-Clause "New" or "Revised" License

axom::Array constructor crash on CUDA. #1432

Open BradWhitlock opened 2 hours ago

BradWhitlock commented 2 hours ago

Code like the following resulted in Array::Array trying to initialize elements of a device-allocated array using placement new on the host. The code SEGV'd.

using ExecSpace = axom::CUDA_EXEC<256>;
const int allocatorID = axom::execution_space<ExecSpace>::allocatorID();
axom::Array<int> arr(n, n, allocatorID);

This constructor calls initialize() with 2 arguments, so the 3rd argument takes its default of true, which means the elements are default-constructed. https://github.com/LLNL/axom/blob/70b360815ebed6a25e6ae369bc5efeaa58cacdbc/src/axom/core/Array.hpp#L1084

https://github.com/LLNL/axom/blob/70b360815ebed6a25e6ae369bc5efeaa58cacdbc/src/axom/core/Array.hpp#L1591
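To illustrate the call pattern described above, here is a minimal sketch, not axom's actual code. `ToyArray`, `constructedOnHost()`, and the parameter name `default_construct` are hypothetical, and host `malloc` stands in for the allocator so the example runs without CUDA. It only shows how a two-argument call to `initialize()` silently opts in to host-side placement new.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <type_traits>

// Hypothetical toy container (not axom's implementation). It mirrors only the
// call pattern from the issue: a constructor forwarding two arguments to
// initialize(), whose third parameter defaults to true.
template <typename T>
class ToyArray
{
  static_assert(std::is_trivially_destructible<T>::value,
                "sketch skips destructor calls, so T must be trivially destructible");

public:
  ToyArray(std::size_t n, std::size_t capacity)
  {
    // Only two arguments are passed, so default_construct becomes true.
    initialize(n, capacity);
  }

  ~ToyArray() { std::free(m_data); }

  ToyArray(const ToyArray&) = delete;
  ToyArray& operator=(const ToyArray&) = delete;

  // Number of elements that host code constructed with placement new.
  std::size_t constructedOnHost() const { return m_constructedOnHost; }

private:
  void initialize(std::size_t n, std::size_t capacity, bool default_construct = true)
  {
    assert(n <= capacity);
    // Assumption: host malloc stands in for the real allocator. If this
    // pointer referred to device-only memory, the loop below would write to
    // it from the host, which is the failure mode reported in the issue.
    m_data = static_cast<T*>(std::malloc(capacity * sizeof(T)));
    if(default_construct && m_data != nullptr)
    {
      for(std::size_t i = 0; i < n; ++i)
      {
        new(m_data + i) T();  // host-side placement new
      }
      m_constructedOnHost = n;
    }
  }

  T* m_data = nullptr;
  std::size_t m_constructedOnHost = 0;
};
```

Constructing `ToyArray<int> arr(4, 8)` runs the host loop for 4 elements even though the caller never asked for default construction explicitly.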

I think the root of the problem could be that Array::m_executeOnGPU is not initialized anywhere. Valgrind was logging uninitialized memory in this area and m_executeOnGPU is probably the culprit.

Calling axom::Array(n, n, allocatorID) where allocatorID is a CUDA allocator should not cause a SEGV and it should initialize the data as needed on device.

I was told that ATS might have some bearing here too.

rzansel7{whitlocb}103: detect_ats
rzansel7     ATS detected

BradWhitlock commented 2 hours ago

I'm trying a fix to set m_executeOnGPU based on the memory space, inside of Array::initialize, Array::initialize_from_other, and one of the constructors that calls neither of those methods.
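A rough sketch of what that kind of fix could look like, under stated assumptions: `ToyMemorySpace`, `toySpaceOf()`, `ToyArrayState`, and the allocator ID values are all hypothetical and are not axom's API. The sketch gives the flag a default member initializer so it is never indeterminate, and then sets it from the memory space on every initialization path. Which memory spaces should count as "GPU" is also an assumption here.

```cpp
#include <cstddef>

// Hypothetical memory-space tag; axom's own enum and allocator queries are
// not reproduced here.
enum class ToyMemorySpace
{
  Host,
  Device
};

// Assumed mapping for illustration only: allocator ID 1 is device memory,
// everything else is host memory.
inline ToyMemorySpace toySpaceOf(int allocatorID)
{
  return allocatorID == 1 ? ToyMemorySpace::Device : ToyMemorySpace::Host;
}

class ToyArrayState
{
public:
  ToyArrayState(std::size_t n, int allocatorID) { initialize(n, allocatorID); }

  // Copy into a possibly different allocator.
  ToyArrayState(const ToyArrayState& other, int allocatorID)
  {
    initialize_from_other(other, allocatorID);
  }

  bool executeOnGPU() const { return m_executeOnGPU; }
  std::size_t size() const { return m_size; }

private:
  // Single place that derives the flag from the memory space.
  void setExecuteOnGPU(int allocatorID)
  {
    m_executeOnGPU = (toySpaceOf(allocatorID) == ToyMemorySpace::Device);
  }

  void initialize(std::size_t n, int allocatorID)
  {
    m_allocatorID = allocatorID;
    setExecuteOnGPU(allocatorID);
    m_size = n;
  }

  void initialize_from_other(const ToyArrayState& other, int allocatorID)
  {
    m_allocatorID = allocatorID;
    setExecuteOnGPU(allocatorID);
    m_size = other.m_size;
  }

  std::size_t m_size = 0;
  int m_allocatorID = 0;
  bool m_executeOnGPU = false;  // default initializer: never read uninitialized
};
```

Routing every path through one helper means a constructor that bypasses `initialize()` and `initialize_from_other()` still starts from a defined value of `false` rather than whatever happened to be in memory.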

bmhan12 commented 2 hours ago

As mentioned previously:

Link to documentation on setting/disabling the Address Translation Services (ATS), and checking if it is enabled/disabled (Point 19): https://lc.llnl.gov/confluence/display/SIERRA/Quickstart+Guide

rhornung67 commented 2 hours ago

Thanks @BradWhitlock. We can have @publixsubfan look into this to make sure other issues don't occur.