cms-patatrack / pixeltrack-standalone

Standalone Patatrack pixel tracking
Apache License 2.0
17 stars 35 forks source link

[alpaka] Fix CachingAllocator for pinned host memory in presence of multiple devices #316

Closed makortel closed 2 years ago

makortel commented 2 years ago

I noticed that alpaka (and alpakatest) was crashing with

Processing 1000 events, of which 2 concurrently, with 2 threads.
terminate called after throwing an instance of 'std::runtime_error'
  what():  .../pixeltrack-standalone/external/alpaka/include/alpaka/event/EventUniformCudaHipRt.hpp(157) 'cudaEventRecord( event.m_spEventImpl->m_UniformCudaHipEvent, queue.m_spQueueImpl->m_UniformCudaHipQueue)' returned error  : 'cudaErrorInvalidResourceHandle': 'invalid resource handle'!

when run on 2 GPUs. I managed to trace the problem into the pinned host memory allocations in CachingAllocator, when a pinned host memory block was previously used on device A, and is then used on device B.

The current code does

// Take previous device from *the argument block*
auto const& previousDevice = block.device();
// Take the queue of the argument block
auto queue = std::move(*(block.queue));
// Copy the information of the block to be reused
block = iBlock->second;
// Assign the queue originally from the argument block
block.queue = std::move(queue);

Since the block.device() returns the device of its Queue, this means that both block.device() and previousDevice point to the same device, and the subsequent code to create a new Event never runs.

A minimal fix would be

-auto const& previousDevice = block.device();
+auto const& previousDevice = iBlock->second.device();

but given that the point is to create a new Event if the device of the Queue accessing the block differs from the device of the Event that the block had, I think testing explicitly that condition makes the intention of the code more clear.

makortel commented 2 years ago

@fwyzard I'll propagate the fix to alpakatest after we agree on it.

fwyzard commented 2 years ago

Thanks @makortel , the change looks good - thanks for catching this and preparing a fix.