Open sethrj opened 2 years ago
We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.
I'm not sure if this is still of interest, but if it is, we've added support for fairly symmetric functionality; please see here and here. We'd definitely be interested in cooperating :)
Hey @AlexVlx, that's great! Our team is a little overloaded at the moment, but this would be a great project for an intern to implement. We're going to try to bring more people onto our team next year, and if you have any summer students (or heck, winter students!) we'd love to get in touch and help get this effort off the ground.
You should also explore using heterogeneous memory management (HMM), since it allows the device to access static host memory, including stack objects. It's best used on systems with high-speed links, such as NVLink on Grace Hopper systems, but it also works, albeit slower, over PCIe connections. This article, which I co-authored, might help as well.
Thanks @mcolg ! Since the time that we first explored this, we did some substantial refactoring of how we launch kernels (see #743 and #783) to fix various odd behaviors we saw on multiple platforms due to passing too much data as a kernel launch argument. I think we'll encounter many fewer problems next time we try...
Explore auto-parallelization using NVIDIA's PGI-derived NVHPC tool suite. We can track development issues here.
Our initial path is just to modify the host code pathways so that they always run on device; later we'll cleanly support both host and device dispatch.
Issues (newest first)

- **memory access error**: seen when running through `cuda-gdb`; this is because `data_` is a reference to memory on the host stack. We're going to have to change all our kernel calls to either: …
- **invalid validate**: `celeritas: internal assertion failed: "CELER_VALIDATE cannot be called from device code"` thrown in the test fixture's constructor. Needs `if target` magic to conditionally compile for host.
- **unreachable**: `nvlink error : Undefined reference to '__builtin_unreachable' in 'src/CMakeFiles/celeritas.dir/celeritas/em/generated/BetheHeitlerInteract.cc.o'`
- **atomics**: `size_type` was defaulting to `size_t` instead of `unsigned int`, giving `src/libceleritas.so: undefined reference to 'atomicAdd(unsigned int*, unsigned int)'` due to host code also referencing it. Needs `if target` magic.
- **demo interactor resize**: just skip the demo interactor for now.
- **unsupported procedure**: `NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure`, from `CELER_VALIDATE`. Set `CELER_DEVICE_COMPILE` to act as though we're in "device compile" mode when using `-stdpar`: 98122dc9952f3790a3ebb079b4732585a05a3ed5
- **Geant4 build**: `static thread_local` in template classes (emdna-V11-00-25)
Warnings
Fixed numerous warnings in https://github.com/celeritas-project/celeritas/pull/486
Test failures
@pcanal dug into some slight floating point differences between vanilla GCC and stdpar: we were making incorrectly strict assumptions about floating point behavior in a couple of our unit tests: 2e04478ea9831b5222d6ac53374f333d1cfa7677