sethrj commented 2 years ago

Explore auto-parallelization using Nvidia's PGI-derived NVHPC tool suite. We can track development issues on here.

Our initial path is just to modify the host code pathways so that they always run on device, and later we'll cleanly support both hose and device dispatch.

[x] Install geant4
[x] unsupported procedure
[ ] references to stack variables
[ ] ...

Issues (newest first)

memory access error

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

### CAUGHT SIGNAL: 6 ### address: 0x3ea0003faa38,  signal =  SIGABRT, value =    6, description = abort program (formerly SIGIOT).

running through cuda-gdb:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x18c3a40 (ProcessPrimariesLauncher.hh:55)

Thread 1 "celeritas_globa" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x00000000018c3a50 in celeritas::detail::ProcessPrimariesLauncher<(celeritas::MemSpace)0>::operator() (tid=...)
    at /home/users/s3j/.local/src/celeritas/src/celeritas/track/detail/ProcessPrimariesLauncher.hh:55
55      TrackInitializer& init    = data_.initializers[ThreadId(

This is because data_ is a reference to memory on the host stack. We're going to have to change all our kernel calls to either:

allocate all of our arguments with "global memory" (which might be more efficient for HIP anyway?)
store launcher data by value and make sure we're not accidentally passing references into the stdpar launch calls

invalid validate

celeritas: internal assertion failed: CELER_VALIDATE cannot be called from device code" thrown in the test fixture's constructor.
Use if target magic to conditionally compile for host

unreachable unreachable

nvlink error : Undefined reference to '__builtin_unreachable' in 'src/CMakeFiles/celeritas.dir/celeritas/em/generated/BetheHeitlerInteract.cc.o'
Just don't define it when building stdpar (of course now we get lots of invalid noreturn warnings)

atomics

Since CUDA/HIP are disabled size_type was defaulting to size_t instead of unsigned int
Then src/libceleritas.so: undefined reference toatomicAdd(unsigned int*, unsigned int)'` due to host code also referencing it
Use if target magic

atomics!

demo interactor resize

"/home/users/s3j/.local/src/celeritas/app/demo-interactor/HostKNDemoRunner.cc", line 87: error: no instance of overloaded function "resize" matches the argument list
            argument types are: (demo_interactor::DetectorStateData<celeritas::Ownership::value, celeritas::MemSpace::host> *, demo_interactor::DetectorParamsData, demo_interactor::KNDemoRunArgs::size_type)
      resize(&detector_states, detector_params, args.max_steps);

just skip the demo interactor for now

unsupported procedure

@paulromano got errors while trying to build InitTracks.cc: NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure
@mcolg tracked this down to a CELER_VALIDATE
I've updated CELER_DEVICE_COMPILE to act as though we're in "device compile" mode when using -stdpar 98122dc9952f3790a3ebb079b4732585a05a3ed5

Geant4 build

Geant4 threads are incompatible (nvhpc doesn't like static thread_local in template classes)
Recursive template instantiation depth is too small
Patched spack with https://github.com/spack/spack/pull/32185
Fixed upstream geant4 as emdna-V11-00-25

Warnings

Fixed numerous warnings in https://github.com/celeritas-project/celeritas/pull/486

Test failures

@pcanal dug down on some slight floating point differences between vanilla GCC and stdpar: we're making incorrectly strict assumptions about floating point behavior in a couple of our unit tests: 2e04478ea9831b5222d6ac53374f333d1cfa7677

sethrj commented 2 years ago

We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.

AlexVlx commented 2 months ago

We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.

I'm not sure if this is still of interest, but if it is we've added support for fairly symmetric functionality, please see here and here. We'd definitely be interested in cooperating:)

sethrj commented 2 months ago

Hey @AlexVlx that's great! Our team is a little overloaded at the moment, but this would be a great project for an intern to implement? We're going to try to bring in more people next year onto our team, and if you have any summer students (or heck, winter students!) we'd love to get in touch and help get this effort off the ground.

mcolg commented 2 months ago

You should also explore using heterogenous memory management (HMM) since it allows the device to access static host memory, including stack objects. It's best used on systems with high-speed links, such as NVLink on Grace Hopper systems, but works, albeit slower, over PCIe connections. This article, which I co-authored, might help as well.