celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html
Other
63 stars 34 forks source link

Add support for NVHPC `-stdpar` #505

Open sethrj opened 2 years ago

sethrj commented 2 years ago

Explore auto-parallelization using Nvidia's PGI-derived NVHPC tool suite. We can track development issues on here.

Our initial path is just to modify the host code pathways so that they always run on device, and later we'll cleanly support both hose and device dispatch.

Issues (newest first)

memory access error

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

### CAUGHT SIGNAL: 6 ### address: 0x3ea0003faa38,  signal =  SIGABRT, value =    6, description = abort program (formerly SIGIOT).

running through cuda-gdb:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x18c3a40 (ProcessPrimariesLauncher.hh:55)

Thread 1 "celeritas_globa" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x00000000018c3a50 in celeritas::detail::ProcessPrimariesLauncher<(celeritas::MemSpace)0>::operator() (tid=...)
    at /home/users/s3j/.local/src/celeritas/src/celeritas/track/detail/ProcessPrimariesLauncher.hh:55
55      TrackInitializer& init    = data_.initializers[ThreadId(

This is because data_ is a reference to memory on the host stack. We're going to have to change all our kernel calls to either:

invalid validate

unreachable unreachable

atomics

atomics!

demo interactor resize

"/home/users/s3j/.local/src/celeritas/app/demo-interactor/HostKNDemoRunner.cc", line 87: error: no instance of overloaded function "resize" matches the argument list
            argument types are: (demo_interactor::DetectorStateData<celeritas::Ownership::value, celeritas::MemSpace::host> *, demo_interactor::DetectorParamsData, demo_interactor::KNDemoRunArgs::size_type)
      resize(&detector_states, detector_params, args.max_steps);

just skip the demo interactor for now

unsupported procedure

Geant4 build

Warnings

Fixed numerous warnings in https://github.com/celeritas-project/celeritas/pull/486

Test failures

@pcanal dug down on some slight floating point differences between vanilla GCC and stdpar: we're making incorrectly strict assumptions about floating point behavior in a couple of our unit tests: 2e04478ea9831b5222d6ac53374f333d1cfa7677

sethrj commented 2 years ago

We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.

AlexVlx commented 2 months ago

We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.

I'm not sure if this is still of interest, but if it is we've added support for fairly symmetric functionality, please see here and here. We'd definitely be interested in cooperating:)

sethrj commented 2 months ago

Hey @AlexVlx that's great! Our team is a little overloaded at the moment, but this would be a great project for an intern to implement? We're going to try to bring in more people next year onto our team, and if you have any summer students (or heck, winter students!) we'd love to get in touch and help get this effort off the ground.

mcolg commented 2 months ago

You should also explore using heterogenous memory management (HMM) since it allows the device to access static host memory, including stack objects. It's best used on systems with high-speed links, such as NVLink on Grace Hopper systems, but works, albeit slower, over PCIe connections. This article, which I co-authored, might help as well.

sethrj commented 2 months ago

Thanks @mcolg ! Since the time that we first explored this, we did some substantial refactoring of how we launch kernels (see #743 and #783) to fix various odd behaviors we saw on multiple platforms due to passing too much data as a kernel launch argument. I think we'll encounter many fewer problems next time we try...