LLNL / RAJA

RAJA Performance Portability Layer (C++)
BSD 3-Clause "New" or "Revised" License
489 stars 103 forks source link

Revise RAJA plugin support #1742

Open adayton1 opened 1 month ago

adayton1 commented 1 month ago

Is your feature request related to a problem? Please describe.

RAJA plugins are used by CHAI to make sure the data backing ManagedArray is in the correct memory space and that it is up to date. However, the approach used now is not stream aware. This leads to suboptimal performance on GPU platforms. Where there is a dual memory space (CUDA), memory copies to the host are done on stream 0, which forces the whole device to synchronize. Where there is a single memory space (HIP), we have to do a synchronize across the whole device to make sure the data is valid during host accesses.

Describe the solution you'd like

Making CHAI stream aware would be relatively straightforward if the camp resource used by RAJA was passed as an argument to the plugin functions. Additionally, the postLaunch function should also receive an event with a wait method that CHAI can call when it needs to be sure the kernel has been completed.

Describe alternatives you've considered

Instead of modifying the plugin, RAJA could set some global state that is accessible when the plugin methods are called.

Additional context

Umpire is working on camp resource aware allocators (https://github.com/LLNL/Umpire/pull/901), which CHAI will also be using.

Also, note that even if only one stream is being used in an application, this new approach will be more efficient than synchronizing across the whole device.

adayton1 commented 1 month ago

Actually, passing an event to postLaunch might need some more design work. For preCapture, we set some global state that is then used by chai::ManagedArray's copy constructor. The global state is then reset in postCapture. I'm imagining analogous preDestroy and postDestroy functions that set and reset global state and chai::ManagedArray's destructor would operate off that global state. That could get tricky to do, though.

I'm also trying to re-evaluate if I can do some kind of registry pattern where chai::ManagedArrays register themselves or some callbacks and then the postLaunch could do the work without having to set some global state.

adayton1 commented 1 month ago

I've been operating on the assumption that an event can be waited on more than once. Is that true?

MrBurmark commented 1 month ago

I think you can wait multiple times on an event. Note that we will also have to be careful using events as camp never frees/destroys the underlying cuda/hip event currently. I've thought about making events move-only or having a shared pointer so they can automatically clean up after themselves. I would prefer move-only as shared pointer can be expensive, but it would be a breaking change in the API. @long58

rhornung67 commented 1 month ago

Target this for next RAJA Suite release.

rhornung67 commented 1 month ago

Talk to ALED and Spheral teams about move-only vs. reference counting for event tracking.

adayton1 commented 4 weeks ago

We would need reference counting in events since multiple CHAI ManagedArrays could track the same event.

adayton1 commented 3 weeks ago

If you wanted to avoid passing an event on the postLaunch method, you could have RAJA::forall avoid returning an event and allow the application to create their own event if desired (since they passed in the resource to begin with, they can easily create one).

MrBurmark commented 3 weeks ago

We return something this is convertible to an event from the forall, so if you don't use it then no event is created.

adayton1 commented 3 weeks ago

We return something this is convertible to an event from the forall, so if you don't use it then no event is created.

Nice! I guess we'll still have to be careful about both CHAI and the application creating separate events. Or is the overhead of creating an event pretty low?

MrBurmark commented 3 weeks ago

That is still a concern, we'd have to measure. I don't think many apps use events.

MrBurmark commented 3 weeks ago

Thinking about things like multiple ownership for events. I can imagine having events that contain a shared pointer to their contents or events that are move-only and those wanting multiple ownership semantics wrap them in a shared pointer.

MrBurmark commented 3 weeks ago

Right now generic resources always come with the overhead of a shared pointer. I've been lamenting that for a while. If we wanted to avoid this we could perhaps have something like shared_resource and shared_event classes and unique_resource and unique_event classes? That way you could choose as a user which semantics you want. Though we'd still have to decide what semantics the typed resources and events should have. Currently typed resources and events are basically views to a stream or event so they can be copied and destroyed without affecting the underlying stream or event. The lifetime of the underlying resources, cuda/hip streams and events, are basically static storage duration and none of them are ever freed.

adayton1 commented 3 weeks ago

We'll definitely need events to be cleaned up - in a single run I could easily see tens of thousands of them being created. I'm less worried about streams at the moment, though that's still good to consider.

adayton1 commented 3 weeks ago

Hmm, I'm not sure how to handle typed resources/events. I don't love the idea of a typed resource destroying a stream when it goes out of scope. Or for that matter, a typed event being waited on when the typed event goes out of scope. Though, if it didn't wait on the event, just released it (is that possible to do without waiting on it?), then that seems reasonable.

MrBurmark commented 2 weeks ago

I don't think you have to wait on events to free them in cuda/hip.