celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html

Replace StreamStore and helpers with reduction function #1296

Open sethrj opened 3 months ago

sethrj commented 3 months ago

The StreamStore is no longer needed since #1278, except that diagnostic classes still use it to reduce over multiple states for output. I think we need to break this functionality apart, since we really don't want the params to be mutable or to retain access to multiple states.

  1. Add an EndRunActionInterface that takes the core params, a core state, and a Span<CoreState<M> const*> of all states for performing a reduction in a multithreaded context (see the sketch after this list). The action itself should know whether to do a global reduction or otherwise: probably it should always reduce to StreamId{0}. The action should check that state.stream_id() < all_states.size() && all_states[state.stream_id().get()] == &state. We probably also want to add an MPI communicator to the core params so that we can perform reductions with dynamic parallelism.
  2. Define an output adapter that can take a state (or aux state data plus memspace?) and write that at the end of the program. This requires lifetime considerations: either the state itself should be shared outside of the stepper, or the aux state vec should become a shared pointer. I'm leaning toward the latter...
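A minimal sketch of what item 1 could look like, using simplified stand-ins for the Celeritas types (CoreParams, CoreState, StreamId, and Span are existing names; the end_run signature, the validate helper, and the stand-in definitions are assumptions for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <span>

enum class MemSpace { host, device };

struct StreamId
{
    std::size_t value;
    std::size_t get() const { return value; }
};

struct CoreParams {};  // stand-in: would also hold the proposed MPI communicator

template<MemSpace M>
struct CoreState
{
    StreamId id;
    StreamId stream_id() const { return id; }
};

template<MemSpace M>
class EndRunActionInterface
{
  public:
    virtual ~EndRunActionInterface() = default;

    // Reduce over all states at end of run; implementations would always
    // write the merged result to the state with StreamId{0}
    virtual void end_run(CoreParams const& params,
                         CoreState<M>& state,
                         std::span<CoreState<M> const* const> all_states)
        = 0;

  protected:
    // Consistency check from item 1: each state must occupy the slot
    // matching its stream ID
    static void validate(CoreState<M> const& state,
                         std::span<CoreState<M> const* const> all_states)
    {
        assert(state.stream_id().get() < all_states.size()
               && all_states[state.stream_id().get()] == &state);
    }
};
```

For item 2, making the aux state vec a shared pointer would similarly let the output adapter keep the state data alive after the stepper is destroyed.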
sethrj commented 3 weeks ago

This is going to be troublesome for the different ways that we execute across threads. If we're running with OpenMP, we know that all states start and stop at a synchronization point, so it's easy to send a vector of state references when each thread has finished but the states are still allocated. However, if we're running through Geant4 MT, the "EndOfRunAction" will be called individually on each worker thread and then on the "master" thread, but we have to deallocate each state on its original thread, which creates an ordering issue.

Perhaps instead of trying to make our destructor ordering work with Geant4's threading model, we should add special cases for anything that has to manage Geant4 objects:

Now that we have a LocalTransporter, we could have it manage shared pointers to the hit processors, and then give weak pointers (for safety) plus raw pointers (for performance: since the hit processors are only "shared" within a single thread, we don't have to use locking) to the hit manager.
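A rough sketch of that ownership scheme, with simplified stand-ins (LocalTransporter is a real class; the HitProcessor/HitManager definitions and member functions here are hypothetical):

```cpp
#include <memory>

class HitProcessor {};  // stand-in for the Geant4-side hit processor

class HitManager
{
  public:
    // Keep both handles: the weak pointer for lifetime safety, the raw
    // pointer for the hot path. No locking is needed because the processor
    // is only "shared" within a single thread.
    void set_processor(std::weak_ptr<HitProcessor> weak, HitProcessor* raw)
    {
        weak_ = std::move(weak);
        raw_ = raw;
    }

  private:
    std::weak_ptr<HitProcessor> weak_;
    HitProcessor* raw_{nullptr};
};

class LocalTransporter
{
  public:
    // The transporter owns the processor, tying its lifetime to the
    // thread-local transporter rather than to any shared object
    LocalTransporter() : hit_processor_{std::make_shared<HitProcessor>()} {}

    void register_with(HitManager& hm)
    {
        hm.set_processor(hit_processor_, hit_processor_.get());
    }

  private:
    std::shared_ptr<HitProcessor> hit_processor_;
};
```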

We also need to add an aux state vector argument to StepInterface::process_steps so that step processors can keep stateful data without collection mirrors.
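A hypothetical version of that extension (StepInterface and process_steps are real names; the AuxStateVec definition here and the extra argument are assumptions):

```cpp
#include <memory>
#include <vector>

class AuxStateInterface
{
  public:
    virtual ~AuxStateInterface() = default;
};

// Per-stream vector of type-erased auxiliary state
using AuxStateVec = std::vector<std::unique_ptr<AuxStateInterface>>;

struct StepStateData {};  // stand-in for the gathered step data

class StepInterface
{
  public:
    virtual ~StepInterface() = default;

    // Passing the aux state vector lets a step processor keep stateful,
    // per-stream data without needing a collection mirror
    virtual void process_steps(StepStateData const& steps, AuxStateVec& aux)
        = 0;
};
```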

So the order of this will be:

  1. Change hit processor ownership so that it's managed by the local transporter but a weak pointer is kept by the hit manager.
  2. Have the local transporter register the states (or the stepper) with the main "shared params" so that the states can be merged and finalized at once (sketched below).
  3. Add a special case in the LocalTransporter for deallocating Geant4 geometry states on-thread (?)
  4. Then we can start making more components stateful and gatherable: action timers, calorimeters, etc.
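A sketch of step 2's registration and merge/finalize flow (SharedParams and LocalTransporter are real class names; the members and methods here are hypothetical):

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct CoreStateBase
{
    virtual ~CoreStateBase() = default;
};

class SharedParams
{
  public:
    // Called by each local transporter (or its stepper) when its state is
    // created; shared ownership lets the state outlive the stepper
    void register_state(std::size_t stream, std::shared_ptr<CoreStateBase> s)
    {
        std::lock_guard<std::mutex> lock{mutex_};
        if (states_.size() <= stream)
        {
            states_.resize(stream + 1);
        }
        states_[stream] = std::move(s);
    }

    // Called once on the master thread after all workers have finished:
    // merge diagnostics across streams, then release all states at once
    void finalize()
    {
        // ... reduce diagnostics over states_ into stream 0 ...
        states_.clear();
    }

  private:
    std::mutex mutex_;
    std::vector<std::shared_ptr<CoreStateBase>> states_;
};
```

Note that releasing the states here on the master thread is exactly what step 3's special case would have to reconcile with deallocating Geant4 geometry states on their original threads.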