celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html

Replace StreamStore and helpers with reduction function #1296

Open sethrj opened 3 months ago

sethrj commented 3 months ago

The StreamStore is no longer needed since #1278, except that diagnostic classes still use it to reduce over multiple states for output. I think we need to break this functionality apart, since we really don't want the params to be mutable or to retain access to multiple states.

  1. Add an EndRunActionInterface that takes the core params, a core state, and a Span<CoreState<M> const*> of all states for performing a reduction in a multithreaded context (see the sketch after this list). The action itself should know whether to do a global reduction or otherwise: probably it should always reduce to StreamId{0}. The action should check that state.stream_id() < all_states.size() && all_states[state.stream_id().get()] == &state. We probably also want to add an MPI communicator to the core params so that we can perform reductions with dynamic parallelism.
  2. Define an output adapter that can take a state (or aux state data plus memspace?) and write that at the end of the program. This requires lifetime considerations: either the state itself should be shared outside of the stepper, or the aux state vec should become a shared pointer. I'm leaning toward the latter...
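A minimal sketch of what item 1 could look like, using simplified stand-ins for the Celeritas types (CoreParams, CoreState, StreamId, and Span are existing names; the end_run signature, the validate helper, and the stand-in definitions are assumptions for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <span>

enum class MemSpace { host, device };

struct StreamId
{
    std::size_t value;
    std::size_t get() const { return value; }
};

struct CoreParams {};  // stand-in: would also hold the proposed MPI communicator

template<MemSpace M>
struct CoreState
{
    StreamId id;
    StreamId stream_id() const { return id; }
};

template<MemSpace M>
class EndRunActionInterface
{
  public:
    virtual ~EndRunActionInterface() = default;

    // Reduce over all states at end of run; implementations would always
    // write the merged result to the state with StreamId{0}
    virtual void end_run(CoreParams const& params,
                         CoreState<M>& state,
                         std::span<CoreState<M> const* const> all_states)
        = 0;

  protected:
    // Consistency check from item 1: each state must occupy the slot
    // matching its stream ID
    static void validate(CoreState<M> const& state,
                         std::span<CoreState<M> const* const> all_states)
    {
        assert(state.stream_id().get() < all_states.size()
               && all_states[state.stream_id().get()] == &state);
    }
};
```

For item 2, making the aux state vec a shared pointer would similarly let the output adapter keep the state data alive after the stepper is destroyed.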
sethrj commented 3 weeks ago

This is going to be troublesome for the different ways that we execute across threads. If we're running with OpenMP, we know that all states start and stop at a synchronization point, so it's easy to send a vector of state references when each thread has finished but the states are still allocated. However, if we're running through Geant4 MT, the "EndOfRunAction" will be called individually on each worker thread and then on the "master" thread, but we have to deallocate each state on its original thread, which creates an ordering issue.

Perhaps instead of trying to make our destructor ordering work with Geant4's threading model, we should add special cases for anything that has to manage Geant4 objects:

Now that we have a LocalTransporter, we could have it manage shared pointers to the hit processors, and then give weak pointers (for safety) plus raw pointers (for performance: since the hit processors are only "shared" within a single thread, we don't have to use locking) to the hit manager.
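A rough sketch of that ownership scheme, with simplified stand-ins (LocalTransporter is a real class; the HitProcessor/HitManager definitions and member functions here are hypothetical):

```cpp
#include <memory>

class HitProcessor {};  // stand-in for the Geant4-side hit processor

class HitManager
{
  public:
    // Keep both handles: the weak pointer for lifetime safety, the raw
    // pointer for the hot path. No locking is needed because the processor
    // is only "shared" within a single thread.
    void set_processor(std::weak_ptr<HitProcessor> weak, HitProcessor* raw)
    {
        weak_ = std::move(weak);
        raw_ = raw;
    }

  private:
    std::weak_ptr<HitProcessor> weak_;
    HitProcessor* raw_{nullptr};
};

class LocalTransporter
{
  public:
    // The transporter owns the processor, tying its lifetime to the
    // thread-local transporter rather than to any shared object
    LocalTransporter() : hit_processor_{std::make_shared<HitProcessor>()} {}

    void register_with(HitManager& hm)
    {
        hm.set_processor(hit_processor_, hit_processor_.get());
    }

  private:
    std::shared_ptr<HitProcessor> hit_processor_;
};
```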

We also need to add an aux state vector argument to StepInterface::process_steps so that step processors can keep stateful data without collection mirrors.
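A hypothetical version of that extension (StepInterface and process_steps are real names; the AuxStateVec definition here and the extra argument are assumptions):

```cpp
#include <memory>
#include <vector>

class AuxStateInterface
{
  public:
    virtual ~AuxStateInterface() = default;
};

// Per-stream vector of type-erased auxiliary state
using AuxStateVec = std::vector<std::unique_ptr<AuxStateInterface>>;

struct StepStateData {};  // stand-in for the gathered step data

class StepInterface
{
  public:
    virtual ~StepInterface() = default;

    // Passing the aux state vector lets a step processor keep stateful,
    // per-stream data without needing a collection mirror
    virtual void process_steps(StepStateData const& steps, AuxStateVec& aux)
        = 0;
};
```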

So the order of this will be:

  1. Change hit processor ownership so that it's managed by the local transporter but a weak pointer is kept by the hit manager.
  2. Have the local transporter register the states (or the stepper) with the main "shared params" so that the states can be merged and finalized at once (sketched below).
  3. Add a special case in the LocalTransporter for deallocating Geant4 geometry states on-thread (?)
  4. Then we can start making more components stateful and gatherable: action timers, calorimeters, etc.
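A sketch of step 2's registration and merge/finalize flow (SharedParams and LocalTransporter are real class names; the members and methods here are hypothetical):

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct CoreStateBase
{
    virtual ~CoreStateBase() = default;
};

class SharedParams
{
  public:
    // Called by each local transporter (or its stepper) when its state is
    // created; shared ownership lets the state outlive the stepper
    void register_state(std::size_t stream, std::shared_ptr<CoreStateBase> s)
    {
        std::lock_guard<std::mutex> lock{mutex_};
        if (states_.size() <= stream)
        {
            states_.resize(stream + 1);
        }
        states_[stream] = std::move(s);
    }

    // Called once on the master thread after all workers have finished:
    // merge diagnostics across streams, then release all states at once
    void finalize()
    {
        // ... reduce diagnostics over states_ into stream 0 ...
        states_.clear();
    }

  private:
    std::mutex mutex_;
    std::vector<std::shared_ptr<CoreStateBase>> states_;
};
```

Note that releasing the states here on the master thread is exactly what step 3's special case would have to reconcile with deallocating Geant4 geometry states on their original threads.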