Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
In an earlier iteration of Celeritas, we pushed all physics "interactions" to a single vector (one per track) and then applied them all simultaneously. When we changed the code so that the `InteractionApplier` updates the track directly, we saw slightly worse performance but simpler logic.
With @esseivaju's async allocators, I think we should consider revisiting this: asynchronously allocate space for secondaries and interactions between the `pre-post` and `post` steps, have a `post-post` kernel update all the tracks with their interactions at once, and deallocate the buffers afterward. This would also slightly simplify the logic in the `PreStepExecutor`, which currently requires launching on all threads to reset the secondary initializer count. I think it should also improve kernel occupancy (and reduce code size) for the model kernels.