celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/

Performance analysis and optimizations for demo loop #284

Closed sethrj closed 1 year ago

sethrj commented 3 years ago

With an essentially fully functional EM simulation in place, dig into the performance characteristics, experiment with opportunities for improvement, and publish the results of those as we did for the KN demo.

sethrj commented 2 years ago

Some notes so far while writing up the SciDAC proposal:

sethrj commented 2 years ago

For reference, #384 substantially improved the early per-step performance when diagnostics were enabled.

sethrj commented 1 year ago

@esseivaju will post some of his initial results here and we'll close this in favor of more granular issues.

sethrj commented 1 year ago

actions-cms2018+field+msc-vecgeom-gpu.pdf actions-cms2018-vecgeom-gpu.pdf

For reference, here are the CMS2018 times for each of the actions: it's dominated by the along-step kernel.

esseivaju commented 1 year ago

Initial profiling of along_step_uniform_msc_kernel

Configuration

- Test problem: cms2018 +field +msc
- Celeritas version: 7c3f5f72ec4a5afbf184f69b5c5da51e10233af8
- Dependencies: geant4@11.0.3, vecgeom@1.2.1
- Compilers: gcc 11.2.0, nvcc 11.8.89
- Platform: Perlmutter compute node; AMD EPYC 7763 / NVIDIA A100 (sm_80)

GPU throughput

The kernel is memory bound; however, both compute and memory throughput are low, peaking at ~10% compute throughput and ~6% memory throughput.

Occupancy

The kernel has low occupancy, limited by its register usage of 255 registers/thread. With a block size of 256, there is 1 resident block per SM, i.e. 8 warps, which is 12.5% of the maximum threads/SM for this GPU; a grid size of 4096 gives ~38 waves.

Trying to limit the number of registers using __launch_bounds__() fails to compile, with the following VecGeom functions listed as register-pressure points:

//requires 139 regs
vecgeom::cuda::VSafetyEstimatorHelper<vecgeom::cuda::BVHSafetyEstimator>::ComputeSafety(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::NavStateIndex const&) const

//requires 157 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyconeImplementation<vecgeom::cuda::ConeTypes::UniversalCone>, vecgeom::cuda::UnplacedPolycone>::DistanceToOut(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const

//requires 163 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyconeImplementation<vecgeom::cuda::ConeTypes::UniversalCone>, vecgeom::cuda::UnplacedPolycone>::DistanceToIn(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const

//requires 164 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyhedronImplementation<(EInnerRadii)0, (EPhiCutout)0>, vecgeom::cuda::VUnplacedVolume>::DistanceToOut(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const

//requires 220 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyhedronImplementation<(EInnerRadii)0, (EPhiCutout)0>, vecgeom::cuda::VUnplacedVolume>::SafetyToOut(vecgeom::cuda::Vector3D<double> const&) const

// requires 254 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::GenTrapImplementation, vecgeom::cuda::VUnplacedVolume>::DistanceToOut(const vecgeom::cuda::Vector3D<double> &, const vecgeom::cuda::Vector3D<double> &, double) const 

Warp state & scheduler utilization

With 8 warps/SM, we have 2 warps per scheduler. Regardless of whether we're at the start, middle, or end of the simulation, each scheduler issues on average one instruction every 10 cycles; for 90% of cycles there are no eligible warps.


One reason is the low occupancy. Another is that load balancing is a challenge in MC simulation: some threads finish earlier than others, leaving fewer active warps. Masking threads assigned to dead tracks also contributes to load imbalance. The warp state highlights some potential challenges:

Early in the simulation, threads are mostly waiting on memory transfers: of an average of 20 cycles/instruction, 12 are stalled waiting on a memory operation. Cache hit rates for L1/L2 are ~60%/40%, respectively. Thread divergence is also an issue, even early in the simulation when most tracks are still alive: we average only ~20 active threads per instruction.


Later in the simulation, cache hit rates improve to ~85%/96% for L1/L2. This makes sense since fewer threads are active, so less data competes for cache. Thread divergence deteriorates, however: we drop to ~3 active threads per warp, probably mainly because threads assigned to dead tracks finish early. With still ~17 cycles/instruction on average, warps now spend about half their time (~9 cycles) waiting for the next instruction to be fetched.

sethrj commented 1 year ago

@esseivaju This is excellent! It's a little surprising to me that the GenTrap implementation is the long pole in the tent: digging in a little bit, it's likely due to SecondOrderSurfaceShell<4>::ComputeSminSmax, which seems to have 24 local doubles in addition to the 8 doubles used by SecondOrderSurfaceShell<4>::DistanceToOut. Perhaps the VecGeom team can comment on that Monday...

If it's just the geometry that's slow, we might be able to break the along-step kernel into three kernels, of which only the middle interacts with the geometry.

Great work with these initial results, and we'll resume with https://github.com/celeritas-project/celeritas/issues/675 .