celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/

Performance analysis and optimizations for demo loop #284

Closed sethrj closed 1 year ago

sethrj commented 3 years ago

With an essentially fully functional EM simulation in place, dig into the performance characteristics, experiment with opportunities for improvement, and publish the results of those as we did for the KN demo.

sethrj commented 2 years ago

Some notes so far while writing up the SciDAC proposal:

sethrj commented 2 years ago

For reference, #384 substantially improved the early per-step performance when diagnostics were enabled.

sethrj commented 1 year ago

@esseivaju will post some of his initial results here and we'll close this in favor of more granular issues.

sethrj commented 1 year ago

actions-cms2018+field+msc-vecgeom-gpu.pdf actions-cms2018-vecgeom-gpu.pdf

For reference, here are the CMS2018 times for each of the actions: it's dominated by the along-step kernel.

esseivaju commented 1 year ago

Initial profiling of along_step_uniform_msc_kernel

Configuration

- Test problem: cms2018 +field +msc
- Celeritas version: 7c3f5f72ec4a5afbf184f69b5c5da51e10233af8
- Dependencies: geant4@11.0.3, vecgeom@1.2.1
- Compilers: gcc 11.2.0, nvcc 11.8.89
- Platform: Perlmutter compute node; AMD EPYC 7763 / NVIDIA A100 (sm_80)

GPU throughput

The kernel is memory bound; however, both compute and memory throughput are low, peaking at ~10% compute throughput and ~6% memory throughput.

Occupancy

The kernel has low occupancy, limited by its register usage of 255 registers/thread. With a block size of 256, there is 1 resident block per SM, i.e. 8 warps, which is 12.5% of the maximum threads/SM for this GPU; a grid size of 4096 gives ~38 waves.

Trying to limit the number of registers using __launch_bounds__() fails to compile, with the following VecGeom functions listed as register-pressure points:

//requires 139 regs
vecgeom::cuda::VSafetyEstimatorHelper<vecgeom::cuda::BVHSafetyEstimator>::ComputeSafety(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::NavStateIndex const&) const

//requires 157 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyconeImplementation<vecgeom::cuda::ConeTypes::UniversalCone>, vecgeom::cuda::UnplacedPolycone>::DistanceToOut(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const

//requires 163 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyconeImplementation<vecgeom::cuda::ConeTypes::UniversalCone>, vecgeom::cuda::UnplacedPolycone>::DistanceToIn(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const

//requires 164 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyhedronImplementation<(EInnerRadii)0, (EPhiCutout)0>, vecgeom::cuda::VUnplacedVolume>::DistanceToOut(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const

//requires 220 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyhedronImplementation<(EInnerRadii)0, (EPhiCutout)0>, vecgeom::cuda::VUnplacedVolume>::SafetyToOut(vecgeom::cuda::Vector3D<double> const&) const

// requires 254 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::GenTrapImplementation, vecgeom::cuda::VUnplacedVolume>::DistanceToOut(const vecgeom::cuda::Vector3D<double> &, const vecgeom::cuda::Vector3D<double> &, double) const 

Warp state & scheduler utilization

With 8 warps/SM, we have 2 warps per scheduler. Regardless of whether we're at the start, middle, or end of the simulation, each scheduler issues on average one instruction every 10 cycles; for 90% of cycles there are no eligible warps.


One reason is the low occupancy. Another is that load balancing is a challenge in MC simulation: some threads finish earlier than others, leaving fewer active warps. Masking threads assigned to dead tracks also contributes to load imbalance. The warp state highlights some potential challenges:

Early in the simulation, threads are mostly waiting on memory transfers: of an average of 20 cycles/instruction, 12 are stalled waiting on a memory operation. Cache hit rates for L1/L2 are ~60%/40%, respectively. Thread divergence is also an issue, even early in the simulation when most tracks are still alive: we average only ~20 active threads per instruction.


Later in the simulation, cache hit rates improve to ~85%/96% for L1/L2. This makes sense since fewer threads are active, so less data competes for cache. Thread divergence deteriorates, however: we drop to ~3 active threads per warp, probably mainly because threads assigned to dead tracks finish early. With still ~17 cycles/instruction on average, warps now spend about half their time (~9 cycles) waiting for the next instruction to be fetched.

sethrj commented 1 year ago

@esseivaju This is excellent! It's a little surprising to me that the GenTrap implementation is the long pole in the tent: digging in a little bit, it's likely due to SecondOrderSurfaceShell<4>::ComputeSminSmax, which seems to have 24 local doubles in addition to the 8 doubles used by SecondOrderSurfaceShell<4>::DistanceToOut. Perhaps the VecGeom team can comment on that Monday...

If it's just the geometry that's slow, we might be able to break the along-step kernel into three kernels, of which only the middle interacts with the geometry.

Great work with these initial results, and we'll resume with https://github.com/celeritas-project/celeritas/issues/675 .