Some notes so far while writing up the SciDAC proposal:
For reference, #384 substantially improved the early per-step performance when diagnostics were enabled.
@esseivaju will post some of his initial results here and we'll close this in favor of more granular issues.
Attachments:
- actions-cms2018+field+msc-vecgeom-gpu.pdf
- actions-cms2018-vecgeom-gpu.pdf
For reference, here are the CMS2018 times for each of the actions: it's dominated by the along-step kernel.

- Profiled kernel: `along_step_uniform_msc_kernel`
- Test problem: cms2018 +field +msc
- Celeritas version: 7c3f5f72ec4a5afbf184f69b5c5da51e10233af8
- Dependencies: geant4@11.0.3, vecgeom@1.2.1
- Compilers: gcc 11.2.0, nvcc 11.8.89
- Platform: Perlmutter compute node; AMD EPYC 7763 / NVIDIA A100 (sm_80)
The kernel is memory bound; however, both compute and memory throughput are low, reaching at best only ~10% compute throughput and ~6% memory throughput.

The kernel also has low occupancy, limited by its usage of 255 registers/thread. With a block size of 256, only one block can be resident per SM, i.e. 8 warps, which is 12.5% of the maximum number of threads/SM on this GPU; a grid size of 4096 blocks therefore executes in ~38 waves.
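A back-of-the-envelope check of those occupancy numbers (the hardware limits below are the published sm_80 values; this is plain arithmetic, not profiler output):

```c++
// Occupancy arithmetic for sm_80 (A100)
constexpr int regs_per_thread = 255;     // from the kernel above
constexpr int block_size = 256;
constexpr int regs_per_sm = 64 * 1024;   // 64K 32-bit registers per SM
constexpr int max_threads_per_sm = 2048;
constexpr int num_sms = 108;
constexpr int grid_size = 4096;

// 255 regs * 256 threads = 65280 <= 65536, so exactly one block fits per SM:
constexpr int blocks_per_sm = regs_per_sm / (regs_per_thread * block_size); // 1
constexpr int warps_per_sm = blocks_per_sm * block_size / 32;               // 8
// Occupancy: 8 warps * 32 threads / 2048 = 12.5% of the SM's thread capacity
// Waves: 4096 blocks / (108 SMs * 1 block/SM) ~= 38
```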
Trying to limit the number of registers using `__launch_bounds__()` fails to compile, with the build listing the following VecGeom functions as register pressure points (see the sketch after the list):
```
// requires 139 regs
vecgeom::cuda::VSafetyEstimatorHelper<vecgeom::cuda::BVHSafetyEstimator>::ComputeSafety(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::NavStateIndex const&) const
// requires 157 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyconeImplementation<vecgeom::cuda::ConeTypes::UniversalCone>, vecgeom::cuda::UnplacedPolycone>::DistanceToOut(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const
// requires 163 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyconeImplementation<vecgeom::cuda::ConeTypes::UniversalCone>, vecgeom::cuda::UnplacedPolycone>::DistanceToIn(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const
// requires 164 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyhedronImplementation<(EInnerRadii)0, (EPhiCutout)0>, vecgeom::cuda::VUnplacedVolume>::DistanceToOut(vecgeom::cuda::Vector3D<double> const&, vecgeom::cuda::Vector3D<double> const&, double) const
// requires 220 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::PolyhedronImplementation<(EInnerRadii)0, (EPhiCutout)0>, vecgeom::cuda::VUnplacedVolume>::SafetyToOut(vecgeom::cuda::Vector3D<double> const&) const
// requires 254 regs
vecgeom::cuda::CommonUnplacedVolumeImplHelper<vecgeom::cuda::GenTrapImplementation, vecgeom::cuda::VUnplacedVolume>::DistanceToOut(const vecgeom::cuda::Vector3D<double> &, const vecgeom::cuda::Vector3D<double> &, double) const
```
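For reference, the attempt looked roughly like this (a minimal sketch; the kernel parameters are elided and the bounds values are illustrative, not necessarily the ones actually tried):

```c++
// Illustrative: asking ptxas to fit two resident blocks of 256 threads per SM
// caps the kernel at 65536 / (256 * 2) = 128 registers per thread.
__global__ void __launch_bounds__(256, 2)
along_step_uniform_msc_kernel(/* ...params and state refs... */)
{
    // ...
}
// Because the separately compiled VecGeom device functions above already
// require 139-254 registers each, ptxas cannot satisfy the cap and aborts,
// reporting those callees as the pressure points.
```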
With 8 warps/SM, we have 2 warps per scheduler. Regardless of whether we're at the start, middle, or end of the simulation, each scheduler issues on average only one instruction every ~10 cycles; for 90% of cycles there is no eligible warp. One reason is the low occupancy. Another is that load balancing is a challenge in MC simulation: some threads finish earlier than others, leaving fewer active warps, and masking off threads assigned to dead tracks contributes further to the imbalance. The warp-state statistics highlight some potential challenges:
Early in the simulation, threads are mostly waiting on memory: of ~20 cycles/instruction on average, 12 are stalled on a memory operation. Cache hit rates for L1/L2 are ~60%/40%, respectively. Thread divergence is also an issue, even early in the simulation when most tracks are still alive: we average only ~20 active threads per instruction out of 32 per warp.

Later in the simulation, the cache hit rates improve to ~85%/96% for L1/L2, which makes sense since fewer threads are active and so less data competes for the caches. Thread divergence deteriorates, however: we drop to ~3 active threads per warp, most likely because threads whose tracks have died complete early. With still ~17 cycles/instruction on average, warps now spend about half their time (~9 cycles) waiting for the next instruction to be fetched.
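A minimal sketch of the early-exit pattern behind that divergence (the `TrackSlot` type and physics call are hypothetical stand-ins, not the actual Celeritas track-state code):

```c++
// Hypothetical, simplified track state; the real Celeritas states are SoA.
struct TrackSlot
{
    bool alive;
    // ...position, direction, energy, ...
};

__device__ void do_physics_step(TrackSlot&) { /* placeholder for real work */ }

__global__ void step_kernel(TrackSlot* tracks, int num_tracks)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_tracks || !tracks[tid].alive)
        return;  // Lanes for dead tracks exit here but still occupy the warp.

    // If 29 of 32 tracks in a warp have died, only 3 lanes do useful work,
    // matching the ~3 active threads per instruction observed above.
    do_physics_step(tracks[tid]);
}
```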
@esseivaju This is excellent! It's a little surprising to me that the GenTrap implementation is the long pole in the tent: digging in a little bit, it's likely due to `SecondOrderSurfaceShell<4>::ComputeSminSmax`, which seems to have 24 local doubles on top of the 8 doubles in `SecondOrderSurfaceShell<4>::DistanceToOut`, which calls it. Since each double occupies a pair of 32-bit registers, those ~32 live doubles alone could account for ~64 registers. Perhaps the VecGeom team can comment on that Monday...
If it's just the geometry that's slow, we might be able to break the along-step kernel into three kernels, of which only the middle interacts with the geometry.
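A rough sketch of what that split could look like (the kernel names, responsibilities, and `StateRef` type are invented for illustration; the real action interfaces would differ):

```c++
// Placeholder for device-side references to the persistent track state.
struct StateRef { /* ... */ };

// Only the middle kernel would touch VecGeom, so it alone pays the register
// cost and can get its own block size and launch bounds.
__global__ void pre_step_kernel(StateRef)  { /* MSC step limit, field setup */ }
__global__ void propagate_kernel(StateRef) { /* geometry + field propagation */ }
__global__ void post_step_kernel(StateRef) { /* energy loss, MSC scattering */ }

void along_step(StateRef state, unsigned int num_track_slots)
{
    unsigned int const block = 256;
    unsigned int const grid = (num_track_slots + block - 1) / block;
    pre_step_kernel<<<grid, block>>>(state);
    propagate_kernel<<<grid, block>>>(state);
    post_step_kernel<<<grid, block>>>(state);
}
```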
Great work with these initial results, and we'll resume with https://github.com/celeritas-project/celeritas/issues/675.
With an essentially fully functional EM simulation in place, dig into the performance characteristics, experiment with opportunities for improvement, and publish the results as we did for the KN demo.