attn @mhaseeb123 (no need to respond, Remi will summarize the plan)
Thank you for the investigation @roelof-groenewald. If possible, can you please label the data points (collision time info), or perhaps attach plots in a comment without a log y-axis? It's usually difficult to judge the severity of differences on log-scale axes; the x-axis can stay in log scale, though. Thanks!
Some first thoughts: it looks like the loop over collision pairs is faster for smaller and non-uniform grids, and slower for larger, uniform ones, so we could empirically find a knee point and use it to choose the collision type. I would also be partial to using one heuristic, or a combination of heuristics, to make this choice.
Another idea: run one algorithm on the first step and the other on the second step, then use whichever took less time for all remaining steps, as in the sketch below.
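A minimal sketch of that auto-tuning idea (all names here are hypothetical, not WarpX API): time each scheme once during the first two steps, then lock in the faster one.

```cpp
#include <chrono>

// Hypothetical auto-tuning wrapper (illustrative only; not WarpX code).
// Runs the cell-loop scheme on step 0 and the pair-loop scheme on step 1,
// then sticks with whichever was faster for all remaining steps.
enum class CollisionScheme { LoopOverCells, LoopOverPairs };

class CollisionAutoTuner
{
public:
    template <typename CellLoopFn, typename PairLoopFn>
    void doCollisions (int step, CellLoopFn&& cell_loop, PairLoopFn&& pair_loop)
    {
        if (step == 0) {
            m_cell_time = timeIt(cell_loop);
        } else if (step == 1) {
            m_pair_time = timeIt(pair_loop);
            m_choice = (m_pair_time < m_cell_time)
                ? CollisionScheme::LoopOverPairs
                : CollisionScheme::LoopOverCells;
        } else if (m_choice == CollisionScheme::LoopOverPairs) {
            pair_loop();
        } else {
            cell_loop();
        }
    }

private:
    template <typename Fn>
    static double timeIt (Fn&& fn)
    {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    double m_cell_time = 0.0;
    double m_pair_time = 0.0;
    CollisionScheme m_choice = CollisionScheme::LoopOverCells;
};
```

One caveat with this approach: the first steps may not be representative (e.g. if the plasma becomes non-uniform later in the run), so the measurement could also be repeated periodically.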
Thanks, these are good points @mhaseeb123. At this point, our plan is to first better understand the overheads that the new algorithm (the loop over independent pairs) introduces. In particular, I am wondering whether the binary search introduced here: https://github.com/mhaseeb123/WarpX/blob/ff7da3f8da30e3f69ba4399c8d848886a19be8d8/Source/Particles/Collision/BinaryCollision/BinaryCollision.H#L398 is responsible for most of the new algorithm's overhead.
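For context, here is a minimal sketch of the kind of per-pair binary search being discussed (assuming, as is typical, a prefix-sum array of pair counts per cell; the names are illustrative, not the actual WarpX code). Mapping a global pair index back to its cell costs O(log Ncells) per pair, which is overhead the cell-loop scheme does not pay.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch (not the actual WarpX code): given an exclusive
// prefix sum of the number of pairs in each cell (starting at 0), find
// the cell that a global pair index belongs to with a binary search.
// Doing this once per pair adds an O(log Ncells) factor on top of the
// per-pair collision physics.
std::size_t cellOfPair (const std::vector<std::size_t>& pair_offsets,
                        std::size_t global_pair_index)
{
    // upper_bound returns the first offset strictly greater than the
    // pair index; the owning cell is the one just before it.
    auto it = std::upper_bound(pair_offsets.begin(), pair_offsets.end(),
                               global_pair_index);
    return static_cast<std::size_t>(it - pair_offsets.begin()) - 1;
}
```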
I observed the same issue on CPUs. The wall time scales with Nppc squared at large values of Nppc. The main issue seems to be that O(Nppc) operations are performed inside UpdateMomentumPerezElastic(), which is called once for each binary pair; since the number of pairs is itself O(Nppc), the net cost of the binary collision method scales as O(Nppc^2).
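Schematically (hypothetical function, just to make the scaling argument concrete; the real per-pair kernel in WarpX is UpdateMomentumPerezElastic):

```cpp
// Illustrative sketch of the observed scaling (not the actual kernel).
// The outer loop visits O(Nppc) binary pairs per cell; if the per-pair
// update itself performs O(Nppc) work, the total per-cell cost grows
// as O(Nppc^2).
long collideCellCost (int n_ppc)
{
    long ops = 0;
    const int n_pairs = n_ppc / 2;              // O(Nppc) pairs per cell
    for (int ipair = 0; ipair < n_pairs; ++ipair) {
        for (int k = 0; k < n_ppc; ++k) {       // hidden O(Nppc) work per pair
            ++ops;                              // stand-in for the per-pair update
        }
    }
    return ops;                                 // ~ Nppc^2 / 2
}
```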
I have a PR that will be submitted soon to fix this issue. Below are scaling results for this test problem on CPUs: red is the development branch, blue is the binary opt branch (from the upcoming PR), and yellow is the old branch that loops over cells rather than over binary pairs (58e6b8def...).
Here are the results from running on Lassen GPUs, courtesy of @dpgrote, showing similar trends to those on Perlmutter.
On a single GPU with an 8^3 grid, the new PR performs better than the previous method that looped over cells; on the 32^3 grid the two are about the same.
Fix coming via #5066
The recent improvements to the binary collision parallelization strategy (https://github.com/ECP-WarpX/WarpX/pull/4577) have shown excellent speed-ups in high particle-per-cell cases where the particle density is not uniform. However, recent testing with uniform plasmas showed that the new scheme performs somewhat worse than the previous one. From @RemiLehe:
The input file below produced the following performance trends on Perlmutter GPU nodes ("loop over collision pairs" is the new scheme):