attn @mhaseeb123 (no need to respond, Remi will summarize the plan)
Thank you for the investigation @roelof-groenewald. If possible, can you please label the data points (collision time info), or perhaps attach plots in a comment without a log y-axis? It's usually difficult to judge the severity of differences on log-scale axes; the x-axis can stay in log scale, though. Thanks!
Some first thoughts: it looks like the loop over collision pairs is faster for smaller and non-uniform grids, and slower for larger, uniform ones, so we could empirically find a knee point and use it to choose the collision type. I would also be partial to using one heuristic, or a combination of heuristics, to make this choice.
Another idea: run one algorithm on the first step and the other on the second step, then use whichever took less time for all remaining steps, as in the sketch below.
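A minimal sketch of that auto-tuning idea (all names here are hypothetical, not WarpX API): time each scheme once during the first two steps, then lock in the faster one.

```cpp
#include <chrono>

// Hypothetical auto-tuning wrapper (illustrative only; not WarpX code).
// Runs the cell-loop scheme on step 0 and the pair-loop scheme on step 1,
// then sticks with whichever was faster for all remaining steps.
enum class CollisionScheme { LoopOverCells, LoopOverPairs };

class CollisionAutoTuner
{
public:
    template <typename CellLoopFn, typename PairLoopFn>
    void doCollisions (int step, CellLoopFn&& cell_loop, PairLoopFn&& pair_loop)
    {
        if (step == 0) {
            m_cell_time = timeIt(cell_loop);
        } else if (step == 1) {
            m_pair_time = timeIt(pair_loop);
            m_choice = (m_pair_time < m_cell_time)
                ? CollisionScheme::LoopOverPairs
                : CollisionScheme::LoopOverCells;
        } else if (m_choice == CollisionScheme::LoopOverPairs) {
            pair_loop();
        } else {
            cell_loop();
        }
    }

private:
    template <typename Fn>
    static double timeIt (Fn&& fn)
    {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    double m_cell_time = 0.0;
    double m_pair_time = 0.0;
    CollisionScheme m_choice = CollisionScheme::LoopOverCells;
};
```

One caveat with this approach: the first steps may not be representative (e.g. if the plasma becomes non-uniform later in the run), so the measurement could also be repeated periodically.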
Thanks, these are good points @mhaseeb123. At this point, our plan is to first better understand the overheads that the new algorithm (the loop over independent pairs) introduces. In particular, I am wondering whether the binary search introduced here: https://github.com/mhaseeb123/WarpX/blob/ff7da3f8da30e3f69ba4399c8d848886a19be8d8/Source/Particles/Collision/BinaryCollision/BinaryCollision.H#L398 is responsible for most of the new algorithm's overhead.
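For context, here is a minimal sketch of the kind of per-pair binary search being discussed (assuming, as is typical, a prefix-sum array of pair counts per cell; the names are illustrative, not the actual WarpX code). Mapping a global pair index back to its cell costs O(log Ncells) per pair, which is overhead the cell-loop scheme does not pay.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch (not the actual WarpX code): given an exclusive
// prefix sum of the number of pairs in each cell (starting at 0), find
// the cell that a global pair index belongs to with a binary search.
// Doing this once per pair adds an O(log Ncells) factor on top of the
// per-pair collision physics.
std::size_t cellOfPair (const std::vector<std::size_t>& pair_offsets,
                        std::size_t global_pair_index)
{
    // upper_bound returns the first offset strictly greater than the
    // pair index; the owning cell is the one just before it.
    auto it = std::upper_bound(pair_offsets.begin(), pair_offsets.end(),
                               global_pair_index);
    return static_cast<std::size_t>(it - pair_offsets.begin()) - 1;
}
```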
I observed the same issue on CPUs. The wall time scales with Nppc squared at large values of Nppc. The main issue seems to be that O(Nppc) operations are performed inside UpdateMomentumPerezElastic(), which is called once for each binary pair; since the number of pairs is itself O(Nppc), the net cost of the binary collision method scales as O(Nppc^2).
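Schematically (hypothetical function, just to make the scaling argument concrete; the real per-pair kernel in WarpX is UpdateMomentumPerezElastic):

```cpp
// Illustrative sketch of the observed scaling (not the actual kernel).
// The outer loop visits O(Nppc) binary pairs per cell; if the per-pair
// update itself performs O(Nppc) work, the total per-cell cost grows
// as O(Nppc^2).
long collideCellCost (int n_ppc)
{
    long ops = 0;
    const int n_pairs = n_ppc / 2;              // O(Nppc) pairs per cell
    for (int ipair = 0; ipair < n_pairs; ++ipair) {
        for (int k = 0; k < n_ppc; ++k) {       // hidden O(Nppc) work per pair
            ++ops;                              // stand-in for the per-pair update
        }
    }
    return ops;                                 // ~ Nppc^2 / 2
}
```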
I have a PR that will be submitted soon to fix this issue. Below are scaling results for this test problem on CPUs: red is the development branch, blue is the binary opt branch (from the upcoming PR), and yellow is the old branch that loops over cells rather than over binary pairs (58e6b8def...).
Here are the results from running on Lassen GPUs, courtesy of @dpgrote, showing similar trends to those on Perlmutter.
On a single GPU with an 8^3 grid, the new PR performs better than the previous method that looped over cells; on the 32^3 grid the two are about the same.
Fix coming via #5066
The recent improvements to the binary collision parallelization strategy (https://github.com/ECP-WarpX/WarpX/pull/4577) have shown excellent speed-ups in high particle-per-cell cases where the particle density is not uniform. However, recent testing with uniform plasmas showed that the new scheme performs somewhat worse than the previous one. From @RemiLehe:
The input file below produced the following performance trends on Perlmutter GPU nodes ("loop over collision pairs" is the new scheme):