Open ax3l opened 6 years ago
Thanks for posting those benchmarks! Could you add the number of cells in each dimension? Also, could you confirm that this is on a single GPU?
You are welcome :) Yes, one GPU with 512x256x256 cells: see the .cfg file. The full input directory is also linked in the PR description under Setup; I will make the link more prominent.
@ax3l sorry for nitpicking, but what do you mean by "super-linear scaling" in the important note? To me, the natural meaning of that phrase is the opposite of the conclusion you state: if something scales super-linearly, the costs grow more than proportionally to the increase in ppc, which in our case is the opposite (up to a reasonably large number).
Ah, now I see your point. Just to clarify mine: to me the association worked like "memory structures" + "super-linear scaling" -> data structures + super-linear complexity.
Got ya, added the link to clarify :) Do you mind if I minimize our comments as off-topic to collapse the thread? :)
@ax3l please do.
cc @steindev
@psychocoderHPC Another V100 test with 2x16ppc tuned for maximum particles per GPU (memory footprint estimates): https://github.com/ax3l/picongpu/tree/topic-20200213-thermalBench/share/picongpu/examples/ThermalBenchmark
Is there any reason to keep this issue open?
I am pretty certain this can be closed.
This issue was opened to define an example for comparison between different PIC codes. If I remember correctly, validation is required too. I do not think we should close the issue; it is still a todo to define a reproducible example that can be validated for correctness and compared with other codes.
This just documents a little benchmark for @RemiLehe et al.
PIConGPU Version
0.4.2 (11/2018)
Backend & Hardware
Used Software
picongpu --version:
Setup
See here for the full input parameters:
Important Notes
4 particles per species and cell is considered "not much" for PMacc.
Our memory structures scale super-linearly (per-particle cost decreases) up to a few dozen particles per cell, after which they scale linearly; this design keeps them well suited for high-resolution overdense plasmas, photons, ionization electrons, etc. A low number of particles per species can under-utilize these structures.
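For a rough sense of scale, here is a back-of-the-envelope sketch of the macro-particle count and memory footprint this setup puts on one GPU. The cell count and the 4 particles per species and cell are taken from the setup above; the two-species assumption (electrons + ions) and the bytes-per-macro-particle value are illustrative assumptions, not values taken from the PIConGPU source.

```python
# Back-of-envelope: macro-particles per GPU for this benchmark setup.
cells = 512 * 256 * 256          # grid cells on one GPU (from the setup above)
ppc_per_species = 4              # macro-particles per species and cell (see note above)
num_species = 2                  # ASSUMPTION: electrons + ions

particles = cells * ppc_per_species * num_species
print(f"macro-particles per GPU: {particles / 1e6:.0f} million")   # ~268 million

# ASSUMPTION: ~40 bytes per macro-particle in 32-bit precision
# (position offsets, momentum, weighting, bookkeeping) -- illustrative only.
bytes_per_particle = 40
print(f"rough particle memory: {particles * bytes_per_particle / 2**30:.1f} GiB")
```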
Cases (Variants)
Case 0: 32bit floats, optimized trajectory Esirkepov current deposition ("EmZ", unpublished)
Case 1: 64bit floats, optimized trajectory Esirkepov current deposition ("EmZ", unpublished)
Case 2: 32bit floats, optimized, regular Esirkepov current deposition
Case 3: 64bit floats, optimized, regular Esirkepov current deposition
Case 0 is the most realistic one that I would choose for an underdense plasma simulation as well. Everything else is for comparisons with other implementations.
Status
Preliminary Results
Case 0
PCS (3rd order, pc. cubic), run at node gp002: init time 146 s, main simulation time 887 s, time per time step 887 ms, 3.30 ns / particle and time step, 302 million particles advanced / device and second
TSC (2nd order, pc. quadratic), run at node gp001: init time 146 s, main simulation time 473 s, time per time step 473 ms, 1.76 ns / particle and time step, 568 million particles advanced / device and second
CIC (1st order, pc. linear), run at node gp001: init time 146 s, main simulation time 264 s, time per time step 264 ms, 0.98 ns / particle and time step, 1.02 billion particles advanced / device and second
Case 1
PCS (3rd order, pc. cubic)
TSC (2nd order, pc. quadratic), run at node gp001: init time 70 s, main simulation time 384 s, time per time step 384 ms, 2.86 ns / particle and time step, 350 million particles advanced / device and second
CIC (1st order, pc. linear), run at node gp001: init time 70 s, main simulation time 198 s, time per time step 198 ms, 1.48 ns / particle and time step, 678 million particles advanced / device and second
Case 2
PCS (3rd order, pc. cubic), run at node gp001: init time 146 s, main simulation time 1383 s, time per time step 1383 ms, 5.15 ns / particle and time step, 194 million particles advanced / device and second
TSC (2nd order, pc. quadratic), run at node gp002: init time 146 s, main simulation time 549 s, time per time step 548 ms, 2.04 ns / particle and time step, 490 million particles advanced / device and second
CIC (1st order, pc. linear), run at node gp001: init time 146 s, main simulation time 305 s, time per time step 305 ms, 1.14 ns / particle and time step, 880 million particles advanced / device and second
Case 3
...
Note on Metrics
Dear reader, please be aware that the metric "ns / particle and time step", commonly used in our community, is not very general, to put it mildly. Nevertheless, it gives you an idea of how well two implementations utilize the same given hardware.
If you want to compare different hardware, normalize again by the theoretical Flop/s of each device to see how the two implementations compare in terms of hardware utilization.
To be honest, what counts for me is "time to solution", a.k.a. "time to scientific results". With core PIConGPU (on P100) and 3D3V runs, this is on the order of 2 to 10 hours (20k-100k iterations) when fully subscribed, and significantly faster when strong scaling, where we can gain about an order of magnitude in speedup (not shown in this benchmark), up to about 20 iterations / second in 3D3V. The sketch below shows how these numbers relate.
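As a concrete illustration, the sketch below re-derives the Case 0 / PCS metrics from the reported time per step and shows the kind of normalization meant above. The cell count is from the setup and the 4 particles per species and cell from the notes; the two-species assumption and the example peak-Flop/s value are illustrative assumptions, not measured or reported values.

```python
# How the reported metrics relate, using the Case 0 / PCS numbers as the example.
cells = 512 * 256 * 256              # grid cells on one GPU (from the setup)
particles = cells * 4 * 2            # 4 ppc per species; ASSUMPTION: 2 species
time_per_step = 0.887                # s, Case 0 / PCS (reported: 887 ms)

ns_per_particle_step = time_per_step / particles * 1e9
particles_per_second = particles / time_per_step
print(f"{ns_per_particle_step:.2f} ns / particle and time step")                 # ~3.30 ns
print(f"{particles_per_second / 1e6:.0f} million particles / device and second") # ~302 million

# "Time to solution" for a production-scale run at this rate:
steps = 20_000                       # lower end of the 20k-100k iterations mentioned above
print(f"~{steps * time_per_step / 3600:.1f} h for {steps} steps")                # ~4.9 h

# To compare across different hardware, normalize by the device's theoretical peak.
# ASSUMPTION: example value only (roughly P100 PCIe FP32 peak); use your device's number.
peak_flops = 9.3e12
flop_budget = peak_flops * ns_per_particle_step * 1e-9
print(f"~{flop_budget:.0f} peak-Flop budget per particle and time step")
```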
Output Data
Currently stored at HZDR under
/bigdata/hplsim/scratch/huebl/thermalBench/