ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

3D3V Thermal Benchmark with 0.4.2 on P100 #2815

Open ax3l opened 6 years ago

ax3l commented 6 years ago

This just documents a little benchmark for @RemiLehe et al.

PIConGPU Version

0.4.2 (11/2018)

Backend & Hardware

Used Software

module gcc/7.3.0 
module cmake/3.11.3 
module cuda/9.2 
module openmpi/2.1.2-cuda92 
module boost/1.68.0 
module zlib/1.2.11 
module c-blosc/1.14.4 
module adios/1.13.1-cuda92 
module hdf5-parallel/1.8.20-cuda92 
module libsplash/1.7.0-cuda92 
module libpng/1.6.35 
module pngwriter/0.7.0

picongpu --version:

PIConGPU: 0.4.2
  Build-Type: Release

Third party:
  OS:         Linux-3.10.0-693.11.6.el7.x86_64
  arch:       x86_64
  CXX:        GNU (7.3.0)
  CMake:      3.11.3
  CUDA:       9.2.88
  mallocMC:   2.3.0
  Boost:      1.68.0
  MPI:        
    standard: 3.1
    flavor:   OpenMPI (2.1.6)
  PNGwriter:  0.7.0
  libSplash:  1.7.0 (Format 4.0)
  ADIOS:      1.13.1

Setup

See here for the full input parameters:

Important Notes

4 particles per species and cell is considered "not much" for PMacc.

Our memory structures show super-linear speedup up to a few dozen particles per cell, after which scaling becomes linear. They are designed this way so that high-resolution overdense plasmas, photons, ionization electrons, etc. scale well. A low number of particles per species can under-utilize them; see the sketch below.
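To make that concrete, here is a minimal cost-model sketch (all constants are made-up illustrations, not measured values): a time step costs a roughly fixed grid/field part plus a per-particle part, so the cost per particle falls as particles per cell increase, until the per-particle term dominates and scaling turns linear.

```python
# Illustrative amortization model, NOT measured data: a step costs a
# fixed grid/field part plus a per-particle part. Constants are assumed.
GRID_COST_MS = 100.0     # field solver, halo exchange, ... (assumption)
PER_PARTICLE_NS = 0.8    # push + current deposition (assumption)

def ns_per_particle(ppc: int, cells: int = 512 * 256 * 256) -> float:
    """Cost per particle and step: the fixed grid cost amortizes away."""
    particles = cells * ppc
    step_ns = GRID_COST_MS * 1e6 + PER_PARTICLE_NS * particles
    return step_ns / particles

for ppc in (1, 4, 16, 64):
    print(f"{ppc:3d} ppc -> {ns_per_particle(ppc):.2f} ns / particle and step")
```

With these stand-in constants, the per-particle cost drops steeply up to a few dozen ppc and then flattens, which is the super-linear speedup meant above.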

Cases (Variants)

Case 0 is the most realistic one; it is also what I would choose for an underdense plasma simulation. Everything else is for comparison with other implementations.

Status

Preliminary Results

Case 0

PCS (3rd order, pc. cubic):
  run at node: gp002
  init time: 146 s
  main simulation time: 887 s
  time per time step: 887 ms
  ns / particle and time step: 3.30 ns
  particles advanced / device and second: 302 million

TSC (2nd order, pc. quadratic):
  run at node: gp001
  init time: 146 s
  main simulation time: 473 s
  time per time step: 473 ms
  ns / particle and time step: 1.76 ns
  particles advanced / device and second: 568 million

CIC (1st order, pc. linear):
  run at node: gp001
  init time: 146 s
  main simulation time: 264 s
  time per time step: 264 ms
  ns / particle and time step: 0.98 ns
  particles advanced / device and second: 1.02 billion
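As a cross-check, the derived metrics can be reproduced from the raw timings. A small sketch, assuming the 512x256x256 grid confirmed below, 1000 time steps, and two species at 4 particles per cell each (the species count and step count are my inference from the numbers):

```python
# Re-derive the Case 0 CIC metrics (assumptions: 512x256x256 cells,
# 1000 time steps, 2 species x 4 particles per cell).
cells = 512 * 256 * 256
particles = cells * 2 * 4              # ~268.4 million macro-particles
steps = 1000
main_time_s = 264.0                    # CIC main simulation time above

time_per_step_ms = main_time_s / steps * 1e3                     # 264 ms
ns_per_particle_step = main_time_s * 1e9 / (steps * particles)   # ~0.98 ns
particles_per_dev_s = particles * steps / main_time_s            # ~1.02e9

print(time_per_step_ms, ns_per_particle_step, particles_per_dev_s)
```

The same arithmetic reproduces the PCS and TSC rows, which supports the 2 x 4 ppc reading.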

Case 1

PCS (3rd order, pc. cubic):

TSC (2nd order, pc. quadratic):
  run at node: gp001
  init time: 70 s
  main simulation time: 384 s
  time per time step: 384 ms
  ns / particle and time step: 2.86 ns
  particles advanced / device and second: 350 million

CIC (1st order, pc. linear):
  run at node: gp001
  init time: 70 s
  main simulation time: 198 s
  time per time step: 198 ms
  ns / particle and time step: 1.48 ns
  particles advanced / device and second: 678 million

Case 2

PCS (3rd order, pc. cubic):
  run at node: gp001
  init time: 146 s
  main simulation time: 1383 s
  time per time step: 1383 ms
  ns / particle and time step: 5.15 ns
  particles advanced / device and second: 194 million

TSC (2nd order, pc. quadratic):
  run at node: gp002
  init time: 146 s
  main simulation time: 549 s
  time per time step: 548 ms
  ns / particle and time step: 2.04 ns
  particles advanced / device and second: 490 million

CIC (1st order, pc. linear):
  run at node: gp001
  init time: 146 s
  main simulation time: 305 s
  time per time step: 305 ms
  ns / particle and time step: 1.14 ns
  particles advanced / device and second: 880 million

Case 3

...

Note on Metrics

Dear reader, please be aware that the metric "ns / particle and time step", commonly used in our community, is not very general, to put it mildly. Nevertheless, it gives you an idea of how well two implementations utilize the same given hardware.

If you want to compare across different hardware, normalize again by the theoretical peak Flop/s of each device to see how the two implementations compare in terms of hardware utilization.
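A sketch of that normalization (the peak values are vendor spec-sheet FP64 numbers for the SXM2 variants; whether FP32 or FP64 peak is the right denominator depends on the precision the code runs in):

```python
# Normalize throughput by theoretical peak Flop/s to compare hardware
# utilization across devices. Peaks are spec-sheet FP64 numbers (SXM2).
PEAK_TFLOPS_FP64 = {"P100": 5.3, "V100": 7.8}

def particles_per_s_per_tflops(particles_per_s: float, device: str) -> float:
    return particles_per_s / PEAK_TFLOPS_FP64[device]

# Case 0, CIC on a P100: ~1.02e9 particles advanced / device and second
print(f"{particles_per_s_per_tflops(1.02e9, 'P100'):.3g} particles/s per TFlop/s")
```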

To be honest, what counts for me is "time to solution", aka "time to scientific results". With core PIConGPU (on P100) and 3D3V runs, that is on the order of 2 to 10 hours (20k-100k iterations) when fully subscribed, and significantly faster when strong scaling, where we can gain about another order of magnitude in speedup (not shown in this benchmark), up to about 20 iterations / second in 3D3V.

Output Data

Currently stored at HZDR under /bigdata/hplsim/scratch/huebl/thermalBench/

RemiLehe commented 6 years ago

Thanks for posting those benchmarks! Could you add the number of cells in each dimension? Also, could you confirm that this is on a single GPU?

ax3l commented 6 years ago

You are welcome :) Yes, one GPU with 512x256x256 cells: .cfg file. The full input directory is also linked in the PR description under Setup; I will make the link more prominent.

sbastrakov commented 6 years ago

@ax3l sorry for nitpicking, but what do you mean by "super-linear scaling" in the important note? To me, the natural meaning of that phrase is actually the opposite of the conclusion you state. In my mind, if something has super-linear scaling, then the costs grow more than proportionally to the increase in ppc, which in our case is the opposite (up to a reasonably large count).

ax3l commented 6 years ago

https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup

sbastrakov commented 6 years ago

Ah, now I see your point. Just to clarify mine: to me the association worked like "memory structures" + "super-linear scaling" -> data structures + super-linear complexity.

ax3l commented 6 years ago

Got ya, added the link to clarify :) Do you mind if I minimize our comments as off-topic to collapse the thread? :)

sbastrakov commented 6 years ago

@ax3l please do.

sbastrakov commented 5 years ago

cc @steindev

ax3l commented 4 years ago

@psychocoderHPC Another V100 test with 2x16ppc tuned for maximum particles per GPU (memory footprint estimates): https://github.com/ax3l/picongpu/tree/topic-20200213-thermalBench/share/picongpu/examples/ThermalBenchmark
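For readers who want to redo such a footprint estimate, a back-of-the-envelope sketch (all constants are rough assumptions, not PIConGPU's actual particle layout or memory split):

```python
# Rough estimate: macro-particles fitting on one GPU. Bytes per particle
# (position, momentum, weighting, bookkeeping) and the share reserved
# for fields are assumptions, not PIConGPU's exact numbers.
GPU_MEMORY_BYTES = 16e9      # V100, 16 GB variant
FIELD_FRACTION = 0.3         # assumed share for fields, buffers, etc.
BYTES_PER_PARTICLE = 48      # assumed

budget = GPU_MEMORY_BYTES * (1.0 - FIELD_FRACTION)
print(f"~{budget / BYTES_PER_PARTICLE / 1e6:.0f} million macro-particles")
```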

steindev commented 3 years ago

Is there any reason to keep this issue open?

PrometheusPi commented 3 years ago

I am pretty certain this can be closed.

psychocoderHPC commented 3 years ago

This issue was opened to define an example for comparison between different PIC codes. If I remember correctly, validation is required too. I do not think we should close the issue; it is still a to-do to define a reproducible example that can be validated for correctness and compared with other codes.