ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

3D3V Thermal Benchmark with 0.4.2 on P100 #2815

Open ax3l opened 6 years ago

ax3l commented 6 years ago

This just documents a little benchmark for @RemiLehe et al.

PIConGPU Version

0.4.2 (11/2018)

Backend & Hardware

Used Software

module gcc/7.3.0 
module cmake/3.11.3 
module cuda/9.2 
module openmpi/2.1.2-cuda92 
module boost/1.68.0 
module zlib/1.2.11 
module c-blosc/1.14.4 
module adios/1.13.1-cuda92 
module hdf5-parallel/1.8.20-cuda92 
module libsplash/1.7.0-cuda92 
module libpng/1.6.35 
module pngwriter/0.7.0

picongpu --version:

PIConGPU: 0.4.2
  Build-Type: Release

Third party:
  OS:         Linux-3.10.0-693.11.6.el7.x86_64
  arch:       x86_64
  CXX:        GNU (7.3.0)
  CMake:      3.11.3
  CUDA:       9.2.88
  mallocMC:   2.3.0
  Boost:      1.68.0
  MPI:        
    standard: 3.1
    flavor:   OpenMPI (2.1.6)
  PNGwriter:  0.7.0
  libSplash:  1.7.0 (Format 4.0)
  ADIOS:      1.13.1

Setup

See here for the full input parameters:

Important Notes

4 particles per species and cell is considered "not much" for PMacc.

Our memory structures show super-linear speedup up to a few dozen particles per cell, after which scaling becomes linear. They are designed this way so that high-resolution overdense plasmas, photons, ionization electrons, etc. scale well. A low number of particles per species can under-utilize them; see the sketch below.
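To make that concrete, here is a minimal cost-model sketch (all constants are made-up illustrations, not measured values): a time step costs a roughly fixed grid/field part plus a per-particle part, so the cost per particle falls as particles per cell increase, until the per-particle term dominates and scaling turns linear.

```python
# Illustrative amortization model, NOT measured data: a step costs a
# fixed grid/field part plus a per-particle part. Constants are assumed.
GRID_COST_MS = 100.0     # field solver, halo exchange, ... (assumption)
PER_PARTICLE_NS = 0.8    # push + current deposition (assumption)

def ns_per_particle(ppc: int, cells: int = 512 * 256 * 256) -> float:
    """Cost per particle and step: the fixed grid cost amortizes away."""
    particles = cells * ppc
    step_ns = GRID_COST_MS * 1e6 + PER_PARTICLE_NS * particles
    return step_ns / particles

for ppc in (1, 4, 16, 64):
    print(f"{ppc:3d} ppc -> {ns_per_particle(ppc):.2f} ns / particle and step")
```

With these stand-in constants, the per-particle cost drops steeply up to a few dozen ppc and then flattens, which is the super-linear speedup meant above.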

Cases (Variants)

Case 0 is the most realistic one; it is also what I would choose for an underdense plasma simulation. Everything else is for comparison with other implementations.

Status

Preliminary Results

Case 0

PCS (3rd order, pc. cubic):
  run at node: gp002
  init time: 146 s
  main simulation time: 887 s
  time per time step: 887 ms
  ns / particle and time step: 3.30 ns
  particles advanced / device and second: 302 million

TSC (2nd order, pc. quadratic):
  run at node: gp001
  init time: 146 s
  main simulation time: 473 s
  time per time step: 473 ms
  ns / particle and time step: 1.76 ns
  particles advanced / device and second: 568 million

CIC (1st order, pc. linear):
  run at node: gp001
  init time: 146 s
  main simulation time: 264 s
  time per time step: 264 ms
  ns / particle and time step: 0.98 ns
  particles advanced / device and second: 1.02 billion
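As a cross-check, the derived metrics can be reproduced from the raw timings. A small sketch, assuming the 512x256x256 grid confirmed below, 1000 time steps, and two species at 4 particles per cell each (the species count and step count are my inference from the numbers):

```python
# Re-derive the Case 0 CIC metrics (assumptions: 512x256x256 cells,
# 1000 time steps, 2 species x 4 particles per cell).
cells = 512 * 256 * 256
particles = cells * 2 * 4              # ~268.4 million macro-particles
steps = 1000
main_time_s = 264.0                    # CIC main simulation time above

time_per_step_ms = main_time_s / steps * 1e3                     # 264 ms
ns_per_particle_step = main_time_s * 1e9 / (steps * particles)   # ~0.98 ns
particles_per_dev_s = particles * steps / main_time_s            # ~1.02e9

print(time_per_step_ms, ns_per_particle_step, particles_per_dev_s)
```

The same arithmetic reproduces the PCS and TSC rows, which supports the 2 x 4 ppc reading.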

Case 1

PCS (3rd order, pc. cubic):

TSC (2nd order, pc. quadratic):
  run at node: gp001
  init time: 70 s
  main simulation time: 384 s
  time per time step: 384 ms
  ns / particle and time step: 2.86 ns
  particles advanced / device and second: 350 million

CIC (1st order, pc. linear):
  run at node: gp001
  init time: 70 s
  main simulation time: 198 s
  time per time step: 198 ms
  ns / particle and time step: 1.48 ns
  particles advanced / device and second: 678 million

Case 2

PCS (3rd order, pc. cubic):
  run at node: gp001
  init time: 146 s
  main simulation time: 1383 s
  time per time step: 1383 ms
  ns / particle and time step: 5.15 ns
  particles advanced / device and second: 194 million

TSC (2nd order, pc. quadratic):
  run at node: gp002
  init time: 146 s
  main simulation time: 549 s
  time per time step: 548 ms
  ns / particle and time step: 2.04 ns
  particles advanced / device and second: 490 million

CIC (1st order, pc. linear):
  run at node: gp001
  init time: 146 s
  main simulation time: 305 s
  time per time step: 305 ms
  ns / particle and time step: 1.14 ns
  particles advanced / device and second: 880 million

Case 3

...

Note on Metrics

Dear reader, please be aware that the metric "ns / particle and time step", commonly used in our community, is not very general, to put it mildly. Nevertheless, it gives you an idea of how well two implementations utilize the same given hardware.

If you want to compare across different hardware, normalize again by the theoretical peak Flop/s of each device to see how the two implementations compare in terms of hardware utilization.
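A sketch of that normalization (the peak values are vendor spec-sheet FP64 numbers for the SXM2 variants; whether FP32 or FP64 peak is the right denominator depends on the precision the code runs in):

```python
# Normalize throughput by theoretical peak Flop/s to compare hardware
# utilization across devices. Peaks are spec-sheet FP64 numbers (SXM2).
PEAK_TFLOPS_FP64 = {"P100": 5.3, "V100": 7.8}

def particles_per_s_per_tflops(particles_per_s: float, device: str) -> float:
    return particles_per_s / PEAK_TFLOPS_FP64[device]

# Case 0, CIC on a P100: ~1.02e9 particles advanced / device and second
print(f"{particles_per_s_per_tflops(1.02e9, 'P100'):.3g} particles/s per TFlop/s")
```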

To be honest, what counts for me is "time to solution", aka "time to scientific results". With core PIConGPU (on P100) and 3D3V runs, that is on the order of 2 to 10 hours (20k-100k iterations) when fully subscribed, and significantly faster when strong scaling, where we can gain about another order of magnitude in speedup (not shown in this benchmark), up to about 20 iterations / second in 3D3V.

Output Data

Currently stored at HZDR under /bigdata/hplsim/scratch/huebl/thermalBench/

RemiLehe commented 6 years ago

Thanks for posting those benchmarks! Could you add the number of cells in each dimension? Also, could you confirm that this is on a single GPU?

ax3l commented 6 years ago

You are welcome :) Yes, one GPU with 512x256x256 cells: .cfg file. The full input directory is also linked in the PR description under Setup; I will make the link more prominent.

sbastrakov commented 6 years ago

@ax3l sorry for nitpicking, but what do you mean by "super-linear scaling" in the important note? To me, the natural meaning of that phrase is actually the opposite of the conclusion you state. In my mind, if something has super-linear scaling, then the costs grow more than proportionally to the increase in ppc, which in our case is the opposite (up to a reasonably large count).

ax3l commented 6 years ago

https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup

sbastrakov commented 6 years ago

Ah, now I see your point. Just to clarify mine: to me the association worked like "memory structures" + "super-linear scaling" -> data structures + super-linear complexity.

ax3l commented 6 years ago

Got ya, added the link to clarify :) Do you mind if I minimize our comments as off-topic to collapse the thread? :)

sbastrakov commented 6 years ago

@ax3l please do.

sbastrakov commented 5 years ago

cc @steindev

ax3l commented 4 years ago

@psychocoderHPC Another V100 test with 2x16ppc tuned for maximum particles per GPU (memory footprint estimates): https://github.com/ax3l/picongpu/tree/topic-20200213-thermalBench/share/picongpu/examples/ThermalBenchmark
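For readers who want to redo such a footprint estimate, a back-of-the-envelope sketch (all constants are rough assumptions, not PIConGPU's actual particle layout or memory split):

```python
# Rough estimate: macro-particles fitting on one GPU. Bytes per particle
# (position, momentum, weighting, bookkeeping) and the share reserved
# for fields are assumptions, not PIConGPU's exact numbers.
GPU_MEMORY_BYTES = 16e9      # V100, 16 GB variant
FIELD_FRACTION = 0.3         # assumed share for fields, buffers, etc.
BYTES_PER_PARTICLE = 48      # assumed

budget = GPU_MEMORY_BYTES * (1.0 - FIELD_FRACTION)
print(f"~{budget / BYTES_PER_PARTICLE / 1e6:.0f} million macro-particles")
```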

steindev commented 3 years ago

Is there any reason to keep this issue open?

PrometheusPi commented 3 years ago

I am pretty certain this can be closed.

psychocoderHPC commented 3 years ago

This issue was opened to define an example for comparison between different PIC codes. If I remember correctly, validation is required too. I do not think we should close the issue; it is still a to-do to define a reproducible example that can be validated for correctness and compared with other codes.