ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

GPU vs CPU Benchmark #1136

Open bzdjordje opened 4 years ago

bzdjordje commented 4 years ago

Is there a best way to compare performance between GPU and CPU? Is there a specific way to compile the two different main2d files? Everything should be the same between the two except for USE_GPU. I am also using USE_OMP=TRUE and compiling with gcc/7.3.1. Currently the GPU run is 5x slower than the CPU run. My example outputs are as follows:

GPU:
Initializing CUDA...
CUDA initialized with 20 GPU(s) and 90 ranks.
MPI initialized with 90 MPI processes
MPI initialized with thread support level 0
AMReX (20.06-96-gd93db0a9500d) initialized
...
STEP 1000 ends. TIME = 4.38635098e-13 DT = 4.38635098e-16
Walltime = 201.8242962 s; This step = 0.181442859 s; Avg. per step = 0.2018242962 s
Writing plotfile diags/diag101000
Total Time : 214.0729109

CPU:
MPI initialized with 90 MPI processes
MPI initialized with thread support level 0
OMP initialized with 3 OMP threads
AMReX (20.06-96-gd93db0a9500d-dirty) initialized
...
Total Time : 43.27789566

As you can see, the GPU run is initialized with 20 GPUs and reports thread support level 0, while the CPU run has 3 OMP threads. Both simulations were run with 90 processes over 5 nodes.
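For reference, the numbers in the two logs above can be pulled out and compared with a few lines of Python. This is only a minimal sketch: the log excerpts are copied from this comment, the helper names (`total_time`, `gpus_and_ranks`) are made up for illustration, and the regular expressions assume the log lines look exactly like the ones quoted here.

```python
import re

# Excerpts copied from the outputs quoted above (not the full logs).
gpu_log = """CUDA initialized with 20 GPU(s) and 90 ranks.
MPI initialized with 90 MPI processes
Total Time : 214.0729109"""

cpu_log = """MPI initialized with 90 MPI processes
OMP initialized with 3 OMP threads
Total Time : 43.27789566"""


def total_time(log: str) -> float:
    """Return the value printed on the 'Total Time' line (hypothetical helper)."""
    match = re.search(r"Total Time\s*:\s*([0-9.eE+-]+)", log)
    if match is None:
        raise ValueError("no 'Total Time' line found")
    return float(match.group(1))


def gpus_and_ranks(log: str) -> tuple[int, int]:
    """Return (number of GPUs, number of MPI ranks) from the CUDA init line."""
    match = re.search(r"CUDA initialized with (\d+) GPU\(s\) and (\d+) ranks", log)
    if match is None:
        raise ValueError("no CUDA initialization line found")
    return int(match.group(1)), int(match.group(2))


# Ratio of total run times: values above 1 mean the GPU run was slower.
slowdown = total_time(gpu_log) / total_time(cpu_log)

# How many MPI ranks share each GPU in the GPU run.
n_gpus, n_ranks = gpus_and_ranks(gpu_log)
ranks_per_gpu = n_ranks / n_gpus

print(f"GPU run / CPU run total time: {slowdown:.2f}x")
print(f"MPI ranks per GPU: {ranks_per_gpu}")
```

With these inputs the ratio comes out just under 5, matching the "5x slower" figure, and the GPU run has 4.5 MPI ranks sharing each GPU.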

MaxThevenet commented 4 years ago

Thanks for sharing the results @bzdjordje, and great to see that you have been able to run WarpX on CPU and GPU :).

In general, you can run the same simulation on CPU and GPU, but the parallelization may be different in order to utilize resources efficiently. Some information on how to run WarpX on a typical CPU platform (Cori@NERSC) and a typical GPU platform (Summit@OLCF) can be found on this page https://warpx.readthedocs.io/en/latest/running_cpp/platforms.html.

Could you provide additional info?

We will be able to help more with more data. In the meantime, I can already see that:

I am sure we can make the CPU and the GPU runs perform better!!

bzdjordje commented 4 years ago

Hi @MaxThevenet, good to hear from you again! Apologies for the delay, I have been having some issues recompiling WarpX on GPU, which are being addressed in issue #1132.

Regarding your other requests, I am trying to compile/run on the following two machines:
Lassen (GPU): https://hpc.llnl.gov/hardware/platforms/lassen (seems to be similar to Summit)
Quartz (CPU): https://hpc.llnl.gov/hardware/platforms/Quartz

An example input file has been attached below: inputs_comp.txt

Submission scripts for the previous two machines (saved as .txt files):
Lassen: rung.txt
Quartz: runc.txt

Output files:
Lassen: gpu_1216027.out.txt
Quartz: cpu_1215996.out.txt

For Quartz, the maximum for a simulation in debug mode is 30 minutes on 90 processes over 5 nodes, which is why those parameters were chosen. I am not sure yet whether the same limit applies on Lassen, but I kept the same settings for that run nevertheless.

MaxThevenet commented 4 years ago

Ok, thanks for the details. I think the main issue here is that this 2D problem is far too small for the number of nodes you are using. A few scans could help us find optimal conditions, but I suspect that a good starting point would be:

bzdjordje commented 4 years ago

Ok, sorry for the delay, I had trouble recompiling the GPU version of WarpX per issue #1132 but it seems to work again now.

So I configured the Lassen/GPU and Quartz/CPU simulations per your recommendations and I did indeed see a GPU speedup this time: 11 s vs 17 s. Is there a simple way to scale this for a bigger problem? For example, this problem had about 32,000 cells, 25 particles per cell (ppc), and a 30 fs laser, but our actual baseline quasi-1D problem has a minimum of about 300,000 cells, 1000 ppc, and a 200 fs laser.

Lassen/GPU: lassen_gpu_out.txt

Quartz/CPU: quartz_cpu_out.txt
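
To put the sizes quoted above side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the approximate figures from this comment and assumes the macroparticle count is simply cells times particles per cell (one species, uniformly filled domain), which may not hold for the real setup. It does not try to predict run time.

```python
# Approximate figures quoted in the comment above.
small = {"cells": 32_000, "ppc": 25, "laser_fs": 30}
large = {"cells": 300_000, "ppc": 1000, "laser_fs": 200}


def macroparticles(problem: dict) -> int:
    """Cells times particles per cell (assumes one uniformly loaded species)."""
    return problem["cells"] * problem["ppc"]


cells_ratio = large["cells"] / small["cells"]
particle_ratio = macroparticles(large) / macroparticles(small)
laser_ratio = large["laser_fs"] / small["laser_fs"]

# Speedup observed for the small problem: CPU time divided by GPU time.
gpu_seconds, cpu_seconds = 11.0, 17.0
speedup = cpu_seconds / gpu_seconds

print(f"cells:          {cells_ratio:.3f}x larger")
print(f"macroparticles: {macroparticles(small):,} -> {macroparticles(large):,} "
      f"({particle_ratio:.0f}x)")
print(f"laser duration: {laser_ratio:.2f}x longer")
print(f"small-problem GPU speedup: {speedup:.2f}x")
```

Under these assumptions the baseline problem has roughly 9 times more cells but several hundred times more macroparticles than the test case, so the particle count, not the grid, dominates the growth in problem size.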