bzdjordje opened this issue 4 years ago
Thanks for sharing the results @bzdjordje, and great to see that you have been able to run WarpX on CPU and GPU :).
In general, you can run the same simulation on CPU and GPU, but the parallelization may be different in order to utilize resources efficiently. Some information on how to run WarpX on a typical CPU platform (Cori@NERSC) and a typical GPU platform (Summit@OLCF) can be found on this page https://warpx.readthedocs.io/en/latest/running_cpp/platforms.html.
Could you provide additional info?
We will be able to help more with more data. In the meantime, I can already see that:
CUDA initialized with 20 GPU(s) and 90 ranks.
WarpX is not meant to operate like this: you should have one MPI rank per GPU. You may also have to set a different max_grid_size between CPU and GPU runs, see https://warpx.readthedocs.io/en/latest/running_cpp/parallelization.html
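For illustration, a hedged sketch of a one-rank-per-GPU launch on a Summit-like machine (the node/GPU counts, core count, and executable name are assumptions, not taken from this thread):

```shell
# Sketch only: launch 1 MPI rank per GPU on a Summit-like system.
# jsrun flags: -n = number of resource sets, -a = MPI tasks per set,
#              -g = GPUs per set, -c = CPU cores per set.
# With 20 GPUs total, you would use 20 ranks rather than 90.
jsrun -n 20 -a 1 -g 1 -c 7 ./main2d.gnu.TPROF.MPI.CUDA.ex inputs
```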
OMP initialized with 3 OMP threads
Could you use a number of threads that is a power of 2 (or at least a multiple of 2)?
These two simulations were initialized on 90 processes over 5 nodes.
This should be related to the architecture you are running on, so it will typically be different for CPU and GPU runs. The WarpX page on running on specific platforms mentioned above should illustrate this. I am sure we can make the CPU and the GPU runs perform better!
Hi @MaxThevenet, good to hear from you again! Apologies for the delay, I have been having some issues recompiling WarpX on GPU, being addressed in issue #1132.
Regarding your other requests, I am trying to compile/run on the following two machines:
Lassen (GPU): https://hpc.llnl.gov/hardware/platforms/lassen (seems to be similar to Summit)
Quartz (CPU): https://hpc.llnl.gov/hardware/platforms/Quartz
An example input file has been attached below: inputs_comp.txt
Submission scripts for the previous two machines (saved as .txt files):
Lassen: rung.txt
Quartz: runc.txt
Output files:
Lassen: gpu_1216027.out.txt
Quartz: cpu_1215996.out.txt
For Quartz, the maximum walltime for a simulation in debug mode is 30 minutes on 90 processes over 5 nodes, which is why those parameters were chosen. I am not yet sure whether the same limit applies on Lassen, but I kept the same parameters for that run nevertheless.
Ok, thanks for the details. I think the main issue here is that this 2D problem is far too small for the number of nodes you are using. A few scans could help us find optimal conditions, but I suspect that a good starting point would be:
export OMP_NUM_THREADS=8
lrun -N1 -T4 ./main2dc.ex inputs
Also, could you change the value of amr.max_grid_size in your input script and set it to amr.max_grid_size = 1024 instead?
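For reference, the relevant line of the input script would then read as follows (a sketch; the comment gives the usual rationale, which is not spelled out in this thread):

```
# Boxes are split into at most max_grid_size cells per side; larger boxes
# mean fewer, bigger GPU kernels, which typically suits GPU runs better.
amr.max_grid_size = 1024
```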
Ok, sorry for the delay, I had trouble recompiling the GPU version of WarpX per issue #1132 but it seems to work again now.
So I configured the Lassen/GPU and Quartz/CPU simulations per your recommendations and I did indeed see a GPU speed-up this time! 11 s vs 17 s. Is there a simple way to scale this up to a bigger problem? For example, this problem had about 32,000 cells, 25 particles per cell, and a 30 fs laser, but our actual baseline quasi-1D problem has more like a minimum of 300,000 cells, 1000 ppc, and a 200 fs laser.
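As a rough back-of-the-envelope check (assuming cost scales linearly with cell count, particles per cell, and simulated duration, which ignores load balancing and other effects), the baseline problem is roughly 2500x the work of this test:

```python
# Rough work-scaling estimate: assume cost ~ cells * ppc * simulated time.
test_cells, test_ppc, test_fs = 32_000, 25, 30
base_cells, base_ppc, base_fs = 300_000, 1000, 200

factor = (base_cells / test_cells) * (base_ppc / test_ppc) * (base_fs / test_fs)
print(f"Baseline is roughly {factor:.0f}x the work of the test problem")
```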
Lassen/GPU: lassen_gpu_out.txt
Quartz/CPU: quartz_cpu_out.txt
Is there a recommended way to compare performance between GPU and CPU? Is there a specific way to compile the two different main2d executables? Everything should be the same between the two except for USE_GPU. I am also using USE_OMP=TRUE and compiling with gcc/7.3.1. Currently the GPU run is about 5x slower than the CPU run. My example outputs are as follows:
GPU:
Initializing CUDA...
CUDA initialized with 20 GPU(s) and 90 ranks.
MPI initialized with 90 MPI processes
MPI initialized with thread support level 0
AMReX (20.06-96-gd93db0a9500d) initialized
...
STEP 1000 ends. TIME = 4.38635098e-13 DT = 4.38635098e-16
Walltime = 201.8242962 s; This step = 0.181442859 s; Avg. per step = 0.2018242962 s
Writing plotfile diags/diag101000
Total Time : 214.0729109
CPU:
MPI initialized with 90 MPI processes
MPI initialized with thread support level 0
OMP initialized with 3 OMP threads
AMReX (20.06-96-gd93db0a9500d-dirty) initialized
...
Total Time : 43.27789566
As you can see the GPU instance is initialized with 20 GPUs and 0 threads, while CPU has 3 OMP threads. These two simulations were initialized on 90 processes over 5 nodes.
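The ~5x figure can be read directly off the two Total Time values quoted above (a trivial check; the timer values are exactly those from the logs):

```python
# Ratio of the two "Total Time" values from the logs above.
gpu_total = 214.0729109   # Lassen/GPU run, seconds
cpu_total = 43.27789566   # Quartz/CPU run, seconds

slowdown = gpu_total / cpu_total
print(f"GPU run was {slowdown:.1f}x slower than the CPU run")
```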