bzdjordje opened this issue 4 years ago
Thanks for sharing the results @bzdjordje, and great to see that you have been able to run WarpX on CPU and GPU :).
In general, you can run the same simulation on CPU and GPU, but the parallelization may be different in order to utilize resources efficiently. Some information on how to run WarpX on a typical CPU platform (Cori@NERSC) and a typical GPU platform (Summit@OLCF) can be found on this page https://warpx.readthedocs.io/en/latest/running_cpp/platforms.html.
Could you provide additional info?
We will be able to help more with more data. In the meantime, I can already see that:
CUDA initialized with 20 GPU(s) and 90 ranks.
WarpX is not meant to operate like this: you should have one MPI rank per GPU. You may also have to set a different max_grid_size between CPU and GPU runs, see https://warpx.readthedocs.io/en/latest/running_cpp/parallelization.html
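For illustration, a hedged sketch of a one-rank-per-GPU launch on a Summit-like machine (the node/GPU counts, core count, and executable name are assumptions, not taken from this thread):

```shell
# Sketch only: launch 1 MPI rank per GPU on a Summit-like system.
# jsrun flags: -n = number of resource sets, -a = MPI tasks per set,
#              -g = GPUs per set, -c = CPU cores per set.
# With 20 GPUs total, you would use 20 ranks rather than 90.
jsrun -n 20 -a 1 -g 1 -c 7 ./main2d.gnu.TPROF.MPI.CUDA.ex inputs
```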
OMP initialized with 3 OMP threads
Could you use a number of threads that is a power of 2 (or at least a multiple of 2)?
These two simulations were initialized on 90 processes over 5 nodes.
This should be related to the architecture you are running on, so it will typically be different for CPU and GPU runs. The WarpX page on running on specific platforms mentioned above should illustrate this. I am sure we can make the CPU and the GPU runs perform better!
Hi @MaxThevenet, good to hear from you again! Apologies for the delay, I have been having some issues recompiling WarpX on GPU, being addressed in issue #1132.
Regarding your other requests, I am trying to compile/run on the following two machines:
Lassen (GPU): https://hpc.llnl.gov/hardware/platforms/lassen (seems to be similar to Summit)
Quartz (CPU): https://hpc.llnl.gov/hardware/platforms/Quartz
An example input file has been attached below: inputs_comp.txt
Submission scripts for the previous two machines (saved as .txt files):
Lassen: rung.txt
Quartz: runc.txt
Output files:
Lassen: gpu_1216027.out.txt
Quartz: cpu_1215996.out.txt
For Quartz, the maximum walltime for a simulation in debug mode is 30 minutes on 90 processes over 5 nodes, which is why those parameters were chosen. I am not yet sure whether the same limit applies on Lassen, but I kept the same parameters for that run nevertheless.
Ok, thanks for the details. I think the main issue here is that this 2D problem is far too small for the number of nodes you are using. A few scans could help us find optimal conditions, but I suspect that a good starting point would be:
export OMP_NUM_THREADS=8
lrun -N1 -T4 ./main2dc.ex inputs
Also, could you change the value of amr.max_grid_size in your input script and set it to amr.max_grid_size = 1024 instead?
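For reference, the relevant line of the input script would then read as follows (a sketch; the comment gives the usual rationale, which is not spelled out in this thread):

```
# Boxes are split into at most max_grid_size cells per side; larger boxes
# mean fewer, bigger GPU kernels, which typically suits GPU runs better.
amr.max_grid_size = 1024
```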
Ok, sorry for the delay, I had trouble recompiling the GPU version of WarpX per issue #1132 but it seems to work again now.
So I configured the Lassen/GPU and Quartz/CPU simulations per your recommendations and I did indeed see a GPU speed-up this time! 11 s vs 17 s. Is there a simple way to scale this up to a bigger problem? For example, this problem had about 32,000 cells, 25 particles per cell, and a 30 fs laser, but our actual baseline quasi-1D problem has more like a minimum of 300,000 cells, 1000 ppc, and a 200 fs laser.
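As a rough back-of-the-envelope check (assuming cost scales linearly with cell count, particles per cell, and simulated duration, which ignores load balancing and other effects), the baseline problem is roughly 2500x the work of this test:

```python
# Rough work-scaling estimate: assume cost ~ cells * ppc * simulated time.
test_cells, test_ppc, test_fs = 32_000, 25, 30
base_cells, base_ppc, base_fs = 300_000, 1000, 200

factor = (base_cells / test_cells) * (base_ppc / test_ppc) * (base_fs / test_fs)
print(f"Baseline is roughly {factor:.0f}x the work of the test problem")
```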
Lassen/GPU: lassen_gpu_out.txt
Quartz/CPU: quartz_cpu_out.txt
Is there a recommended way to compare performance between GPU and CPU? Is there a specific way to compile the two different main2d executables? Everything should be the same between the two except for USE_GPU. I am also using USE_OMP=TRUE and compiling with gcc/7.3.1. Currently the GPU run is about 5x slower than the CPU run. My example outputs are as follows:
GPU:
Initializing CUDA...
CUDA initialized with 20 GPU(s) and 90 ranks.
MPI initialized with 90 MPI processes
MPI initialized with thread support level 0
AMReX (20.06-96-gd93db0a9500d) initialized
...
STEP 1000 ends. TIME = 4.38635098e-13 DT = 4.38635098e-16
Walltime = 201.8242962 s; This step = 0.181442859 s; Avg. per step = 0.2018242962 s
Writing plotfile diags/diag101000
Total Time : 214.0729109
CPU:
MPI initialized with 90 MPI processes
MPI initialized with thread support level 0
OMP initialized with 3 OMP threads
AMReX (20.06-96-gd93db0a9500d-dirty) initialized
...
Total Time : 43.27789566
As you can see the GPU instance is initialized with 20 GPUs and 0 threads, while CPU has 3 OMP threads. These two simulations were initialized on 90 processes over 5 nodes.
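The ~5x figure can be read directly off the two Total Time values quoted above (a trivial check; the timer values are exactly those from the logs):

```python
# Ratio of the two "Total Time" values from the logs above.
gpu_total = 214.0729109   # Lassen/GPU run, seconds
cpu_total = 43.27789566   # Quartz/CPU run, seconds

slowdown = gpu_total / cpu_total
print(f"GPU run was {slowdown:.1f}x slower than the CPU run")
```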