ECP-copa / Cabana

Performance-portable library for particle-based simulations

Cabana-based code takes same time to run on both GPU and on parallel CPU cores #749

Open dineshadepu opened 3 weeks ago

dineshadepu commented 3 weeks ago

Hi all,

Similar to #748 this is also a question.

I have an HPC system with the following configuration:

AMD Ryzen Threadripper PRO 5975WX (32 cores) and two NVIDIA RTX A5500 GPUs.

In issue #748 I mentioned that I am developing a coupled SPH-DEM solver; so far I have implemented the SPH solver (still has bugs) and the DEM solver (done) independently, and they are not coupled yet. I ran settling_of_bodies_in_tank.cpp on both parallel CPU cores and on the GPU, using the following command:

```
time ./examples/03RBBodiesSettling 0.1 1.0 1.0 1000 0.1 200
```

This corresponds to 1000 rigid bodies and a total simulated time of $0.1$ seconds, for a total of $1800$ steps; the 1000 rigid bodies are discretized into $127436$ particles.

The total run times are:

| OpenMP | CUDA |
| --- | --- |
| 9.17 seconds | 9.8 seconds |

However, the ExaMPM code developed by the Cabana developers is much faster on the GPU than on parallel CPU cores. For comparison, I took the DamBreak example and ran it with the following command for OpenMP:

```
time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 OpenMP
```

and for the GPU:

```
time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 CUDA
```

I get the following run times:

| OpenMP | CUDA |
| --- | --- |
| 33 seconds | 0.98 seconds |

The CUDA run is over 30 times faster than the CPU run. Unfortunately, I cannot get similar numbers for my own code, even though I believe I followed the best practices, and I am not sure why I am missing this performance boost. Could it be that I am using two AoSoA's, or is it something else? Is there a way to debug this performance issue? Both the SPH and DEM codes are almost ready; only the coupling remains to be added. Can you please help me with this issue?
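In case it helps frame the question, here is a minimal sketch (the phase functions are placeholders, not my actual code) of how I could wrap the phases of my time-step loop in Kokkos profiling regions, so that a Kokkos Tools connector or Nsight Systems could attribute time per phase:

```cpp
#include <Kokkos_Core.hpp>

// Placeholder phases; the real solver would call its Cabana/Kokkos kernels here.
void buildNeighborList() { /* neighbor list construction */ }
void computeForces()     { /* DEM contact force kernel */ }
void integrate()         { /* time integration kernel */ }

void timeStep()
{
    // Each region shows up as a named range in Kokkos Tools output and in
    // Nsight Systems timelines, giving a per-phase breakdown on CPU and GPU.
    Kokkos::Profiling::pushRegion( "neighbor_list" );
    buildNeighborList();
    Kokkos::Profiling::popRegion();

    Kokkos::Profiling::pushRegion( "forces" );
    computeForces();
    Kokkos::Profiling::popRegion();

    Kokkos::Profiling::pushRegion( "integrate" );
    integrate();
    Kokkos::Profiling::popRegion();
}
```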

Thank you so much. I am happy to provide any additional information.

streeve commented 3 weeks ago

100k particles is often about the break-even point when comparing performance between a CPU node and a single GPU. It's generally (but not necessarily) only barely enough work to fully utilize even a single GPU, let alone two.

You can certainly still get performance improvements, particularly by avoiding memory allocation (as in the previous issue), reducing communication, etc. The important question is: what is the timing breakdown on the CPU versus the GPU? It is very likely that the code spends very different amounts of time in different sections on each architecture.
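If attaching a profiling tool is inconvenient, a manual breakdown with `Kokkos::Timer` works as a first pass. A minimal sketch (with placeholder phase functions standing in for your actual kernels) could look like the following; the `Kokkos::fence()` calls matter on CUDA, since kernels launch asynchronously and the timer would otherwise mostly measure launch overhead:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

// Placeholder phases standing in for the solver's real Cabana/Kokkos kernels.
void buildNeighborList() { /* ... */ }
void computeForces()     { /* ... */ }
void integrate()         { /* ... */ }

int main( int argc, char* argv[] )
{
    Kokkos::initialize( argc, argv );
    {
        double t_neigh = 0.0, t_force = 0.0, t_int = 0.0;
        Kokkos::Timer timer;

        const int num_steps = 1800;
        for ( int step = 0; step < num_steps; ++step )
        {
            timer.reset();
            buildNeighborList();
            Kokkos::fence(); // wait for asynchronous device kernels to finish
            t_neigh += timer.seconds();

            timer.reset();
            computeForces();
            Kokkos::fence();
            t_force += timer.seconds();

            timer.reset();
            integrate();
            Kokkos::fence();
            t_int += timer.seconds();
        }

        std::printf( "neighbor: %.3f s  forces: %.3f s  integrate: %.3f s\n",
                     t_neigh, t_force, t_int );
    }
    Kokkos::finalize();
    return 0;
}
```

Comparing the per-phase totals from an OpenMP run and a CUDA run should show which section is not speeding up.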

dineshadepu commented 2 weeks ago

Thanks for the input, Sam. I will leave this open and continue with the rest of the code development, and I will update this issue once I have completed profiling the code on both architectures.