Open dineshadepu opened 3 weeks ago
100k particles is often the break-even point for performance comparisons between a CPU node and a single GPU. It's generally (but not necessarily) not enough work to fully utilize even a single GPU, let alone two.
You can certainly still get performance improvements, particularly from avoiding memory allocation (as in the previous issue), communication, etc. The important question is: what is the timing breakdown for the CPU and the GPU? It's very likely the code is spending very different amounts of time in different sections on each type of hardware.
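A minimal sketch of how such a per-section timing breakdown could be collected, assuming plain `std::chrono` host timers rather than any particular profiler. The `SectionTimer` class, the `run_demo` loop, and the section names (`neighbors`, `forces`, `integrate`) are hypothetical placeholders, not part of Cabana or of the code discussed in this issue. On the GPU, kernels launch asynchronously, so a synchronization such as `Kokkos::fence()` would be needed before stopping a timer.

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

// Accumulates wall-clock time per named section of a time step.
class SectionTimer {
public:
    using Clock = std::chrono::steady_clock;

    void start(const std::string& name) { started_[name] = Clock::now(); }

    // For GPU kernels, synchronize (e.g. Kokkos::fence()) before calling
    // stop(); otherwise only the asynchronous launch is measured.
    void stop(const std::string& name) {
        const auto it = started_.find(name);
        if (it == started_.end()) return;  // stop() without start(): ignore
        const std::chrono::duration<double> dt = Clock::now() - it->second;
        total_[name] += dt.count();
        calls_[name] += 1;
        started_.erase(it);
    }

    // Total seconds spent in a section (0 if it was never timed).
    double total(const std::string& name) const {
        const auto it = total_.find(name);
        return it == total_.end() ? 0.0 : it->second;
    }

    // Number of completed start/stop pairs for a section.
    long calls(const std::string& name) const {
        const auto it = calls_.find(name);
        return it == calls_.end() ? 0 : it->second;
    }

    // Print one line per section: name, total seconds, call count.
    void report() const {
        for (const auto& entry : total_) {
            std::printf("%-12s %10.6f s  (%ld calls)\n", entry.first.c_str(),
                        entry.second, calls(entry.first));
        }
    }

private:
    std::map<std::string, Clock::time_point> started_;
    std::map<std::string, double> total_;
    std::map<std::string, long> calls_;
};

// Hypothetical time loop: the dummy arithmetic stands in for real kernels.
inline SectionTimer run_demo(int num_steps) {
    SectionTimer timer;
    volatile double sink = 0.0;  // keeps the dummy work from being optimized out
    const char* sections[] = {"neighbors", "forces", "integrate"};
    for (int step = 0; step < num_steps; ++step) {
        for (const char* section : sections) {
            timer.start(section);
            for (int i = 0; i < 1000; ++i) sink = sink + i;
            timer.stop(section);
        }
    }
    return timer;
}
```

Calling `run_demo(1800).report()` prints one line per section. Running the same instrumented loop once with the OpenMP backend and once with CUDA would show which sections fail to speed up on the GPU.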
Thanks for the input, Sam. I will leave this open and continue with the rest of the code development. I will post an update once I have completed profiling the code on both architectures.
Hi all,
Similar to #748, this is also a question.
I have an HPC system with the following configuration:
an AMD Ryzen Threadripper PRO 5975WX (32 cores) and two NVIDIA RTX A5500 GPUs.
In issue #748 I mentioned that I am working on an SPH-DEM solver and have so far implemented the SPH solver (has bugs) and the DEM solver (done) independently (not coupled yet). I ran `settling_of_bodies_in_tank.cpp` on both parallel CPU cores and on the GPU to compare the run times. I used the following command to run it:
time ./examples/03RBBodiesSettling 0.1 1.0 1.0 1000 0.1 200
This uses a total of 1000 bodies and a total time of $0.1$ seconds, for a total of $1800$ steps. The 1000 rigid bodies resulted in $127436$ particles.
The total time taken is:
However, the ExaMPM code developed by the Cabana developers is much faster on the GPU than on parallel CPU cores. For comparison, I took the `DamBreak` example and ran it on the CPU with the following command:
time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 OpenMP
and for the GPU:
time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 CUDA
I get the following run times:
The CUDA run is almost 30 times faster than the CPU run. Unfortunately, I am unable to get a similar speedup with my own code. I believe I followed the best practices, so I am not sure why I am missing out on this performance boost. Is it because I am using two `AoSoA`s in my case, or is it something else? Is there a way to debug this performance issue? I am almost done with both the SPH and DEM codes; only the coupling is left to be added. Can you please help me with this issue?
Thank you so much. I will provide any additional information needed.