HiFiLES / HiFiLES-solver

High Fidelity Large Eddy Simulation Solver

Improving GPU utilization #13

Closed ejp-zz closed 10 years ago

ejp-zz commented 10 years ago

When I run the Taylor Green vortex test case I find that only 21% of the GPU (Tesla C2050) is utilized:

[ejeyapau@r219i0n0 ~]$ nvidia-smi
Mon Jun 23 17:34:39 2014
+------------------------------------------------------+
| NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
|-------------------------------+----------------------+----------------------+
| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
|===============================+======================+======================|
| 0.  Tesla M2090               | 0000:02:00.0     Off |       0          0   |
|  N/A   N/A   P0  137W / 225W  |   7%  383MB / 5375MB |  21%       Default   |
|-------------------------------+----------------------+----------------------|
| Compute processes:                                               GPU Memory |
|  GPU   PID     Process name                                      Usage      |
|=============================================================================|
|  0.  78932     ../../../bin/HiFiLES                                  370MB  |
+-----------------------------------------------------------------------------+

Does this mean that all 448 processing units on the Tesla C2050 card are being utilized? I am running this on a 12-core machine with one GPU card. For better utilization, here are a few options:

1. run multiple jobs on the same GPU
2. run a GPU job (which uses 1 core) and a CPU job utilizing the remaining 11 cores

Are both good options?

Or rather, is there a way to estimate the optimal GPU requirements (memory and load) for a given problem? Sorry, the question is not directly related to the code. Thanks, Elbert

CottonTensor commented 10 years ago

The GPU utilization that nvidia-smi shows is the percentage of time over the last sample period during which one or more kernels were executing on the GPU, which is not a very good measure of how fully the GPU is being used. You can read more here:

http://stackoverflow.com/questions/5086814/how-is-gpu-and-memory-utilization-defined-in-nvidia-smi-results

http://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t
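
For reference, the same counters can be read programmatically through NVML, the library nvidia-smi is built on. The following is a minimal sketch (not part of HiFiLES; file name and build line are just an example) that queries the `nvmlUtilization_t` fields described in the second link:

```cpp
// Minimal sketch (not part of HiFiLES): read the same utilization counters
// that nvidia-smi reports, via NVML.
// Build with something like: nvcc nvml_util.cu -lnvidia-ml -o nvml_util
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlDevice_t device;
    nvmlUtilization_t util;

    if (nvmlInit() != NVML_SUCCESS) return 1;                         // start NVML
    if (nvmlDeviceGetHandleByIndex(0, &device) != NVML_SUCCESS) return 1;

    // util.gpu    = % of the sample period during which a kernel was running
    // util.memory = % of the sample period during which device memory was accessed
    if (nvmlDeviceGetUtilizationRates(device, &util) == NVML_SUCCESS)
        printf("GPU util: %u%%  memory util: %u%%\n", util.gpu, util.memory);

    nvmlShutdown();
    return 0;
}
```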

So, this low statistic could very well be due to small amounts of thread divergence. Although we have largely tried to avoid thread divergence, there are still some cases where it can have an effect: some of the threads running a kernel may take longer than others because different "if" branches execute different code.
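
As a concrete picture of what divergence means here (a made-up kernel, not one from the HiFiLES source):

```cpp
// Made-up kernel (not from HiFiLES) showing branch divergence: threads in the
// same 32-thread warp that disagree on the condition execute the two branches
// one after the other, so part of the warp is idle during each branch.
__global__ void divergent_flux(const double* u, double* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (u[i] > 0.0) {
        f[i] = u[i];                 // cheap branch
    } else {
        f[i] = exp(u[i]) - 1.0;      // expensive branch holds up the whole warp
    }
}
```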

In order to get a good idea of utilization you will have to do complete profiling. As the people in the thread say, utilization is not a "how well you're using the resources" statistic but a "whether you're using the resources" one.
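
The NVIDIA profilers (nvprof, the Visual Profiler) give the full picture, but as a rough first step one can time individual kernels with CUDA events. A self-contained sketch with a placeholder kernel, not HiFiLES code:

```cpp
// Sketch of per-kernel timing with CUDA events (placeholder kernel, not
// HiFiLES code); the same pattern can wrap any kernel launch.
#include <cstdio>

__global__ void dummy_kernel(double* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

int main()
{
    const int n = 1 << 20;
    double* d_x;
    cudaMalloc(&d_x, n * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```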

Let me just comment on the 2 options you suggested:

  1. This is a good idea. However, for most of the problems we run, the number of threads we launch (equal to the number of solution points or flux points) is usually large enough to occupy all the threads on the GPU. And if a kernel is not filling the GPU, what other work would there be to run concurrently anyway? Many of the kernels must finish completely before the next one can start, so you cannot simply overlap subsequent functions all the time (where overlap is possible, it is done).
  2. With explicit time stepping, the code architecture is such that GPUs are much faster than CPUs: each solution point's computation maps to a separate GPU thread (of which there are thousands), whereas on the CPU the same points are processed serially in for loops (see the sketch after this list). So in general we have found that a modest number of CPU cores cannot easily compete with a GPU. And since the GPU and CPU partitions would all need to communicate at the end of each step, the extra synchronization lag would only make the run slower than using the GPU alone!
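
To make the contrast in point 2 concrete, here is a schematic with hypothetical function names rather than the actual HiFiLES kernels:

```cpp
// Schematic contrast (hypothetical update, not the actual HiFiLES kernels):
// the same explicit update applied serially on the CPU versus one thread per
// solution point on the GPU.

// CPU: visit every solution point one after another.
void update_cpu(double* u, const double* residual, double dt, int n_pts)
{
    for (int i = 0; i < n_pts; ++i)
        u[i] += dt * residual[i];
}

// GPU: one thread per solution point; with hundreds of thousands of points
// the launch easily fills every multiprocessor on the card.
__global__ void update_gpu(double* u, const double* residual, double dt, int n_pts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pts)
        u[i] += dt * residual[i];
}

// Launch example:
//   update_gpu<<<(n_pts + 255) / 256, 256>>>(d_u, d_residual, dt, n_pts);
```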

We still have work to do on the GPU optimization and hopefully we will make it faster in time. Thanks for the insightful comments. They'll help us improve the code.

Cheers, Abhishek

ejp-zz commented 10 years ago

Hi Abhishek, thanks for the clarification. Please let us know when a more optimized GPU version becomes available. Elbert