Comparing Bede's GPU performance against consumer GPUs

Users of Bede may find that their recent consumer-grade GPUs outperform the V100s (and T4s) in Bede for certain tasks, which they may not expect.

In certain cases, this is a realistic outcome.

The documentation should be expanded to address this, explaining the potential reasons and how profilers can be used to identify which of them applies.

Some of the main / likely reasons for this are:

Consumer GPUs (and CPUs) typically have higher clock speeds than those used in HPC systems, which are clocked lower for power and thermal reasons. This means that, like-for-like, a single CUDA core in a V100 may be outperformed by a single CUDA core in a small consumer chip such as the GTX 1650.
More recent architectures may benefit user workloads compared to those found in Bede. For example, consumer Ampere GPUs can execute up to twice as many FP32 operations per SM per clock as Volta.
However, V100s have a much higher ratio of FP64 to FP32 throughput than consumer GPUs, and much higher memory bandwidth than GDDR6-based cards (some GDDR6X-based Ampere GPUs offer similar memory bandwidth).
Problem size. As V100s are large devices, with 80 streaming multiprocessors (SMs) and up to 2048 resident threads per SM, at least 160 blocks of 1024 threads are required to fully occupy the device (or 80 blocks for at least one block to potentially be resident on each SM; full occupancy is not always achievable or required). This may be significantly higher than on consumer GPUs. If a problem is not large enough to use a substantial fraction of the V100, a smaller GPU may appear to perform the same or faster.
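
The problem-size arithmetic above can be sketched in a few lines of Python. This is a rough illustration only: the `Gpu` class and the helper functions are hypothetical names introduced here, and the model considers only the thread and block limits per SM, ignoring register and shared memory limits that often reduce occupancy further. It also assumes blocks are spread evenly across SMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical container for the per-device limits used below."""
    name: str
    sms: int                 # number of streaming multiprocessors
    max_threads_per_sm: int  # maximum resident threads per SM
    max_blocks_per_sm: int   # maximum resident blocks per SM


# V100 (compute capability 7.0) limits.
V100 = Gpu("V100", sms=80, max_threads_per_sm=2048, max_blocks_per_sm=32)


def resident_blocks_per_sm(gpu: Gpu, threads_per_block: int) -> int:
    """Blocks that can be resident on one SM, by thread and block limits only."""
    return min(gpu.max_threads_per_sm // threads_per_block, gpu.max_blocks_per_sm)


def blocks_for_full_occupancy(gpu: Gpu, threads_per_block: int) -> int:
    """Blocks needed to fill every SM with its maximum resident threads."""
    return gpu.sms * resident_blocks_per_sm(gpu, threads_per_block)


def fraction_of_sms_used(gpu: Gpu, blocks: int) -> float:
    """Fraction of SMs holding at least one block (assumes an even spread)."""
    return min(blocks, gpu.sms) / gpu.sms


if __name__ == "__main__":
    for threads in (1024, 256):
        needed = blocks_for_full_occupancy(V100, threads)
        print(f"{threads} threads/block: {needed} blocks to fill the V100")
    print(f"10 blocks use {fraction_of_sms_used(V100, 10):.1%} of the SMs")
```

With 1024-thread blocks this gives the 160 blocks quoted above; smaller blocks raise the number of blocks required, and a 10-block launch leaves most of the device idle.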

The advice for understanding this should be to use the NVIDIA profiling tools Nsight Systems (nsys) and Nsight Compute (ncu). Basic usage of these is already described in a guide, which links to additional resources on using them.

However, it may be beneficial to expand or complement this with a worked example showing when a consumer GPU could be faster than a V100, and how the profilers can be used to illustrate this (e.g. a simple problem that is big enough to use ~10 SMs, but not 80).
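
As a starting point for such a worked example, the toy model below estimates when a small GPU could win. It is a sketch under strong assumptions, not a substitute for profiling: each SM is assumed to process one block at a time, every block is assumed to cost the same fixed number of cycles, and memory bandwidth is ignored entirely. The SM counts are device specifications, the clock values are approximate nominal boost clocks, and `estimated_time` is a hypothetical helper.

```python
import math

# (SM count, approximate nominal boost clock in Hz); the clocks are assumptions.
V100 = (80, 1.53e9)
GTX_1650 = (14, 1.665e9)


def estimated_time(gpu, blocks, cycles_per_block):
    """Toy estimate: blocks run in waves of one block per SM.

    Ignores memory bandwidth, multiple resident blocks per SM and
    architectural differences, so it only illustrates the trend.
    """
    sms, clock_hz = gpu
    waves = math.ceil(blocks / sms)
    return waves * cycles_per_block / clock_hz


if __name__ == "__main__":
    cycles = 1_000_000  # arbitrary per-block cost chosen for illustration
    for blocks in (10, 8000):
        t_v100 = estimated_time(V100, blocks, cycles)
        t_1650 = estimated_time(GTX_1650, blocks, cycles)
        faster = "GTX 1650" if t_1650 < t_v100 else "V100"
        print(f"{blocks} blocks: V100 {t_v100:.2e} s, "
              f"GTX 1650 {t_1650:.2e} s, faster: {faster}")
```

In this model a 10-block problem fits in a single wave on both devices, so the higher clock wins, while a problem with thousands of blocks needs far fewer waves on the V100. A real worked example would confirm the same effect with nsys and ncu measurements.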
