ekondis / gpumembench

A GPU benchmark suite for assessing on-chip GPU memory bandwidth
GNU General Public License v2.0

nvprof validation of throughput and results (shmem_bench on K80) #1

Closed tdd11235813 closed 7 years ago

tdd11235813 commented 7 years ago

Hi, thanks for your nice benchmark tool! Just sharing some results and thoughts here. This is the nvprof output for shmembench on a K80 GPU:

nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request,shared_load_throughput,shared_store_throughput,l1_shared_utilization ./shmembench
                           Metric Description         Min         Max         Avg
(float4)
  Shared Memory Load Transactions Per Request    2.000000    2.000000    2.000000
 Shared Memory Store Transactions Per Request    2.000000    2.000000    2.000000
                Shared Memory Load Throughput  1262.1GB/s  1262.1GB/s  1262.1GB/s
               Shared Memory Store Throughput  1262.1GB/s  1262.1GB/s  1262.1GB/s
                 L1/Shared Memory Utilization    Max (10)    Max (10)    Max (10)
(float2)
  Shared Memory Load Transactions Per Request    1.000000    1.000000    1.000000
 Shared Memory Store Transactions Per Request    1.000000    1.000000    1.000000
                Shared Memory Load Throughput  1316.3GB/s  1316.3GB/s  1316.3GB/s
               Shared Memory Store Throughput  1316.3GB/s  1316.3GB/s  1316.3GB/s
                 L1/Shared Memory Utilization    Max (10)    Max (10)    Max (10)
(float)
  Shared Memory Load Transactions Per Request    1.000000    1.000000    1.000000
 Shared Memory Store Transactions Per Request    1.000000    1.000000    1.000000
                Shared Memory Load Throughput  1312.6GB/s  1328.3GB/s  1325.2GB/s
               Shared Memory Store Throughput  1312.6GB/s  1328.3GB/s  1325.2GB/s
                 L1/Shared Memory Utilization    Max (10)    Max (10)    Max (10)

128bit incurs a 2-way bank conflict as expected. 32bit and 64bit have no bank conflicts. Throughput is always maximal.

[...]
Kernel execution time
        benchmark_shmem  (32bit):    60.088 msecs
        benchmark_shmem  (64bit):    30.345 msecs
        benchmark_shmem (128bit):    31.664 msecs
[...]
Memory throughput
        using  32bit operations   : 1430.39 GB/sec (357.60 billion accesses/sec)
        using  64bit operations   : 2832.41 GB/sec (354.05 billion accesses/sec)
        using 128bit operations   : 2714.41 GB/sec (169.65 billion accesses/sec)
        peak operation throughput :  357.60 Giga ops/sec
[...]

Kepler uses a 64-bit bank width, so the 32bit benchmark pass yields only half the bandwidth. Loads and stores occur in equal amounts (6 stores for "init", 6 loads for "reduction", 2 loads+stores for "swap"). ~The results above imply that loads and stores can happen in parallel (overlapped?).~ To avoid confusion, a different benchmark with two separate runs, one for load throughput and one for store throughput, would help. So a separation into load and store throughput benchmarks would be nice to have, and validation of the computed results would also be nice, to ensure that the compiler did not optimize a shared memory transaction away. A sketch of what I have in mind follows below.
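
For illustration, a minimal sketch of the separated passes (hypothetical kernels, not the shmembench code; `shmem_load_only`, `shmem_store_only`, `BLOCK_SIZE` and `ITERATIONS` are placeholders):

```cuda
#define BLOCK_SIZE 256
#define ITERATIONS 1024

// Load-only pass: each thread sums ITERATIONS shared loads and writes the
// sum to global memory, so the loads cannot be optimized away and the host
// can validate the result against a CPU reference.
__global__ void shmem_load_only(float *out) {
    __shared__ float buf[BLOCK_SIZE];
    const int tid = threadIdx.x;
    buf[tid] = (float)tid;
    __syncthreads();
    float acc = 0.0f;
    for (int i = 0; i < ITERATIONS; i++)
        acc += buf[(tid + i) & (BLOCK_SIZE - 1)];
    out[blockIdx.x * blockDim.x + tid] = acc;
}

// Store-only pass: each thread issues ITERATIONS shared stores; the final
// read per thread keeps the buffer observable so the stores are not dead.
__global__ void shmem_store_only(float *out, float v) {
    __shared__ float buf[BLOCK_SIZE];
    const int tid = threadIdx.x;
    for (int i = 0; i < ITERATIONS; i++)
        buf[(tid + i) & (BLOCK_SIZE - 1)] = v + (float)i;
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = buf[tid];
}
```

Launched e.g. as `shmem_load_only<<<grid, BLOCK_SIZE>>>(d_out)`; whether the compiler really keeps every transaction is exactly what the nvprof counters (or a host-side check of `out`) would confirm.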

Best Regards.

PS: cudaThreadSynchronize is deprecated; you can use cudaDeviceSynchronize instead.
PPS: I also played with shared memory bank width and cache configurations, but they do not seem to have any effect (see the snippet below).
PPPS: Is it possible to compute the theoretical peak bandwidth of shared memory?
PPPPS: The computation of the memory clock rate seems to be wrong; just remove the division by two.
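
For reference, this is roughly what I mean by the PS/PPS points (a sketch using the standard CUDA runtime calls, not a patch against shmembench):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // PPS: the bank width / cache configuration calls I experimented with
    // (no measurable effect on the K80 in my runs)
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte); // 64-bit banks (Kepler)
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);          // favor shared memory over L1

    // ... launch the benchmark kernels here ...

    // PS: prefer cudaDeviceSynchronize over the deprecated cudaThreadSynchronize
    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}
```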

Update: sorry, that was a wrong interpretation of the results^^. If all shared memory transactions were loads, would nvprof then show the peak shared memory bandwidth? Regarding PPPS, the peak bandwidth per SM is core_freq * banks * bank_width, so the overall peak would be about 2.7 TB/s (core_freq 823 MHz, 32 banks of 8 bytes, 13 SMs); see the sketch below. It would be nice to have this theoretical bandwidth printed next to the measured bandwidth.
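
Spelled out as a small sketch (the 32-bank and 8-byte bank-width figures are Kepler assumptions, not something the runtime reports):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Kepler assumptions: 32 banks per SM, 8-byte (64-bit) bank width
    const double banks = 32.0, bank_width = 8.0;
    const double core_ghz = prop.clockRate / 1.0e6;   // clockRate is given in kHz

    const double peak_gbs = core_ghz * banks * bank_width * prop.multiProcessorCount;
    printf("theoretical shared memory bandwidth: %.1f GB/s\n", peak_gbs);
    // K80 example: 0.823 GHz * 32 * 8 B * 13 SMs ≈ 2740 GB/s ≈ 2.7 TB/s
    return 0;
}
```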

ekondis commented 7 years ago

First, thanks for the feedback.

Regarding your results, I can say they are normal; I've seen similar results on most Kepler GPUs.

Regarding your suggestion of having separate benchmarks for reading and writing, I think it is a plausible idea. However, as long as the theoretical bandwidth can be approached with reads+writes, I didn't bother with any further experiments. My guess is that the aggregate bandwidth will always be the same regardless of the access mix. If this proves wrong, then separate benchmarks would be useful.

Thanks for your PS suggestions. PS: I'll correct this the next time I update the code. PPPS: I think you've already answered this yourself. PPPPS: What memory clock rate does the benchmark report for your GPU?

Though it would be nice to report the theoretical peak, it would require embedding information for each available architecture (e.g. #banks, #LdSt units per SM); a rough sketch is below.
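
For illustration, such a table might look like this (just a sketch; the figures are assumptions from memory and would need checking against the programming guide and the architecture whitepapers):

```cuda
#include <cuda_runtime.h>

// Hypothetical per-architecture lookup of shared memory geometry.
struct SmemGeometry { int banks; int bank_width_bytes; };

static SmemGeometry smem_geometry(const cudaDeviceProp &p) {
    if (p.major == 1) return {16, 4};  // CC 1.x: 16 banks, 32-bit wide
    if (p.major == 3) return {32, 8};  // Kepler: 32 banks, 64-bit wide
    return {32, 4};                    // Fermi, Maxwell and later: 32 banks, 32-bit wide
}
```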

tdd11235813 commented 7 years ago

Hi, thank you very much for your consideration. The memory clock rate of the K80 is reported as

Memory clock rate:   1252 MHz

which actually should be 2505 MHz.

Regarding the theoretical peak of the shared memory bandwidth, the number of banks is 32 except for CC 1.x, which has 16 banks. It is not a must-have for me, as I can compute it on my own or check with nvprof whether the load or store utilization reaches "Max (10)".

What do you think about some validation of results?

Of course, separating the benchmark into load and store passes would look a little different, and it is probably better to use the device clock() routine for timing (whose overhead, in turn, must be considered). See here for an implementation:

However, it is not bad to have two different implementations of the shared memory bandwidth benchmark, as long as the assumptions behind them hold. This applies to the aggregation of load and store bandwidths as well as to the number of cycles charged to clock() and __syncthreads() in the aforementioned alternative benchmark; a rough sketch of the clock()-based timing follows below.
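
For completeness, this is roughly the kind of device-side timing I mean (a sketch with placeholder names; the clock64() reads and __syncthreads() themselves cost a few cycles, which is the overhead mentioned above):

```cuda
#define BLOCK_SIZE 256
#define ITERATIONS 4096

// Hypothetical kernel that times shared memory loads with the device
// cycle counter; not the shmembench code.
__global__ void time_shmem_loads(long long *cycles, float *sink) {
    __shared__ float buf[BLOCK_SIZE];
    const int tid = threadIdx.x;
    buf[tid] = (float)tid;
    __syncthreads();

    const long long start = clock64();
    float acc = 0.0f;
    for (int i = 0; i < ITERATIONS; i++)
        acc += buf[(tid + i) & (BLOCK_SIZE - 1)];
    buf[tid] = acc;        // consume acc so the loads stay inside the timed region
    __syncthreads();
    const long long stop = clock64();

    if (tid == 0) cycles[blockIdx.x] = stop - start;
    sink[blockIdx.x * blockDim.x + tid] = buf[tid];   // keep the work observable
}
```

The bytes loaded per block divided by the measured cycles, scaled by the SM clock and the number of SMs, would then be directly comparable to the core_freq * banks * bank_width figure from above.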

Update: PS: I cannot recommend the benchmark linked above as an alternative, since it is error-prone and uncomfortable to use, and it also aggregates loads+stores. ~So I still wonder why nvprof shows Max (10) at half the peak bandwidth (=1.35 TB/s). If a Kepler SM is able to consume 32x 8 bytes per cycle from shared memory, then the max load throughput should be 2.7 TB/s here (0.823 * 32 * 8 * 13)!?~ Update: the last thought of the last update was rubbish^^

ekondis commented 7 years ago

According to the GPU specs (https://www.techpowerup.com/gpudb/2616/tesla-k80m), the reported clock rate seems to be correct (1253 MHz in the specs).

I suspect that the utilization stays at peak as long as one issues a large number of load/store requests, regardless of the width of each request. Thus, 32-bit loads/stores are also capable of driving shared memory to its highest utilization, though working at half of its possible efficiency.

If you are interested in providing multiple implementations (read-only, read/write, write-only) for shmembench, I'd be glad to discuss merging them.

tdd11235813 commented 7 years ago

Ok, I see. What clock rate will be reported by the CUDA device properties on a non-GDDR5 GPU? A K80 has GDDR5 with quad data rate (QDR), so the memory clock rate property reports the effective DDR memory speed and not the real clock. The programming guide could not help me here, but perhaps you know more. nvidia-smi also reports the effective DDR memory speed.

List of Nvidia GPUs and specs

With your help and the Nvidia forum I now have these answers:

Regarding the read+write scenario, I follow your thoughts, i.e. "as long as the theoretical bandwidth can be approached with reads+writes I didn't bother with any further experiments".

Update: the utilization metrics do take the clock rates into account; to quote myself: "The L1/Shared Utilization refers to the clock set at runtime, which makes sense anyway. So it always shows e.g. "High (8)" at 562 MHz, 823 MHz and 875 MHz. This is also true for alu_fu_utilization and ldst_fu_utilization. [..] Of course, L1/Shared Utilization also includes L1 transactions, so for a pure metric one has to compute it based on the shared memory throughput values [..]"

ekondis commented 7 years ago

Vendor marketing tactics ;)

Your answers seem sensible to me.

tdd11235813 commented 7 years ago

Regarding the memory clock rate: if you have HBM2-equipped cards, then memoryClockRate already returns the real clock. See my post here.

I have also updated my post above about the utilization metric and the fact that the boost clock scenario is taken into account.

ekondis commented 7 years ago

I'm not sure I understand the contradiction. CUDA always seems to return the real memory clock instead of the effective one. Anyway, this is a CUDA-related issue.

tdd11235813 commented 7 years ago

I am not sure whether CUDA will add a real-clock property in the near future. For the common GPUs the rule seems to be that CUDA returns the DDR-effective clock when the card has GDDR5 and the real clock when it has HBM2 (to keep the theoretical bandwidth computation 2*cudaDeviceProp.memoryClockRate consistent). As a workaround one might check the memory bus width, which is 4096 bits for HBM2 whereas GDDR5 has 384 bits. Another way is to present not the real memory speed but the effective memory speed, which can consistently be computed as 2*memoryClockRate; a sketch of both ideas is below.
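
For instance (a sketch; the 2x factor and the bus-width heuristic only encode the rule of thumb above, they are not an official CUDA guarantee):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is in kHz; 2*memoryClockRate is treated as the effective transfer rate
    const double effective_mhz = 2.0 * prop.memoryClockRate / 1000.0;
    const double peak_gbs = effective_mhz * 1.0e6 * (prop.memoryBusWidth / 8.0) / 1.0e9;

    // crude heuristic from the discussion: very wide buses indicate HBM/HBM2,
    // where memoryClockRate is the real clock; otherwise it is the DDR-effective clock
    const char *mem_type = (prop.memoryBusWidth >= 1024) ? "HBM/HBM2" : "GDDR";

    printf("effective memory speed: %.0f MHz (%s)\n", effective_mhz, mem_type);
    printf("theoretical DRAM bandwidth: %.1f GB/s\n", peak_gbs);
    return 0;
}
```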

Feel free to close this issue. Validation of the results would still be nice to have, but it is not important for me, as I use the shared memory throughputs reported by nvprof.

ekondis commented 7 years ago

Ok, thank you for the information.