ekondis / gpumembench

A GPU benchmark suite for assessing on-chip GPU memory bandwidth
GNU General Public License v2.0

Memory Performance on K80 + P100 + V100 #2

Open tdd11235813 opened 6 years ago

tdd11235813 commented 6 years ago

Hi,

I have used your nice benchmark tool again to compare Kepler K80, Pascal P100 and Volta V100 memory bandwidths.

I will look into the numbers myself, but maybe you already know the reasons, so we can discuss possible benchmark changes here.

K80 (unifies L1 cache and shared memory, 13 SMs)
===============================================
        Read only accesses:
                int1:    1245.34 GB/sec # ld.ca.u32
                int2:    1148.30 GB/sec # ld.ca.u64
                int4:    1176.73 GB/sec # ld.ca.v4.u32
                max:     1245.34 GB/sec
        Read-write accesses: # probably not reliable
        # alternating load, store, load, store, ...
        # load and store cache lines are different
                int1:     574.34 GB/sec # .., st.cs.global.u32
                int2:     574.73 GB/sec # .., st.cs.global.u64
                int4:     581.64 GB/sec # .., st.cs.global.v4.u32
                max:      581.64 GB/sec

# more results:
 L2 read:       333.02 GB/sec # best of ld.cg.u32|64|.v4.u32
 L2 read-write: 575.83 GB/sec # probably not reliable
 Shared Memory: 2680.82 GB/sec (best performance with 64-128bit words)
 Texture loads: 1330.79 GB/sec (best with 128bit words)
 constant cache: 3541.51 GB/sec (best with 64bit words)
 gmem throughput: 182.46 GB/s # non-cached writes-only

P100 (unifies L1 cache and texture cache, 56 SMs)
================================================
        Read only accesses:
                int1:    2116.58 GB/sec
                int2:    2317.34 GB/sec
                int4:    2379.57 GB/sec
                max:     2379.57 GB/sec
        Read-write accesses:
                int1:    2227.59 GB/sec
                int2:    2209.56 GB/sec
                int4:    2096.81 GB/sec
                max:     2227.59 GB/sec

# more results:
 L2 read:       1570.65 GB/sec
 L2 read-write: 1795.95 GB/sec
 Shared Memory: 7809.40 GB/sec (best performance with 32-128bit)
 Texture loads: 2379.58 GB/sec (best with 128bit words)
 constant cache: 9585.95 GB/sec (best with 64bit-128bit)
 gmem throughput: 577.53 GB/s

V100 (unifies L1 cache and shared memory and texture cache, 80 SMs)
================================================
        Read only accesses:
                int1:    7898.13 GB/sec
                int2:   12551.22 GB/sec
                int4:   13343.58 GB/sec
                max:    13343.58 GB/sec
        Read-write accesses:
                int1:    3206.65 GB/sec
                int2:    3145.10 GB/sec
                int4:    3197.08 GB/sec
                max:     3206.65 GB/sec

# more results:
 L2 read:       2615.99 GB/sec
 L2 read-write: 2611.28 GB/sec
 Shared Memory: 10025.63 GB/sec (best performance with 128bit)
 Texture loads: 3547.75 GB/sec (best with 128bit words)
 constant cache: 7596.33 GB/sec (best with 64bit-128bit)
 gmem throughput: 855.98 GB/s

Selected Peak Comparisons

K80 (unifies L1 cache and shared memory, 13 SMs)
===============================================
                 Throughput       Peak bandwidth
gdram:           182.46 GB/s      240.48 GB/s
(ECC traffic is not included in the measurement)

L1:              1245.34 GB/s     1369.472 GB/s
(peak L1 bandwidth = 13 SMs * 128 byte * 823 MHz)

Shared Memory:   2680.82 GB/s     2738.944 GB/s
(Kepler's shared memory has 32 banks of 64 bit, i.e. 256 byte/cycle per SM, so L1 at 128 byte/cycle is by design only half of the shared memory bandwidth: 13 SMs * 256 byte * 823 MHz = 2738.944 GB/s)

P100 (unifies L1 cache and texture cache, 56 SMs)
================================================
                 Throughput       Peak bandwidth
gdram:           577.53 GB/s      732.16 GB/s
L1:              2379.57 GB/s     ? (tex-cache bandwidth?)
Shared Memory:   7809.40 GB/s     9519.104 GB/s

V100 (unifies L1 cache and shared memory and texture cache, 80 SMs)
================================================
                 Throughput       Peak bandwidth
gdram:           855.98 GB/s      898.05 GB/s
L1:              13343.58 GB/s    14131.2 GB/s
Shared Memory:   10025.63 GB/s    14131.2 GB/s
ekondis commented 6 years ago

The constant memory benchmark involves unwanted computations (to avoid code elimination by the compiler) that potentially degrade performance, so the measurements may be affected.

Regarding the shared memory benchmark, could you please try a larger workload (by increasing VECTOR_SIZE in main.cpp)? Perhaps the current setting is too small for Volta. In my experience, shared memory bandwidth can be measured quite accurately with this tool.

Could you be more specific about your texture benchmark observations?

Update: Thank you for your kind comments.

tdd11235813 commented 6 years ago

Thanks for your quick response. The shared memory throughput improved after increasing VECTOR_SIZE to 1024*1024*64. Higher values, up to 1024*1024*256 (limited by the 32-bit data size), did not change the picture.
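(For reference, that is just the compile-time constant in the shared memory benchmark's main.cpp; assuming it is a plain define, the change is simply:)

#define VECTOR_SIZE (1024*1024*64)   // element count used by the shared memory benchmark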

P100

Shared Memory throughput
        using  32bit operations   : 8653.17 GB/sec (2163.29 billion accesses/sec)
        using  64bit operations   : 9066.01 GB/sec (1133.25 billion accesses/sec)
        using 128bit operations   : 9276.18 GB/sec (579.76 billion accesses/sec)
# peak bandwidth is: 9519.104 GB/s

V100

Shared Memory throughput
        using  32bit operations   : 9628.86 GB/sec (2407.21 billion accesses/sec)
        using  64bit operations   :11495.49 GB/sec (1436.94 billion accesses/sec)
        using 128bit operations   :12154.74 GB/sec (759.67 billion accesses/sec)
# peak bandwidth is: 14131.2 GB/s # so still some gap

Regarding the texture cache performance: V100 texture loads (3547.75 GB/s) should perform nearly the same as L1 or shared memory (12154.74 GB/s), since the texture cache lives in the same on-chip memory. Maybe some tweaking is required due to the unified L1/SMem/Tex cache (tuning guide). At the moment it looks like it is limited by L2 bandwidth.

Regarding constant memory: have you already tried benchmarking ld.const directly, like you did in the cachebench code (link, PTX doc)?

ld.const.s32    q, [constant_data];
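
For illustration, a wrapper along these lines might force that instruction (an untested sketch; constant_data and the explicit cvta.to.const conversion are my own assumptions, not code from cachebench):

__constant__ int constant_data[4096];   // hypothetical constant buffer

// Untested sketch: emit ld.const regardless of what the compiler would choose.
// The generic address of the __constant__ symbol is first converted to a
// constant-space address with cvta.to.const, then loaded with ld.const.s32.
__device__ __forceinline__ int ld_const_s32(const int *generic_ptr)
{
    int v;
    asm volatile("{\n\t"
                 ".reg .u64 cp;\n\t"
                 "cvta.to.const.u64 cp, %1;\n\t"
                 "ld.const.s32 %0, [cp];\n\t"
                 "}"
                 : "=r"(v)
                 : "l"(generic_ptr));
    return v;
}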

Other than that, the throughput improvements on the new cards are quite impressive.

ekondis commented 6 years ago

Perhaps it has to do with GPU core frequency scaling. I see that you estimate the peak bandwidth based on the boost frequency. Is it safe to say that this is the sustained frequency during the experiments? Maybe the issue is due to the clock ramp-up during execution.

In my experiments on recent consumer GPUs (e.g. the GTX-1060), the max L1 bandwidth is similar to the max texture bandwidth (e.g. 1188.20 vs 1158.31 GB/sec on the GTX-1060), due to the unified L1/texture cache, as you already said. I'm not sure, though, that this should also apply to shared memory performance. What bandwidth do you observe using L1 and texture memory on the P100 & V100?

The truth is that I haven't tried using inline PTX for constant memory. I'm afraid that if I still don't utilize the fetched data in the code, the compiler will eliminate the constant memory accesses.
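
The kind of minimal consumption I have in mind is a guarded accumulate, roughly like the following sketch (placeholder names, not the actual benchmark kernel; assumes 256 threads per block):

__constant__ int cdata[4096];   // placeholder constant buffer

__global__ void const_read(int *out, int guard)
{
    int sum = 0;
    #pragma unroll
    for (int i = 0; i < 4096; i += 256)
        sum += cdata[i + threadIdx.x];   // one add per constant load
    // 'guard' never equals 'sum' in practice, but the compiler cannot prove
    // that, so the store keeps every load (and the extra work stays at one add per load).
    if (sum == guard)
        out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}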

tdd11235813 commented 6 years ago

I would assume that the clock frequency affects performance only by a small percentage, as the runtimes are too short for the GPU to heat up. I haven't checked whether clock throttling happened during the runs, though. I'll look into this in the next few days.
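
(To check, a small NVML sampler running next to the benchmark should be enough; a sketch, to be linked with -lnvidia-ml:)

#include <nvml.h>
#include <cstdio>
#include <unistd.h>

// Minimal sketch: print the current SM clock every 100 ms while the benchmark
// runs in another shell, to spot ramp-up or throttling.
int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 100; ++i) {
        unsigned int sm_mhz = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        printf("%3d: SM clock %u MHz\n", i, sm_mhz);
        usleep(100 * 1000);
    }
    nvmlShutdown();
    return 0;
}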

On the V100, the four texture units per SM use the L1 data cache, so I would expect the same throughput. However, gpumembench only reaches about a quarter of the peak bandwidth at the moment.

ekondis commented 6 years ago

I've observed that sometimes the problem is an insufficient warm-up stage: the execution is too short, so the clock frequencies don't get the chance to stabilize at their boost values. Could you duplicate some particular kernel execution calls to ensure that you get uniform execution times?
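
Roughly what I mean, as a sketch with a generic launcher (not the benchmark's actual timing code):

#include <cuda_runtime.h>

// A few untimed warm-up launches let the clocks ramp up; the timed launches
// afterwards should then show uniform per-launch times.
template <typename LaunchFn>
float timed_repeat(LaunchFn launch, int warmup = 4, int reps = 8)
{
    for (int i = 0; i < warmup; ++i) launch();   // untimed warm-up
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) launch();     // timed, back-to-back
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;                            // average time per launch
}

e.g. timed_repeat([&]{ benchmark_func<int4, true, 256, 1, 1024><<<grid, block>>>(d_buf); }); with the benchmark's own grid, block and buffer.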

So, on the V100 you get a quarter of the bandwidth when using texture memory compared to L1/L2-cached global memory. I find this strange. I haven't read the Volta architecture details yet, as I don't have access to a Volta GPU :(

tdd11235813 commented 6 years ago

OK, clock rate and kernels are fine. Maybe the benchmark code needs updating; some lines look like they were written in the days of Fermi (cudaThreadSynchronize, bindTexture, ...). Nevertheless, that does not explain the low texture cache throughput. I ran nvprof --metrics tex_cache_throughput on it, which shows higher throughput; the maximum was around:

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-16GB (0)"
[...]
Kernel: void benchmark_func<int, bool=1, int=256, int=11, int=0>(int*)
          1                      tex_cache_throughput            Unified cache to SM throughput  5951.9GB/s  5951.9GB/s  5951.9GB/s
[...]

(I also checked the K80, where the results were consistent at first glance.)

Unfortunately, I could not find a metric for the constant cache, and I doubt tex_cache_throughput is meant for that (e.g. it would return zeros on the K80 anyway):

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-16GB (0)"
    Kernel: void benchmark_constant<int2>(int*)
          1                      tex_cache_throughput            Unified cache to SM throughput  7.7881GB/s  7.7881GB/s  7.7881GB/s
tdd11235813 commented 6 years ago

Regarding the texture cache: do you request a 128-byte row per warp? The texture cache loves 2D access patterns, so I would expect better bandwidth when a warp requests squares of memory. Maybe the texture cache implements some kind of space-filling curve, in which case you would lose performance if you just fetch a 1D line of memory. By the way, never mind the bindTexture remark above; I hadn't noticed that the texture reference API came back with Maxwell+ GPUs (I used to work with Kepler, where the texture object API was promoted).
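
Roughly what I have in mind, sketched with the texture object API (a made-up kernel, not gpumembench code): each warp fetches an 8x4 tile of a 2D int4 texture instead of a 1D line.

__global__ void tex2d_tile_read(cudaTextureObject_t tex, int4 *out, int width)
{
    // 8x4 lanes per warp: lane -> (tx, ty) inside the tile
    int lane = threadIdx.x & 31;
    int tx = lane & 7;
    int ty = lane >> 3;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    int tilesPerRow = width / 8;                    // assumes width % 8 == 0
    int x = (warp % tilesPerRow) * 8 + tx;
    int y = (warp / tilesPerRow) * 4 + ty;          // assumes the grid covers the texture exactly

    int4 v = tex2D<int4>(tex, x + 0.5f, y + 0.5f);  // unnormalized coordinates
    if (v.x == -1)                                  // never true for the test data,
        out[y * width + x] = v;                     // but keeps the fetch alive
}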

ekondis commented 6 years ago

Yes, texture memory is managed in the traditional way, as the benchmark had to run on GPUs as old as Fermi.

Each thread requests either a 32-, 64-, or 128-bit element, which amounts to 128-, 256-, or 512-byte accesses per warp. Of course, the texture cache favors 2D locality, but that does not explain consistently lower performance.

Could you profile the performance metrics tex_cache_transactions, l2_read_transactions & dram_read_transactions (element size: 16 bytes)? How do these compare to the GP100 results?

tdd11235813 commented 6 years ago

Finally found some time to measure the metrics :) For the full measurement results see the attachments. Hope it helps. Edit: the command used:

nvprof --csv --log-file k40.csv --metrics tex_cache_transactions,l2_read_transactions,dram_read_transactions,tex_cache_hit_rate,tex_cache_throughput ./cachebench-tex-loads

V100

void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_transactions  Unified cache to SM transactions    671093760
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  l2_read_transactions    L2 Read Transactions    11702
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  dram_read_transactions  Device Memory Read Transactions 618
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_hit_rate  Unified Cache Hit Rate  99.999237%
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_throughput    Unified cache to SM throughput  2984.178155GB/s

P100

void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_transactions  Unified Cache Transactions  939524096
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  l2_read_transactions    L2 Read Transactions    7270
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  dram_read_transactions  Device Memory Read Transactions 533
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_hit_rate  Unified Cache Hit Rate  99.999237%
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_throughput    Unified Cache Throughput    1998.494135GB/s

K40

void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_transactions  Texture Cache Transactions  251658240
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  l2_read_transactions    L2 Read Transactions    8990
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  dram_read_transactions  Device Memory Read Transactions 1054
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_hit_rate  Texture Cache Hit Rate  99.996948%
void benchmark_func<int4, bool=1, int=256, int=1, int=1024>(int4*)  tex_cache_throughput    Texture Cache Throughput    1330.198558GB/s

k40.txt p100.txt v100.txt

ekondis commented 6 years ago

Thanks for the data. I did investigate it, but on second thought I believe that in the Volta case it is not just the texture throughput that is reduced; the L1 throughput (~13343 GB/s) might be overestimated. Can you verify that it is consistent with the values reported by tex_cache_throughput (I guess this metric also measures L1 throughput, as on Pascal GPUs, since it is a unified cache)?

Next, could you check whether tex_utilization reaches Max (10) in the case that yields the highest throughput?

tdd11235813 commented 6 years ago

ah yes, tex_utilization is a good point to check!

# from nvprof --query-metrics
K40> tex_utilization:  The utilization level of the texture cache relative to the peak utilization on a scale of 0 to 10
P100> tex_utilization:  The utilization level of the unified cache relative to the peak utilization
V100> tex_utilization:  The utilization level of the unified cache relative to the peak utilization

I have rerun the benchmarks for ./cachebench-tex-loads, now including tex_utilization:

On the V100, tex cache utilization only reaches Mid (5) at a maximum of ~6 TB/s (with a 100% cache hit rate), while the P100 reaches Max (10) at 2 TB/s tex cache throughput. This means that Max (10) would correspond to a throughput of about 12 TB/s on the V100. It also means that the benchmark output is not consistent with the profiler values; it is more like half of what nvprof measures.

With V100's L1+SMem+Tex cache being unified, I calculate the peak bandwidth by:

80 SMs x 1380 MHz x 128 Byte = 14131 GB/s

which would almost fit the 13343.58 GB/s from the V100 L1 cache benchmark. And if we double the 6 TB/s (as we reached only Mid (5)), we also get closer to this peak bandwidth. (The cache access granularity might be 32 byte on the V100, just like on the P100.)

K40

k40.txt

void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_transactions | Texture Cache Transactions | 125829120 | 125829120 | 125829120
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | l2_read_transactions | L2 Read Transactions | 4722 | 4722 | 4722
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | dram_read_transactions | Device Memory Read Transactions | 1523 | 1523 | 1523
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_hit_rate | Texture Cache Hit Rate | 99.996948% | 99.996948% | 99.996948%
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_throughput | Texture Cache Throughput | 1323.325464GB/s | 1323.325464GB/s | 1323.325464GB/s
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_utilization | Texture Cache Utilization | Max (10) | Max (10) | Max (10)
Peak bandwidth measurements per element size and access type
        Read only accesses:
                int1:     710.43 GB/sec
                int2:    1326.05 GB/sec
                int4:    1428.25 GB/sec
                max:     1428.25 GB/sec
        Read-write accesses:
                int1:     616.27 GB/sec
                int2:     615.17 GB/sec
                int4:     612.02 GB/sec
                max:      616.27 GB/sec

P100

p100.txt

void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_transactions | Unified Cache Transactions | 469762048 | 469762048 | 469762048
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | l2_read_transactions | L2 Read Transactions | 7422 | 7422 | 7422
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | dram_read_transactions | Device Memory Read Transactions | 1265 | 1265 | 1265
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_hit_rate | Unified Cache Hit Rate | 99.998474% | 99.998474% | 99.998474%
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_throughput | Unified Cache Throughput | 2005.056753GB/s | 2005.056753GB/s | 2005.056753GB/s
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_utilization | Unified Cache Utilization | Max (10) | Max (10) | Max (10)
Peak bandwidth measurements per element size and access type
        Read only accesses:
                int1:    1064.87 GB/sec
                int2:    2379.02 GB/sec
                int4:    2380.25 GB/sec
                max:     2380.25 GB/sec
        Read-write accesses:
                int1:    2136.47 GB/sec
                int2:    2072.00 GB/sec
                int4:    2089.12 GB/sec
                max:     2136.47 GB/sec

V100

v100.txt

void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_transactions | Unified cache to SM transactions | 671093760 | 671093760 | 671093760
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | l2_read_transactions | L2 Read Transactions | 5222 | 5222 | 5222
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | dram_read_transactions | Device Memory Read Transactions | 1034 | 1034 | 1034
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_hit_rate | Unified Cache Hit Rate | 99.999237% | 99.999237% | 99.999237%
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_cache_throughput | Unified cache to SM throughput | 5961.977229GB/s | 5961.977229GB/s | 5961.977229GB/s
void benchmark_func<int, bool=1, int=256, int=1, int=8192>(int*) | tex_utilization | Unified Cache Utilization | Mid (5) | Mid (5) | Mid (5)
Peak bandwidth measurements per element size and access type
        Read only accesses:
                int1:    1743.23 GB/sec
                int2:    3527.85 GB/sec
                int4:    3532.37 GB/sec
                max:     3532.37 GB/sec
        Read-write accesses:
                int1:    3186.04 GB/sec
                int2:    3090.58 GB/sec
                int4:    3188.18 GB/sec
                max:     3188.18 GB/sec

I know it is not easy for you to just guess without playing around with the actual hardware. At the moment I can only try things out by running the profilings and providing the data. I'm not sure how much code change is required to request more memory per thread and to make the output consistent with the profiler. Hope the measurements help :)

ekondis commented 6 years ago

I can say that when using the 32-bit int type, the doubled profiled throughput is caused by the larger texture access granularity. Note that tex_cache_hit_rate approaches 50% when accessing large arrays, e.g. for the kernel "void benchmark_func<int, bool=1, int=256, int=64, int=0>(int*)". Normally it should be 0%, so this means texture elements are physically accessed at a minimum of 64 bit per thread. So when accessing 32-bit elements, the first access performs the actual texture element fetch and the second is served from the cache, but both are accounted for in tex_cache_throughput using double the requested size.

Regarding the max V100 throughput, I have the same question. I would next investigate the rest of the utilization metrics, i.e. all metrics of the form xxx_utilization. If one of them reaches Max (10), that could point to the bottleneck.

tdd11235813 commented 6 years ago

Quick reply: on the V100 I see Max (10) for dram_utilization, so it is global memory bound, while on the P100 and K40 the tex cache reaches Max (10).

ekondis commented 6 years ago

Are you sure about this? I see on your latest V100 results that DRAM read transactions are just 1034.

tdd11235813 commented 6 years ago

OK, the Max values were achieved by other instances, not by `int, bool=1, int=256, int=1, int=8192`. Here, every utilization metric reported Low except for the Mid of the tex utilization. I will rerun the V100 measurements with all metrics, stay tuned...

tdd11235813 commented 6 years ago

It looks like there is not enough data to fully utilize the texture cache, as I cannot see any bottleneck in the case mentioned above. Attached are all the metrics measured on the V100. v100_all.txt

tdd11235813 commented 6 years ago

Err, sorry, found it: the Texture Function Unit Utilization is the bottleneck!

Edit:

tex_cache_hit_rate | Unified Cache Hit Rate | 100.00% | 100.00% | 100.00%
stall_texture | Issue Stall Reasons (Texture) | 41.84% | 41.84% | 41.84%
void benchmark_func<int2, bool=1, int=256, int=1, int=8192>(int2*) | 1 | l2_utilization | L2 Cache Utilization | Low (1) | Low (1) | Low (1)
void benchmark_func<int2, bool=1, int=256, int=1, int=8192>(int2*) | 1 | tex_utilization | Unified Cache Utilization | Mid (5) | Mid (5) | Mid (5)
void benchmark_func<int2, bool=1, int=256, int=1, int=8192>(int2*) | 1 | ldst_fu_utilization | Load/Store Function Unit Utilization | Low (1) | Low (1) | Low (1)
void benchmark_func<int2, bool=1, int=256, int=1, int=8192>(int2*) | 1 | cf_fu_utilization | Control-Flow Function Unit Utilization | Low (1) | Low (1) | Low (1)
void benchmark_func<int2, bool=1, int=256, int=1, int=8192>(int2*) | 1 | tex_fu_utilization | Texture Function Unit Utilization | Max (10) | Max (10) | Max (10)
ekondis commented 6 years ago

Now, that's an exhaustive profiling run. Thanks. Though tex_fu_utilization is Max (10) for int2, the same metric is halved for int4, i.e. Mid (5). So it's not clear what the bottleneck is in this case; perhaps it is something not reflected by the provided metrics. Any idea is welcome.