JuliaLinearAlgebra / BLASBenchmarksGPU.jl

Benchmark BLAS libraries on GPUs
https://julialinearalgebra.github.io/BLASBenchmarksGPU.jl/stable/

A few style tweaks, and use the registered version of GemmKernels.jl #25

Closed by DilumAluthge 3 years ago

DilumAluthge commented 3 years ago

Fixes #12

DilumAluthge commented 3 years ago

Here's an example plot generated on a Titan V:

[Plot omitted: GEMM throughput (TFLOPS) vs. matrix size for CUBLAS, GemmKernels, and Tullio on a Titan V]

Here's the code used to generate it:

using BLASBenchmarksGPU
import CUDA
# Float16 × Float16 inputs, Float32 output matrix, on the CUDA backend
bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
import PyPlot  # plotting backend used by plotbench
BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
CUDA.versioninfo()  # record the CUDA environment
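
For reference, the operation being timed is the mixed-precision GEMM C = A×B with Float16 inputs and Float32 accumulation. Here's a minimal standalone sketch of the CUBLAS case, written directly against CUDA.jl's gemmEx! wrapper; BLASBenchmarksGPU's internal setup (algorithm selection, warm-up, sampling) may differ:

using CUDA

n = 4096
A = CUDA.rand(Float16, n, n)   # Float16 inputs on the GPU
B = CUDA.rand(Float16, n, n)
C = CUDA.zeros(Float32, n, n)  # Float32 accumulator/output

# C = 1*A*B + 0*C via cublasGemmEx (mixed-precision GEMM)
CUDA.CUBLAS.gemmEx!('N', 'N', true, A, B, false, C)

# Time one call; CUDA.@elapsed uses CUDA events, so it measures GPU time
t = CUDA.@elapsed CUDA.CUBLAS.gemmEx!('N', 'N', true, A, B, false, C)
2 * n^3 / t / 1e12  # TFLOPS, counting 2n^3 flops per GEMM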

And here's the output, including all of the TFLOPS values:

julia> using BLASBenchmarksGPU

julia> import CUDA

julia> bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
Progress: 100%|███████████████████████████████████████████████████| Time: 0:08:24
  Size:         16384
  CUBLAS:       62.9 TFLOPS
  GemmKernels:  66.19 TFLOPS
  Tullio:       0.29 TFLOPS
Benchmark Result of Matrix{Float32}=Matrix{Float16}×Matrix{Float16}
24×4 DataFrame
 Row │ Size   Library      TFLOPS      Time (ns)
     │ Int64  Symbol       Float64     Float64
─────┼─────────────────────────────────────────────────
   1 │   128  CUBLAS        0.148099    28321.0
   2 │   128  GemmKernels   0.0332214  126253.0
   3 │   128  Tullio        0.0540789   77559.0
   4 │   256  CUBLAS        0.642928    52190.0
   5 │   256  GemmKernels   0.268756   124851.0
   6 │   256  Tullio        0.310169   108181.0
   7 │   512  CUBLAS        4.36686     61471.0
   8 │   512  GemmKernels   2.02769    132385.0
   9 │   512  Tullio        0.677953   395950.0
  10 │  1024  CUBLAS       25.168       85326.0
  11 │  1024  GemmKernels  14.2662     150529.0
  12 │  1024  Tullio        0.850125        2.52608e6
  13 │  2048  CUBLAS       59.5005     288735.0
  14 │  2048  GemmKernels  36.4345     471527.0
  15 │  2048  Tullio        0.76592         2.24304e7
  16 │  4096  CUBLAS       84.8186          1.62039e6
  17 │  4096  GemmKernels  57.559           2.38779e6
  18 │  4096  Tullio        0.393097        3.49631e8
  19 │  8192  CUBLAS       90.1617          1.21949e7
  20 │  8192  GemmKernels  59.9127          1.83519e7
  21 │  8192  Tullio        0.324503        3.38829e9
  22 │ 16384  CUBLAS       62.9003          1.39842e8
  23 │ 16384  GemmKernels  66.1909          1.3289e8
  24 │ 16384  Tullio        0.294184        2.98999e10

julia> import PyPlot

julia> BLASBenchmarksGPU.plotbench(bench_result, "plot.png")

julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-beta1
- LLVM: 11.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_VERBOSE: true

1 device:
  0: TITAN V (sm_70, 658.500 MiB / 11.784 GiB available)
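
A note on units: the Time column in the DataFrame above is in nanoseconds. That's an inference from the numbers rather than from the docs: counting 2n³ floating-point operations per n×n GEMM, the printed TFLOPS values follow directly from the printed times. A quick check (tflops here is a hypothetical helper, not part of BLASBenchmarksGPU):

# TFLOPS from matrix size n and time in nanoseconds
tflops(n, t_ns) = 2 * n^3 / (t_ns * 1e3)

tflops(16384, 1.39842e8)  # ≈ 62.9, the CUBLAS row at n = 16384
tflops(16384, 1.3289e8)   # ≈ 66.2, the GemmKernels row at n = 16384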