TensorBFS / CuTropicalGEMM.jl

The fastest Tropical number matrix multiplication on GPU
MIT License

Failure of BenchmarkTools #19

Open ArrogantGao opened 1 year ago

ArrogantGao commented 1 year ago

BenchmarkTools does not report correct timings:

julia> using TropicalNumbers, CUDA, BenchmarkTools, LinearAlgebra, CuTropicalGEMM

julia> a = Tropical.(CUDA.randn(4096, 4096));

julia> @btime $a * $a;
  3.375 μs (7 allocations: 256 bytes)

julia> @benchmark $a * $a
BenchmarkTools.Trial: 158 samples with 8 evaluations.
 Range (min … max):   3.554 μs …    1.733 s  ┊ GC (min … max): 0.00% … 0.07%
 Time  (median):      3.976 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.475 ms ± 137.779 ms  ┊ GC (mean ± σ):  0.06% ± 0.01%

  █                                                          ▄
  █▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▄
  3.55 μs       Histogram: log(frequency) by time      13.5 ms <

 Memory estimate: 256 bytes, allocs estimate: 7.

Compared with the results obtained directly from the C-CUDA tests, the result of @benchmark is correct.
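
For context, GPU work is generally launched asynchronously, so timing a call without waiting for it to finish can measure only the launch overhead. The following is a CPU-only analogy built on Julia tasks, not the library's code. The `launch` helper is hypothetical and stands in for an asynchronous kernel launch.

```julia
# CPU-only analogy using Julia tasks; no GPU or CuTropicalGEMM involved.
# `launch` is a hypothetical stand-in for an asynchronous kernel launch.
launch(seconds) = @async sleep(seconds)

wait(launch(0.001))                     # warm-up so compilation is not timed

t_launch = @elapsed launch(0.1)         # returns as soon as the task is scheduled
t_synced = @elapsed wait(launch(0.1))   # waits until the work has finished

println("launch only: ", t_launch, " s")
println("with wait:   ", t_synced, " s")
```

The first number only reflects how long it takes to schedule the work, which is the kind of figure a benchmark may report if the synchronization does not cover the stream the kernel runs on.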

GiggleLiu commented 1 year ago

Could the issue be related to CuStream? The CUDA.synchronize() function takes a CuStream as its input argument, so you may need to pass the CUDA.jl CuStream to the C code.

In your current code, the stream id seems to be the default value 0.

GiggleLiu commented 1 year ago

julia> @time (mul!(b, a, a); CUDA.synchronize(CUDA.context()))
  0.093036 seconds (2 allocations: 48 bytes)

This one works. Passing the CuContext synchronizes all streams in the context, which is a heavyweight call. It would be better if the stream id could be passed as an input.
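
The difference between the two synchronization levels can be sketched as below. This is a minimal sketch that assumes a working GPU and the packages loaded earlier in the thread. The output buffer `b` is not defined in the snippet above, so it is allocated here as an assumption. If the kernel really runs on stream 0, the stream-level timing may come out far too small.

```julia
using CUDA, LinearAlgebra, TropicalNumbers, CuTropicalGEMM

a = Tropical.(CUDA.randn(4096, 4096))
b = similar(a)   # output buffer; assumed, not shown in the original snippet

# Warm-up run so that compilation is excluded from the timings.
mul!(b, a, a); CUDA.synchronize(CUDA.context())

# Stream-level: waits only for the task-local stream that CUDA.jl launches on.
t_stream = @elapsed (mul!(b, a, a); CUDA.synchronize(CUDA.stream()))
CUDA.synchronize(CUDA.context())   # drain any work still pending elsewhere

# Context-level: waits for every stream in the context (the heavier call).
t_context = @elapsed (mul!(b, a, a); CUDA.synchronize(CUDA.context()))

println("stream sync:  ", t_stream, " s")
println("context sync: ", t_context, " s")
```

If the two timings agree, the stream-level call is enough. A large gap would be consistent with the kernel running on a different stream than the one CUDA.jl synchronizes.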

GiggleLiu commented 11 months ago

The current benchmark in the README does not look good. When reading a benchmark, we always look at the minimum time, because it best reflects the true performance.

julia> using CuTropicalGEMM

julia> @benchmark CUDA.@sync $a * $a
BenchmarkTools.Trial: 93 samples with 4 evaluations.
 Range (min … max):   6.653 μs … 158.961 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.535 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.499 ms ±  15.867 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                             █  
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  6.65 μs         Histogram: frequency by time         13.5 ms <

 Memory estimate: 256 bytes, allocs estimate: 7.