TensorBFS / TropicalGEMM.jl

The fastest tropical matrix multiplication in the world!
MIT License
29 stars 2 forks source link

Fix cuarray dispatch & reduce TTFX #22

Closed GiggleLiu closed 1 year ago

GiggleLiu commented 1 year ago

Before

julia> @time (using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); x * y)
 10.316412 seconds (18.54 M allocations: 1.056 GiB, 3.12% gc time, 92.85% compilation time)

After

julia> @time (using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); x * y)
  1.196553 seconds (2.16 M allocations: 182.519 MiB, 5.15% gc time, 1.17% compilation time)

This is amazing speed up, which greatly improves the usability of Octavian and TropicalGEMM.

@chriselrod I notice that in Octavian, PrecompileTools is imported but not used. Do you have special concerns like compatibility or binary size? Otherwise, I just confirmed the following speed up of TTFX after the precompilation:

julia> @time (using Octavian; T = Float64; x = rand(T, 10, 10); y = rand(T, 10, 10); Octavian.matmul(x, y))
  0.796285 seconds (1.38 M allocations: 93.804 MiB, 4.00% gc time, 1.78% compilation time)

Before this change, the TTFX is 6s on my machine.

Update

The binary size is also "amazing":

-rw-rw-r-- 1 leo leo     19466 Jun 30 14:24 6vpGR_SKm48.ji
-rwxrwxrwx 1 leo leo 486264528 Jun 30 14:24 6vpGR_SKm48.so
GiggleLiu commented 1 year ago

Since this is an easy fix and the tests pass. I will merge this PR directly.

chriselrod commented 1 year ago

I forgot to remove it in https://github.com/JuliaLinearAlgebra/Octavian.jl/pull/181 Precompilation often results in dynamic dispatches in the precompiled code that goes away if you don't precompile it.

GiggleLiu commented 1 year ago

I can not reproduce the performance issue. I did not see dynamic dispatch in the benchmark so I will keep the precompilation for now. The bug in Octavian is really wield, I can benchmark it on my machine for you to see if it is CPU architecture dependent.

Before

julia> using BenchmarkTools

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
BenchmarkTools.Trial: 10000 samples with 900 evaluations.
 Range (min … max):  124.147 ns …  1.115 μs  ┊ GC (min … max): 0.00% … 68.38%
 Time  (median):     132.959 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   150.138 ns ± 99.103 ns  ┊ GC (mean ± σ):  7.41% ±  9.60%

  █▄▂▂▁                                                        ▁
  ██████▇▇▆▆▆▆▆▄▁▁▁█▆▄▃▃▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▅▆▄▅▅▄▆▆▅▆ █
  124 ns        Histogram: log(frequency) by time       858 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 70 samples with 1 evaluation.
 Range (min … max):  69.017 ms … 85.843 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     70.654 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   71.431 ms ±  2.633 ms  ┊ GC (mean ± σ):  0.17% ± 0.44%

     █                                                         
  ▄▆▄█▇█▃▇▄▆▁▇▄▆▃▄▁▆▆▁▄▃▁▄▁▁▄▁▃▃▁▁▃▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  69 ms           Histogram: frequency by time        79.8 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

After Precompiling

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
[ Info: Precompiling TropicalGEMM [a4ad3063-64a7-4bad-8738-34ed09bc0236]
BenchmarkTools.Trial: 10000 samples with 888 evaluations.
 Range (min … max):  127.241 ns … 786.229 ns  ┊ GC (min … max): 0.00% … 52.64%
 Time  (median):     134.123 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.204 ns ±  61.213 ns  ┊ GC (mean ± σ):  3.81% ±  7.61%

  █▇▄▃▂▂▂▂▂                                                     ▂
  ██████████▇▆▇▆▆▅▅▆▅▅▅▆▅▃▃▄▁▁▅█▆▆▃▄▃▁▃▁▄▃▃▁▃▄▃▄▁▅▅▅▄▅▄▅▆▅▄▅▅▄▅ █
  127 ns        Histogram: log(frequency) by time        557 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 72 samples with 1 evaluation.
 Range (min … max):  68.306 ms … 73.864 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     68.889 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   69.528 ms ±  1.179 ms  ┊ GC (mean ± σ):  0.17% ± 0.44%

   ▃ █▅ ▂                                                      
  ▄█▄████▇▄▇▄▁▁▄▁▄▄▄▁▄▄▄▁▁▅▄▅▁▄▄▅▄▅▄▄▅▅▄▄▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄ ▁
  68.3 ms         Histogram: frequency by time        72.7 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.
GiggleLiu commented 1 year ago

I used your script for benchmarking Octavian, but did not see any dispatch issue. The Julia version is 1.9.1, system version is Ubuntu 22.04, CPU is Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz

Without precompile,

julia> using BenchmarkTools

julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]
A^[[A
julia> A = rand(-1_000:1_000, 200, 200);

julia> B = rand(-1_000:1_000, 200, 200);

julia> C = similar(A);

julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);

julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  278.507 μs … 524.333 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     286.815 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   289.723 μs ±  14.144 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▃   ▅█▅▄▃▃▂▂▁▁                                               ▂
  ██▆▃▄███████████▇█▇▇▇▆▆█████▇▆▆▅▆▆▆▆▆▆▆▆▆▆▆▇▅▅▆▆▅▅▆▆▄▄▄▃▅▅▁▃▅ █
  279 μs        Histogram: log(frequency) by time        353 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

With precompile

julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]

julia> A = rand(-1_000:1_000, 200, 200);

julia> B = rand(-1_000:1_000, 200, 200);

julia> C = similar(A);

julia> using BenchmarkTools

julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);

julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  283.227 μs … 553.582 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     285.139 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   286.645 μs ±  12.436 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▅█▇█▆▅▄▄▃▃▃▃▂▂▂▁▁▁ ▁▁                                        ▂
  ███████████████████████▇██▇▆▇▇▇▇▆▆▅▅▆▅▄▄▅▄▃▄▅▃▃▄▅▅▃▅▅▃▄▃▁▄▄▅▃ █
  283 μs        Histogram: log(frequency) by time        309 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Precompiling script

@setup_workload begin
    # Putting some things in `@setup_workload` instead of `@compile_workload` can reduce the size of the
    # precompile file and potentially make loading faster.
    @compile_workload begin
        for T in (Float32, Float64)
            x = rand(T, 10, 10)
            y = rand(T, 10, 10)
            Octavian.matmul(x, y)
        end
    end
end
chriselrod commented 1 year ago

I was on Julia master. I hadn't seen the problem before in Octavian, but have seen it often enough elsewhere that I didn't want to spend any time on it, and just disabled the precompilation.