Closed GiggleLiu closed 1 year ago
Since this is an easy fix and the tests pass, I will merge this PR directly.
I forgot to remove it in https://github.com/JuliaLinearAlgebra/Octavian.jl/pull/181. Precompilation often results in dynamic dispatches in the precompiled code that go away if you don't precompile it.
I cannot reproduce the performance issue. I did not see dynamic dispatch in the benchmark, so I will keep the precompilation for now. The bug in Octavian is really weird; I can benchmark it on my machine for you to see whether it is CPU-architecture dependent.
julia> using BenchmarkTools
julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
BenchmarkTools.Trial: 10000 samples with 900 evaluations.
Range (min … max): 124.147 ns … 1.115 μs ┊ GC (min … max): 0.00% … 68.38%
Time (median): 132.959 ns ┊ GC (median): 0.00%
Time (mean ± σ): 150.138 ns ± 99.103 ns ┊ GC (mean ± σ): 7.41% ± 9.60%
█▄▂▂▁ ▁
██████▇▇▆▆▆▆▆▄▁▁▁█▆▄▃▃▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▅▆▄▅▅▄▆▆▅▆ █
124 ns Histogram: log(frequency) by time 858 ns <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 70 samples with 1 evaluation.
Range (min … max): 69.017 ms … 85.843 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 70.654 ms ┊ GC (median): 0.00%
Time (mean ± σ): 71.431 ms ± 2.633 ms ┊ GC (mean ± σ): 0.17% ± 0.44%
█
▄▆▄█▇█▃▇▄▆▁▇▄▆▃▄▁▆▆▁▄▃▁▄▁▁▄▁▃▃▁▁▃▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
69 ms Histogram: frequency by time 79.8 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
[ Info: Precompiling TropicalGEMM [a4ad3063-64a7-4bad-8738-34ed09bc0236]
BenchmarkTools.Trial: 10000 samples with 888 evaluations.
Range (min … max): 127.241 ns … 786.229 ns ┊ GC (min … max): 0.00% … 52.64%
Time (median): 134.123 ns ┊ GC (median): 0.00%
Time (mean ± σ): 148.204 ns ± 61.213 ns ┊ GC (mean ± σ): 3.81% ± 7.61%
█▇▄▃▂▂▂▂▂ ▂
██████████▇▆▇▆▆▅▅▆▅▅▅▆▅▃▃▄▁▁▅█▆▆▃▄▃▁▃▁▄▃▃▁▃▄▃▄▁▅▅▅▄▅▄▅▆▅▄▅▅▄▅ █
127 ns Histogram: log(frequency) by time 557 ns <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 72 samples with 1 evaluation.
Range (min … max): 68.306 ms … 73.864 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 68.889 ms ┊ GC (median): 0.00%
Time (mean ± σ): 69.528 ms ± 1.179 ms ┊ GC (mean ± σ): 0.17% ± 0.44%
▃ █▅ ▂
▄█▄████▇▄▇▄▁▁▄▁▄▄▄▁▄▄▄▁▁▅▄▅▁▄▄▅▄▅▄▄▅▅▄▄▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄ ▁
68.3 ms Histogram: frequency by time 72.7 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
I used your script for benchmarking Octavian, but did not see any dispatch issue. The Julia version is 1.9.1, the OS is Ubuntu 22.04, and the CPU is Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz.
julia> using BenchmarkTools
julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]
julia> A = rand(-1_000:1_000, 200, 200);
julia> B = rand(-1_000:1_000, 200, 200);
julia> C = similar(A);
julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);
julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 278.507 μs … 524.333 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 286.815 μs ┊ GC (median): 0.00%
Time (mean ± σ): 289.723 μs ± 14.144 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃ ▅█▅▄▃▃▂▂▁▁ ▂
██▆▃▄███████████▇█▇▇▇▆▆█████▇▆▆▅▆▆▆▆▆▆▆▆▆▆▆▇▅▅▆▆▅▅▆▆▄▄▄▃▅▅▁▃▅ █
279 μs Histogram: log(frequency) by time 353 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]
julia> A = rand(-1_000:1_000, 200, 200);
julia> B = rand(-1_000:1_000, 200, 200);
julia> C = similar(A);
julia> using BenchmarkTools
julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);
julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 283.227 μs … 553.582 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 285.139 μs ┊ GC (median): 0.00%
Time (mean ± σ): 286.645 μs ± 12.436 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▅█▇█▆▅▄▄▃▃▃▃▂▂▂▁▁▁ ▁▁ ▂
███████████████████████▇██▇▆▇▇▇▇▆▆▅▅▆▅▄▄▅▄▃▄▅▃▃▄▅▅▃▅▅▃▄▃▁▄▄▅▃ █
283 μs Histogram: log(frequency) by time 309 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
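For readers who want to look for dispatch-prone code themselves, one rough approach is to ask whether type inference gives a concrete return type. The following is a minimal sketch using only Base. The names `stable_sign`, `unstable_sign`, and `returns_concrete` are hypothetical and are not part of Octavian or TropicalGEMM. Note the limits of this check: a concrete return type does not rule out dynamic dispatch inside a function, and this is not necessarily how the benchmarks above were inspected.

```julia
# Hypothetical helper: true if every inferred return type for the given
# argument types is concrete (a non-concrete result often leads to
# dynamic dispatch in callers).
returns_concrete(f, argtypes) = all(isconcretetype, Base.return_types(f, argtypes))

# Always returns a Float64, so inference yields a concrete type.
stable_sign(x) = x > 0 ? 1.0 : -1.0

# Returns an Int on one branch and a Float64 on the other, so inference
# yields a Union, which is not a concrete type.
unstable_sign(x) = x > 0 ? 1 : -1.0

println(returns_concrete(stable_sign, (Int,)))
println(returns_concrete(unstable_sign, (Int,)))
```

The same helper could be pointed at a real call, for example `returns_concrete(Octavian.matmul, (Matrix{Float64}, Matrix{Float64}))`, assuming Octavian is loaded.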
@setup_workload begin
    # Putting some things in `@setup_workload` instead of `@compile_workload` can reduce the size of the
    # precompile file and potentially make loading faster.
    @compile_workload begin
        for T in (Float32, Float64)
            x = rand(T, 10, 10)
            y = rand(T, 10, 10)
            Octavian.matmul(x, y)
        end
    end
end
I was on Julia master. I hadn't seen the problem before in Octavian, but have seen it often enough elsewhere that I didn't want to spend any time on it, and just disabled the precompilation.
Before
After
This is an amazing speed-up, which greatly improves the usability of `Octavian` and `TropicalGEMM`.

@chriselrod I notice that in `Octavian`, PrecompileTools is imported but not used. Do you have special concerns, like compatibility or binary size? Otherwise, I just confirmed the following speed-up of TTFX after the precompilation:

Before this change, the TTFX is 6s on my machine.
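TTFX (time to first X) is the latency of the first call to a function, which includes compilation unless that work was cached by precompilation. As a minimal sketch of how such a number could be measured, the script below times the first and second call of a freshly defined function at top level. `demo_mul` is a hypothetical stand-in, not a function from Octavian or TropicalGEMM, and this is not necessarily how the 6s figure above was obtained.

```julia
# Hypothetical stand-in for the function whose first-call latency we care about.
demo_mul(a, b) = a * b

x = rand(10, 10)
y = rand(10, 10)

# Run at top level, not inside a function, so the first call has to
# compile `demo_mul` for these argument types while being timed.
t_first = @elapsed demo_mul(x, y)

# The second call reuses the compiled code, so it measures run time only.
t_second = @elapsed demo_mul(x, y)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")

# For a package, the analogous measurement would be done in a fresh
# Julia session, e.g. `@time using Octavian` followed by timing the
# first `Octavian.matmul` call.
```

Comparing these numbers in a fresh session with and without the precompile workload gives a rough picture of what the workload buys.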
Update
The binary size is also "amazing":