JuliaGaussianProcesses / KernelFunctions.jl

Julia package for kernel functions for machine learning
https://juliagaussianprocesses.github.io/KernelFunctions.jl/stable/
MIT License
267 stars 32 forks source link

ConstantKernel Performance #432

Closed willtebbutt closed 2 years ago

willtebbutt commented 2 years ago

Summary

I found a lovely pathological Zygote-related performance problem.

Proposed changes

Optimisations for kernelmatrix for the ConstantKernel

What alternatives have you considered?

The only other option is to improve Zygote's handling of map. Unfortunately I can't see Zygote improving here in the immediate future (I certainly don't have time to look into it right now).

Breaking changes

None

Performance

Here's a demo of how awful the performance is at the minute vs this PR:

using BenchmarkTools
using KernelFunctions
using Zygote

function foo(c, x, y)
    return kernelmatrix(ConstantKernel(c=c), x, y)
end

c = 1.0
x = randn(1_000)
y = randn(1_500)

@benchmark foo($c, $x, $y)
@benchmark Zygote.pullback(foo, $c, $x, $y)

out, pb = Zygote.pullback(foo, c, x, y)
@benchmark $pb($out)

# Without PR

# Primal:
BenchmarkTools.Trial: 956 samples with 1 evaluation.
 Range (min … max):  4.362 ms …   7.830 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.087 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.222 ms ± 515.433 μs  ┊ GC (mean ± σ):  6.09% ± 9.28%

       ▂           █▆▆▁▁
  ▂▂▃▄▆██▅▅▅▃▃▃▃▄▄▄█████▅▅▄▄▄▄▃▁▂▂▂▃▃▃▄▅▇▇▆▆▅▄▄▃▃▃▃▂▂▃▃▂▁▂▁▂▂ ▃
  4.36 ms         Histogram: frequency by time        6.56 ms <

 Memory estimate: 12.89 MiB, allocs estimate: 14.

# Forwards-pass:
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 12.314 s (51.96% GC) to evaluate,
 with a memory estimate of 4.51 GiB, over 97500049 allocations.

# Reverse-pass:
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 27.904 s (11.51% GC) to evaluate,
 with a memory estimate of 2.50 GiB, over 96000081 allocations.

# With PR:

# Primal
BenchmarkTools.Trial: 1449 samples with 1 evaluation.
 Range (min … max):  1.396 ms … 11.263 ms  ┊ GC (min … max):  0.00% … 67.42%
 Time  (median):     2.212 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.437 ms ±  2.456 ms  ┊ GC (mean ± σ):  43.52% ± 33.98%

       ▅█
  ██▅▃▃███▄▃▂▂▂▂▂▁▁▁▂▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▄▆▅▅▃▃▃▂▂▂▂▂ ▃
  1.4 ms         Histogram: frequency by time        8.73 ms <

 Memory estimate: 11.44 MiB, allocs estimate: 6.

# Forwards-pass
BenchmarkTools.Trial: 1285 samples with 1 evaluation.
 Range (min … max):  1.732 ms … 10.228 ms  ┊ GC (min … max):  0.00% … 79.74%
 Time  (median):     2.147 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.880 ms ±  3.025 ms  ┊ GC (mean ± σ):  44.58% ± 32.92%

    █▆▃▁                                            ▂▃▃▂▁▁
  ▇█████▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄▄▆███████▇ █
  1.73 ms      Histogram: log(frequency) by time     9.79 ms <

 Memory estimate: 11.45 MiB, allocs estimate: 98.

# Reverse-pass
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.841 μs … 164.260 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.554 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.567 μs ±   7.206 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▃▂▂▂▃▂▂ ▂▁▁ ▁ ▁▂ ▁▁▃▄▂▁                                   ▂
  ██████████████████████████▇▇▆▅▅▅▄▅▅▅▅▅▄▄▄▁▄▅▃▁▄▄▄▁▄▅▃▁▃▄▄▅▃▅ █
  23.8 μs       Histogram: log(frequency) by time      48.6 μs <

 Memory estimate: 2.61 KiB, allocs estimate: 107.

You get similar results for the two-argument method of kernelmatrix.

theogf commented 2 years ago

Can we test the performance in some reliable way to avoid regressions?

You would need to add tests on the benchmark.jl in master first.

willtebbutt commented 2 years ago

You would need to add tests on the benchmark.jl in master first.

Do I literally just need to add the ConstantKernel to line 19 of benchmarks.jl? Also, how is it that I trigger the benchmark suite?

willtebbutt commented 2 years ago

Or are the AD-related benchmarking tests not on master yet? They don't appear to use Zygote.

edit: sorry, I thought we had AD-related performance checks somewhere.

@devmotion @theogf would either of you mind if I add them in this PR? I guess we might as well start tracking this stuff.

codecov[bot] commented 2 years ago

Codecov Report

Merging #432 (0de2c43) into master (cbbf1d4) will decrease coverage by 0.14%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #432      +/-   ##
==========================================
- Coverage   93.14%   92.99%   -0.15%     
==========================================
  Files          52       52              
  Lines        1210     1214       +4     
==========================================
+ Hits         1127     1129       +2     
- Misses         83       85       +2     
Impacted Files Coverage Δ
src/basekernels/constant.jl 100.00% <100.00%> (ø)
src/chainrules.jl 85.18% <0.00%> (-2.47%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update cbbf1d4...0de2c43. Read the comment docs.

theogf commented 2 years ago

would either of you mind if I add them in this PR? I guess we might as well start tracking this stuff.

Sure but I am not sure if it works if you change the benchmark file in the PR before in the doing it in master

willtebbutt commented 2 years ago

Ahh okay. I'll merge as-is then, and address benchmarking later.

theogf commented 2 years ago

Docs fail because of https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/blob/cbbf1d4c540909eff95170c7a591472fe5dbd2ae/examples/gaussian-process-priors/script.jl#L52

It looks one cannot add I to a FillMatrix

[EDIT:] Using Eye instead would work but I don't think that we want that...

willtebbutt commented 2 years ago

Ahh okay. I'll just go with fill for now then -- it's still a big win.

theogf commented 2 years ago

I opened an issue in FillArrays https://github.com/JuliaArrays/FillArrays.jl/issues/170

devmotion commented 2 years ago

Ah, that's an unfortunate upstream bug it seems (with an easy fix probably?). I guess we could just use a full matrix or Eye in the example.

devmotion commented 2 years ago

It seems a bit unfortunate to not exploit the performance benefits, only because it breaks this example.

willtebbutt commented 2 years ago

I agree that it's a shame, but I'm still inclined to merge when CI passes, unless there are strong objections. We can always improve once stuff is fixed in FillArrays.

theogf commented 2 years ago

I made an issue to remember.

devmotion commented 2 years ago

No strong objections, just a bit sad :cry: I'm looking forward to changing it to Fill (maybe open an issue?), it seems it would be a drastic improvement (2ms vs 37ns in the foo benchmark, and 2ms vs 5μs in Zygote.pullback) :slightly_smiling_face: