JuliaGaussianProcesses / KernelFunctions.jl

Julia package for kernel functions for machine learning
https://juliagaussianprocesses.github.io/KernelFunctions.jl/stable/
MIT License

Improve Performance of `reduce(hcat, ::ColVecs)` and `reduce(vcat, ::ColVecs)` by specialization #510

Open · FelixBenning opened 1 year ago

FelixBenning commented 1 year ago

Summary

The idea is that `reduce(hcat, ColVecs(X))` should in principle be a no-op.

Proposed changes

Specialization of `reduce` on `typeof(hcat)` and `typeof(vcat)`.
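A minimal sketch of what such specializations could look like (the exact definitions in this PR may differ; `X` is the matrix field wrapped by `ColVecs`):

```julia
# Sketch only: ColVecs stores the data column-wise in the matrix field `X`, so
# hcat-ing its elements just recovers that matrix, and vcat-ing them is its
# column-major flattening. Note that both return the underlying storage (or a
# reshape of it) rather than a fresh copy.
Base.reduce(::typeof(hcat), x::ColVecs) = x.X
Base.reduce(::typeof(vcat), x::ColVecs) = vec(x.X)
```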

The drawback is more code. But I added test coverage and it was something very easy to do 🤷. Whether you want this or not is up to you.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: -3.71% :warning:

Comparison is base (ef6d459) 94.16% compared to head (387837f) 90.45%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
- Coverage   94.16%   90.45%    -3.71%
==========================================
  Files          52       52
  Lines        1387     1393        +6
==========================================
- Hits         1306     1260       -46
- Misses         81      133       +52
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/utils.jl | `92.04% <100.00%> (+0.58%)` | :arrow_up: |

... and 8 files with indirect coverage changes


FelixBenning commented 1 year ago

I have no idea why the tests would fail - something about constant allocation heuristics in Zygote?

willtebbutt commented 1 year ago

> I have no idea why the tests would fail - something about constant allocation heuristics in Zygote?

Is it clear whether they're happening in areas where the modifications introduced here cause problems, or do they appear completely spurious?

FelixBenning commented 1 year ago

They definitely appear completely spurious. The area seems completely unrelated. But I also don't understand the code in that area, to be honest.

FelixBenning commented 1 year ago

> I think similar definitions should be added in this PR for RowVecs for consistency.

If you `hcat` a `RowVecs`, you would expect something else:

A = [ a11 ... a1n;
      ...
      am1 ... amn]
RowVecs(A) == [ [a11, ..., a1n], ..., [am1, ..., amn] ]

So

reduce(hcat, RowVecs(A)) == [
    a11 ... am1
    ...
    a1n ... amn
] # == A'
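(A quick hypothetical check of these identities, not code from the PR; the example matrix is arbitrary:)

```julia
using KernelFunctions, Test

A = [1 2 3; 4 5 6]                                # 2×3 example matrix
@test reduce(hcat, ColVecs(A)) == A               # columns side by side: a no-op in principle
@test reduce(hcat, RowVecs(A)) == permutedims(A)  # rows become columns, i.e. A'
```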

But materializing A' means moving all of the memory around, which is probably already about as expensive as the existing reduce(hcat, ::AbstractVector) fallback.

That is why I did not add it (I don't know whether it would actually result in a performance benefit; RowVecs definitely goes against the grain of Julia's column-major matrices).

But if you believe this is worth the additional lines of code, I can add these definitions for RowVecs :)

devmotion commented 1 year ago

> But materializing A' means moving all of the memory around, which is probably already about as expensive as the existing reduce(hcat, ::AbstractVector) fallback.

It's defined here: https://github.com/JuliaLang/julia/blob/310f59019856749fb85bc56a1e3c2e0592a134ad/base/abstractarray.jl#L1638-L1701. Benchmarking would be nice, but I would suspect that `reduce(::typeof(hcat), x::RowVecs) = permutedims(x.X)` would be easier for the compiler. Similarly, you could implement e.g. `reduce(::typeof(vcat), x::RowVecs) = copy(vec(PermutedDimsArray(x.X, (2, 1))))` in a way that does not compute any actual transpose but only uses views.
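Collected as a runnable sketch (the suggestions above written out as method definitions; the field name `X` is taken from the ColVecs/RowVecs wrappers, and nothing here is code from the PR):

```julia
# RowVecs stores observations as the rows of `X`, so hcat-ing its elements
# yields the transpose of `X`. `permutedims` materializes that copy directly,
# while the vcat variant flattens a lazy PermutedDimsArray view and only
# copies once into the final vector.
Base.reduce(::typeof(hcat), x::RowVecs) = permutedims(x.X)
Base.reduce(::typeof(vcat), x::RowVecs) = copy(vec(PermutedDimsArray(x.X, (2, 1))))
```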

> RowVecs definitely goes against the grain of Julia's column-major matrices.

Many stats packages prefer this format though, I assume at least partly due to consistency with R, Python, and dataframes/datasets in general. Often it also doesn't matter what input format users use: internally you can always use the most performant algorithm.

devmotion commented 1 year ago

> They definitely appear completely spurious.

Just from looking at the failing tests and logs, I would assume that the test failures are real: They compare the allocations in the pullbacks of transforms that IMO are likely to involve reduce for arguments of type ColVecs and RowVecs. So optimizing ColVecs but not RowVecs could likely explain the reduced allocation for ColVecs (whereas allocations for RowVecs are not affected).

This could be another argument for optimizing RowVecs as well. But I assume that even optimized RowVecs code might still lead to more allocations than ColVecs.

FelixBenning commented 1 year ago

How do you guys do benchmarks locally? If I want to use BenchmarkTools I need to add the package (but I don't want to add it to KernelFunctions.jl). If I `] activate test` I don't have access to KernelFunctions anymore. `] test` works, but it seems to do something special... I could add BenchmarkTools, I guess, and just not commit that change, but that seems contrived.

devmotion commented 1 year ago

I often use temporary environments. Changing Project.toml and/or Manifest.toml for benchmarking is not a good idea IMO.
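For example, something along these lines (a sketch of a throwaway setup; prompts and paths are placeholders and will vary):

```julia
# From the Julia REPL, press `]` for the pkg prompt:
pkg> activate --temp                    # fresh, disposable environment
pkg> dev /path/to/KernelFunctions.jl    # use the local checkout (placeholder path)
pkg> add BenchmarkTools

julia> using KernelFunctions, BenchmarkTools
```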

FelixBenning commented 1 year ago
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector},  hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.375 μs …  10.470 ms  ┊ GC (min … max):  0.00% … 99.78%
 Time  (median):     18.125 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   29.337 μs ± 269.440 μs  ┊ GC (mean ± σ):  33.74% ±  3.89%

    ▁       ▂▄▅▄▆▆▇█▆▆▄▂▁
  ▃▆███▇▆▆▇████████████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▂▂ ▄
  10.4 μs         Histogram: frequency by time         38.9 μs <

 Memory estimate: 86.12 KiB, allocs estimate: 198.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), ColVecs},  hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.208 ns … 23.917 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.375 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.388 ns ±  0.593 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                   █
  ▂▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁█▅▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂ ▂
  2.21 ns        Histogram: frequency by time         2.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector},  hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):   7.073 μs …  1.815 ms  ┊ GC (min … max):  0.00% … 95.50%
 Time  (median):     11.677 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   16.796 μs ± 73.342 μs  ┊ GC (mean ± σ):  27.95% ±  6.60%

   ▁     ▄▄▁▂▅█▅▂
  ▆██▅▄▄██████████▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
  7.07 μs         Histogram: frequency by time        31.9 μs <

 Memory estimate: 57.86 KiB, allocs estimate: 128.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), RowVecs},  hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 247 evaluations.
 Range (min … max):  302.126 ns … 121.268 μs  ┊ GC (min … max):  0.00% … 99.45%
 Time  (median):     679.486 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.117 μs ±   2.958 μs  ┊ GC (mean ± σ):  34.72% ± 14.37%

   █
  ▅█▆▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▁▂▂▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  302 ns           Histogram: frequency by time         16.1 μs <

 Memory estimate: 4.81 KiB, allocs estimate: 1.

Interestingly, it does help to implement the RowVecs version, but it is still slower than the ColVecs version, which just does nothing.

devmotion commented 1 year ago

Well, that's exactly what I expected, so I'm not surprised. (BTW I would directly benchmark the functions of interest to avoid measuring invoke as well.)

FelixBenning commented 1 year ago

To measure the performance of the fallback I would need to use `invoke`, no? And to make it fair, both measurements should use it. That was my reasoning, at least.

devmotion commented 1 year ago

You could just run the same benchmark on the master branch and on the PR.

devmotion commented 1 year ago

That's also safer insofar as you don't have to reason about which methods are called: maybe there's some specialization for AbstractVector{<:Vector} somewhere else that is used on the master branch instead of the method you benchmarked?

FelixBenning commented 1 year ago

So if I understand correctly, this is where the number of allocations is counted:

      kernelmatrix (binary): Test Failed at /home/runner/work/KernelFunctions.jl/KernelFunctions.jl/test/test_utils.jl:387
  Expression: pb[1] == pb[2]
   Evaluated: 504 == 503

and since ColVecs is more efficient now, it has one allocation less? I am not sure how to fix those tests, though, to be honest.
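One hypothetical way to look at this kind of asymmetry locally (this is not the actual test_utils.jl code; the kernel, the sizes, and the use of `Base.@allocations` (Julia 1.9+) are illustrative assumptions):

```julia
using KernelFunctions, Zygote

k = SqExponentialKernel()
X = rand(5, 50)

# Pull back through kernelmatrix for both data wrappers.
_, pb_col = Zygote.pullback(x -> sum(kernelmatrix(k, ColVecs(x))), X)
_, pb_row = Zygote.pullback(x -> sum(kernelmatrix(k, RowVecs(x))), X)

pb_col(1.0); pb_row(1.0)                    # warm up so compilation is not counted

n_col = Base.@allocations pb_col(1.0)       # allocation count of the ColVecs pullback
n_row = Base.@allocations pb_row(1.0)       # allocation count of the RowVecs pullback
n_col, n_row
```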

FelixBenning commented 1 year ago

> You could just run the same benchmark on the master branch and on the PR.

New (this PR):

julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.125 ns … 81.958 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.292 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.343 ns ±  0.997 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▃     █      ▁     ▁
  ▂▁▁▁▁▁▂▁▁▁▁▁▃▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▂ ▂
  2.12 ns        Histogram: frequency by time         2.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 932 evaluations.
 Range (min … max):  125.850 ns …   5.302 μs  ┊ GC (min … max):  0.00% … 96.04%
 Time  (median):     142.212 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   201.706 ns ± 243.362 ns  ┊ GC (mean ± σ):  19.27% ± 14.85%

  █▅▃▁▄▄▃ ▁                                                     ▁
  ██████████▆▅▅▅▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆▇████ █
  126 ns        Histogram: log(frequency) by time       1.41 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.

vs. old (master):

julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 258 evaluations.
 Range (min … max):  296.349 ns …   5.797 μs  ┊ GC (min … max): 0.00% … 92.52%
 Time  (median):     315.244 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   333.317 ns ± 142.009 ns  ┊ GC (mean ± σ):  3.90% ±  7.70%

  █▇▅▁                                                          ▁
  ████▇██▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃ █
  296 ns        Histogram: log(frequency) by time        1.5 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.

julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 343 evaluations.
 Range (min … max):  259.233 ns …   5.354 μs  ┊ GC (min … max): 0.00% … 92.93%
 Time  (median):     276.239 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   297.589 ns ± 161.529 ns  ┊ GC (mean ± σ):  5.91% ±  9.18%

  █▅▁                                                           ▁
  ███▇▆▅▇█▆▅▃▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▆▆ █
  259 ns        Histogram: log(frequency) by time       1.58 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.