Open FelixBenning opened 1 year ago
Patch coverage: 100.00% and project coverage change: -3.71% :warning:

Comparison is base (ef6d459) 94.16% compared to head (387837f) 90.45%.
I have no idea why the tests would fail - something about constant allocation heuristics in Zygote?
> I have no idea why the tests would fail - something about constant allocation heuristics in Zygote?
Is it clear whether they're happening in areas where the modifications introduced here cause problems, or do they appear completely spurious?
They definitely appear completely spurious. The area seems completely unrelated, but to be honest I also don't understand the code in that area.
I think similar definitions should be added in this PR for `RowVecs` for consistency.
If you `hcat` `RowVecs` you would expect something else:

```julia
A = [a11 ... a1n;
     ...
     am1 ... amn]

RowVecs(A) == [[a11, ..., a1n], ..., [am1, ..., amn]]
```
So

```julia
reduce(hcat, RowVecs(A)) == [a11 ... am1;
                             ...
                             a1n ... amn]  # == A'
```
But to produce `A'` you have to transpose, i.e. move all of the memory around, which is probably already equivalent in cost to the existing `reduce(hcat, ::AbstractVector)` fallback. That is why I did not add it: I don't know whether it would actually result in a performance benefit, and `RowVecs` definitely goes against the grain of Julia's column-major matrices. But if you believe this is worth the additional lines of code, I can add these definitions for `RowVecs` :)
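To make the round-trip claim above concrete, here is a tiny self-contained check using plain vectors (no KernelFunctions types, just hand-built row and column views of `A`):

```julia
A = [1 2 3;
     4 5 6]

# The "rows as vectors" view of A, written out by hand:
rows = [A[i, :] for i in axes(A, 1)]
@assert rows == [[1, 2, 3], [4, 5, 6]]

# hcat'ing the rows yields the transpose of A...
@assert reduce(hcat, rows) == A'

# ...while hcat'ing the columns round-trips to A itself.
cols = [A[:, j] for j in axes(A, 2)]
@assert reduce(hcat, cols) == A
```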
> But to produce `A'` you have to transpose, i.e. move all of the memory around, which is probably already equivalent in cost to the existing `reduce(hcat, ::AbstractVector)` fallback.
It's defined here: https://github.com/JuliaLang/julia/blob/310f59019856749fb85bc56a1e3c2e0592a134ad/base/abstractarray.jl#L1638-L1701
Benchmarking would be nice, but I would suspect that

```julia
reduce(::typeof(hcat), x::RowVecs) = permutedims(x.X)
```

would be easier for the compiler. Similarly you could implement e.g.

```julia
reduce(::typeof(vcat), x::RowVecs) = copy(vec(PermutedDimsArray(x.X, (2, 1))))
```

in a way that does not compute any actual transposes but only uses views.
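For illustration, here is a self-contained sketch of these two specializations using a minimal `MyRowVecs` stand-in (the struct and its field `X` only mimic KernelFunctions' `RowVecs`; this is not the actual PR code):

```julia
# Minimal stand-in for KernelFunctions.RowVecs: wraps a matrix X and
# presents its rows as a vector of vectors.
struct MyRowVecs{T,M<:AbstractMatrix{T}} <: AbstractVector{Vector{T}}
    X::M
end
Base.size(x::MyRowVecs) = (size(x.X, 1),)
Base.getindex(x::MyRowVecs, i::Int) = x.X[i, :]

# The two suggested specializations:
Base.reduce(::typeof(hcat), x::MyRowVecs) = permutedims(x.X)
Base.reduce(::typeof(vcat), x::MyRowVecs) = copy(vec(PermutedDimsArray(x.X, (2, 1))))

A = [1 2 3; 4 5 6]
@assert reduce(hcat, MyRowVecs(A)) == A'                  # columns are the rows of A
@assert reduce(vcat, MyRowVecs(A)) == [1, 2, 3, 4, 5, 6]  # rows concatenated
```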
> `RowVecs` definitely goes against the grain of Julia's column-major matrices.

Many stats packages prefer this format, though, I assume at least partly due to consistency with R, Python, and dataframes/datasets in general. Often it also doesn't matter which input format users choose - internally you can always use the most performant algorithm.
> They definitely appear completely spurious.
Just from looking at the failing tests and logs, I would assume that the test failures are real: they compare the allocations in the pullbacks of transforms that IMO are likely to involve `reduce` for arguments of type `ColVecs` and `RowVecs`. So optimizing `ColVecs` but not `RowVecs` could likely explain the reduced allocations for `ColVecs` (whereas allocations for `RowVecs` are not affected).

This could be another argument for optimizing `RowVecs` as well. But I assume that even optimized `RowVecs` code might still lead to more allocations than `ColVecs`.
How do you guys do benchmarks locally? If I want to use `BenchmarkTools`, I need to add the package (but I don't want to add it to KernelFunctions.jl). If I `] activate test`, I don't have access to KernelFunctions anymore. `] test` works, but it seems to do something special... I could add BenchmarkTools and just not commit that change, but that seems contrived.
I often use temporary environments. Changing Project.toml and/or Manifest.toml for benchmarking is not a good idea IMO.
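For the record, one way to set up such a throwaway environment (a sketch; it assumes the REPL is started inside the local KernelFunctions.jl checkout):

```julia
using Pkg
Pkg.activate(temp=true)    # fresh temporary environment, discarded afterwards
Pkg.develop(path=".")      # use the local KernelFunctions.jl checkout
Pkg.add("BenchmarkTools")  # available here without touching the package's Project.toml

using KernelFunctions, BenchmarkTools
```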
```
julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector}, hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.375 μs … 10.470 ms   ┊ GC (min … max):  0.00% … 99.78%
 Time  (median):     18.125 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   29.337 μs ± 269.440 μs  ┊ GC (mean ± σ):  33.74% ±  3.89%

 [histogram omitted]
  10.4 μs         Histogram: frequency by time         38.9 μs <

 Memory estimate: 86.12 KiB, allocs estimate: 198.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), ColVecs}, hcat, ColVecs($D))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.208 ns … 23.917 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.375 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.388 ns ±  0.593 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
  2.21 ns         Histogram: frequency by time         2.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), AbstractVector}, hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):   7.073 μs …  1.815 ms  ┊ GC (min … max):  0.00% … 95.50%
 Time  (median):     11.677 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   16.796 μs ± 73.342 μs  ┊ GC (mean ± σ):  27.95% ±  6.60%

 [histogram omitted]
  7.07 μs         Histogram: frequency by time         31.9 μs <

 Memory estimate: 57.86 KiB, allocs estimate: 128.

julia> @benchmark invoke(reduce, Tuple{typeof(hcat), RowVecs}, hcat, RowVecs($D))
BenchmarkTools.Trial: 10000 samples with 247 evaluations.
 Range (min … max):  302.126 ns … 121.268 μs  ┊ GC (min … max):  0.00% … 99.45%
 Time  (median):     679.486 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.117 μs ±   2.958 μs  ┊ GC (mean ± σ):  34.72% ± 14.37%

 [histogram omitted]
  302 ns          Histogram: frequency by time         16.1 μs <

 Memory estimate: 4.81 KiB, allocs estimate: 1.
```
Interestingly, it does help to implement the row version, but it is still slower than the col version, which just does nothing.
Well, that's exactly what I expected, so I'm not surprised. (BTW I would directly benchmark the functions of interest to avoid measuring `invoke` as well.)
To measure the performance of the fallback I would need to use `invoke`, no? And to make it fair, it should be used everywhere. That was my reasoning, at least.
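As a standalone illustration of why `invoke` picks up the fallback: it forces dispatch to the method matching the given signature instead of the most specific one.

```julia
f(x::AbstractVector) = "generic fallback"
f(x::Vector) = "specialized"

v = [1, 2, 3]
@assert f(v) == "specialized"                                      # normal dispatch
@assert invoke(f, Tuple{AbstractVector}, v) == "generic fallback"  # forced fallback
```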
You could just run the same benchmark on the master branch and on the PR. That's also safer insofar as you don't have to reason about which methods are called - maybe there's some specialization for `AbstractVector{<:Vector}` somewhere else that is used on the master branch instead of the method you benchmarked?
So, if I understand correctly, this is where the number of allocations is counted:

```
kernelmatrix (binary): Test Failed at /home/runner/work/KernelFunctions.jl/KernelFunctions.jl/test/test_utils.jl:387
  Expression: pb[1] == pb[2]
   Evaluated: 504 == 503
```

and since `ColVecs` is more efficient now, it has one allocation less? I am not sure how to fix those tests, though, to be honest.
> You could just run the same benchmark on the master branch and on the PR.
New:

```
julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.125 ns … 81.958 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.292 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.343 ns ±  0.997 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
  2.12 ns         Histogram: frequency by time         2.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 932 evaluations.
 Range (min … max):  125.850 ns …   5.302 μs  ┊ GC (min … max):  0.00% … 96.04%
 Time  (median):     142.212 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   201.706 ns ± 243.362 ns  ┊ GC (mean ± σ):  19.27% ± 14.85%

 [histogram omitted]
  126 ns        Histogram: log(frequency) by time        1.41 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.
```
vs. old:

```
julia> @benchmark reduce(hcat, ColVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 258 evaluations.
 Range (min … max):  296.349 ns …   5.797 μs  ┊ GC (min … max): 0.00% … 92.52%
 Time  (median):     315.244 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   333.317 ns ± 142.009 ns  ┊ GC (mean ± σ):  3.90% ±  7.70%

 [histogram omitted]
  296 ns        Histogram: log(frequency) by time        1.5 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.

julia> @benchmark reduce(hcat, RowVecs(data)) setup=(data=rand(10,20))
BenchmarkTools.Trial: 10000 samples with 343 evaluations.
 Range (min … max):  259.233 ns …   5.354 μs  ┊ GC (min … max): 0.00% … 92.93%
 Time  (median):     276.239 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   297.589 ns ± 161.529 ns  ┊ GC (mean ± σ):  5.91% ±  9.18%

 [histogram omitted]
  259 ns        Histogram: log(frequency) by time        1.58 μs <

 Memory estimate: 1.77 KiB, allocs estimate: 1.
```
Summary

The idea is that `reduce(hcat, ColVecs(X))` should in principle be a no-op.

Proposed changes

Specialization of `reduce` on `typeof(hcat)` and `typeof(vcat)`. The drawback is more code, but I added test coverage and it was something very easy to do 🤷. Whether you want this or not is up to you.
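To show why the `ColVecs` direction can even be a no-op, here is a self-contained sketch with a minimal `MyColVecs` stand-in (mimicking, not reproducing, the PR's code): hcat of the columns of `X` is `X` itself, so the specialization can return the wrapped matrix directly.

```julia
# Minimal stand-in for KernelFunctions.ColVecs: wraps a matrix X and
# presents its columns as a vector of vectors.
struct MyColVecs{T,M<:AbstractMatrix{T}} <: AbstractVector{Vector{T}}
    X::M
end
Base.size(x::MyColVecs) = (size(x.X, 2),)
Base.getindex(x::MyColVecs, i::Int) = x.X[:, i]

# The no-op specialization: no copying, no allocation.
Base.reduce(::typeof(hcat), x::MyColVecs) = x.X

X = rand(3, 5)
@assert reduce(hcat, MyColVecs(X)) === X   # identical object, zero allocations
```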