Yes that would be awesome. Somewhat like Zeros, we can have a method dispatch to get the correct columns from the hot neuron.
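Something like this sketch, for instance (onehot_mul is a hypothetical name, and the indices field is what the one-hot type stores in recent Flux versions):

using Flux

# Sketch only: multiplying a dense matrix by a one-hot matrix just selects
# columns of A, so a dedicated method could dispatch straight to indexing.
onehot_mul(A::AbstractMatrix, B::Flux.OneHotMatrix) = A[:, B.indices]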
Could you elaborate on what exactly you have in mind?
@racinmat did you submit a PR for this?
Not yet, but I plan to do it.
I think this is resolved by https://github.com/FluxML/Flux.jl/pull/1448 and https://github.com/FluxML/Flux.jl/pull/1424?
In the end it looks like it has not been resolved; here is a benchmark: https://github.com/racinmat/flux_benchmarks/blob/master/results_slurm_1.md I'll make a new PR for it.
What do the tricks refer to there?
That's the faster implementation: https://github.com/racinmat/flux_benchmarks/blob/master/0_12_7_tricks/main.jl#L6-L28, although I guess I should modify it for the PR so it would dispatch only on OneHotArray of dimension 2, right?
The problem with the linked implementations is that they are decidedly not GPU-friendly. Given that the current code path is literally a property lookup and vectorized index, I think it would be fruitful to profile that first and see where the bottlenecks are.
That's true, we would need to dispatch to a different implementation for CuArrays, right? It would definitely be fruitful to see why the current implementations are that slow.
Well, that's the thing. The current implementation should be more than fast enough since it theoretically does less work than the custom dense CPU array only version with a loop. That it doesn't warrants an investigation. Ideally, we'd like to avoid writing custom kernels for something so trivial.
Right, maybe we can find out where the bottleneck is in the current implementation instead.
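For instance, a minimal way to profile the current path (sizes here are arbitrary, just a sketch):

using Flux, Profile

x = rand(Float32, 100, 100)
y = Flux.onehotbatch(rand(1:100, 100), 1:100)

# Run the current multiplication many times under the profiler
# and look at which frames dominate.
@profile for _ in 1:10_000
    x * y
end
Profile.print()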
I might have a culprit: https://github.com/FluxML/Flux.jl/blob/master/src/onehot.jl#L229 dispatches on OneHotVector, not OneHotMatrix, and multiplication by OneHotMatrix defaults to https://github.com/JuliaLang/julia/blob/v1.6.3/stdlib/LinearAlgebra/src/matmul.jl#L151-L154
I think multiplication by onehot can be expressed as a gather operation, for which we have CUDA kernels.
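For example, a quick equivalence check (sketch only, relying on the indices field of the one-hot matrix):

using Flux, NNlib

A = rand(Float32, 4, 5)
y = Flux.onehotbatch([2, 5, 1], 1:5)    # 5×3 one-hot matrix

# A * y selects the "hot" columns of A, which is exactly what gather does
# along the last dimension.
@assert A * y == NNlib.gather(A, y.indices)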
The OneHotMatrix path uses https://github.com/FluxML/Flux.jl/blob/master/src/onehot.jl#L222, not the base fallback.
Edit: this (and I) were wrong, see post below.
I just noticed the benchmark script doesn't use interpolation for the global variables. Here are the results on my local machine:
julia> @btime $x*$y;
6.848 μs (3 allocations: 40.02 KiB)
julia> @btime fast_mul($x, $y);
3.010 μs (2 allocations: 39.14 KiB)
julia> @btime $x*$y';
738.998 μs (8 allocations: 39.47 KiB)
julia> @btime fast_mul($x, $y');
4.405 μs (2 allocations: 39.14 KiB)
Where fast_mul is just the mul implementation linked above as a standalone function.
Edit: mea culpa, here is the timing for x*y' after treating it like a wrapper type in https://github.com/FluxML/Flux.jl/blob/master/src/onehot.jl#L31-L33:
julia> @btime $x*$y';
30.439 μs (19 allocations: 42.25 KiB)
Seems fine to me. Maybe we need to make the wrapper types transparent, but that's a Julia compiler thing. We should however make sure not to hit generic mul.
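One quick way to check which method a given call resolves to (sketch, arbitrary sizes):

using Flux, InteractiveUtils

x = rand(Float32, 100, 100)
y = Flux.onehotbatch(rand(1:100, 50), 1:100)

@which x * y    # should resolve to Flux's onehot method in src/onehot.jl
@which x * y'   # if this points into LinearAlgebra's matmul.jl, we are on the generic fallback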
Best not to hack in mul with gather.
So how do we want to solve it? Should I play with the OneHotVector multiplication and extend it to multiplication by OneHotMatrix? By the way, I found out that currently we can't multiply an adjoint vector by a onehot matrix:
julia> v = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> b2 = Flux.OneHotMatrix([2, 4, 1, 3], 5)
5×4 OneHotMatrix(::Vector{Int64}) with eltype Bool:
⋅ ⋅ 1 ⋅
1 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ 1
⋅ 1 ⋅ ⋅
⋅ ⋅ ⋅ ⋅
julia> v' * b2
ERROR: LoadError: MethodError: *(::Adjoint{Int64, Vector{Int64}}, ::OneHotArray{Int64, 5, 1, 2, Vector{Int64}}) is ambiguous. Candidates:
*(A::AbstractMatrix{T} where T, B::Union{OneHotArray{var"#s157", L, N, var"N+1", I}, Base.ReshapedArray{Bool, var"N+1", var"#s1571", MI} where {var"#s1571"<:(OneHotArray{var"#s157", L, var"#s156", var"#s155", I} where {var"#s156", var"#s155"}), MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, N} where N}}} where {var"#s157", N, var"N+1", I}) where L in Flux at E:\Projects\others_code\Flux.jl\src\onehot.jl:223
*(x::Adjoint{T, var"#s832"} where {T, var"#s832"<:(AbstractVector{T} where T)}, A::AbstractMatrix{T} where T) in LinearAlgebra at C:\Users\Azathoth\AppData\Local\Programs\Julia-1.6.0\share\julia\stdlib\v1.6\LinearAlgebra\src\matmul.jl:133
Possible fix, define
*(::Adjoint{T, var"#s832"} where {T, var"#s832"<:(AbstractVector{T} where T)}, ::Union{Base.ReshapedArray{Bool, 2, var"#s1571", MI} where {var"#s157", I, var"#s1571"<:(OneHotArray{var"#s157", L, var"#s156", var"#s155", I} where {var"#s156", var"#s155"}), MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, N} where N}}, OneHotArray{var"#s157", L, N, 2, I} where {var"#s157", N, I}}) where L
Which cases are we benchmarking? I can't find the script.
Sorry, that was my attempt to modify the https://github.com/FluxML/Flux.jl/blob/master/test/onehot.jl#L28-L42 test case to test multiplication by OneHotMatrix instead of OneHotVector, because I noticed Flux covers multiplication by OneHotVector, but not by matrices.
But getting back to my benchmarks and benchmarking the original code:
julia> @btime onecold($y);
108.923 ns (1 allocation: 896 bytes)
julia> idx = onecold(y);
julia> @btime $x[:, $idx];
6.500 μs (2 allocations: 39.14 KiB)
It seems the vectorized indexing is twice as slow as the fast_mul I proposed.
And regarding $x*$y': I don't know how to implement it using just onecold without the summation, and this is almost 6x slower than the proposed version. I recognize the proposed version is GPU-unfriendly, but I don't know how to write it in a GPU-friendly way other than dispatching to some GPU-friendly version for the GPU and keeping the GPU-unfriendly one for the CPU.
Or is the speedup not worth the specialized implementation?
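One way such a split could look (sketch only; specialized_mul is a hypothetical name and fast_mul is the loop-based helper from the linked benchmark, not a Flux API):

using Flux

# Plain Arrays take the loop-based CPU fast path; every other array type
# (e.g. CuArray) stays on the existing generic method.
specialized_mul(A::AbstractMatrix, B::Flux.OneHotMatrix) = A * B          # generic / GPU path
specialized_mul(A::Matrix, B::Flux.OneHotMatrix) = fast_mul(A, B)         # CPU fast path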
But getting back to my benchmarks and benchmarking the original code:
I mean these, where are these benchmarks?
Sorry, it's these: https://github.com/racinmat/flux_benchmarks/blob/master/0_12_7/main.jl and https://github.com/racinmat/flux_benchmarks/blob/master/0_12_7_tricks/main.jl I compare more versions there, but these two are the latest Flux with and without the multiplication optimized for CPU.
And it seems the vectorized index dispatches on https://github.com/JuliaLang/julia/blob/v1.6.0/base/abstractarray.jl#L1167-L1171, so I have no idea what could be profiled or improved there.
On my laptop with this script
using Flux, CUDA, LinearAlgebra, BenchmarkTools, NNlib, NNlibCUDA
using Flux: onehotbatch
# mul0: the current Flux behaviour, i.e. whatever * currently dispatches to
function mul0(A::AbstractMatrix, B::Flux.OneHotMatrix)
    A * B
end

# mul1: hand-written loop copying the indexed columns of A
function mul1(A::AbstractMatrix, B::Flux.OneHotMatrix)
    m = size(A, 1)
    Y = similar(A, m, size(B, 2))
    for (j, ix) in enumerate(B.indices)
        for i in 1:m
            @inbounds Y[i, j] = A[i, ix]
        end
    end
    Y
end

# mul2: the same column selection expressed as a gather
function mul2(A::AbstractMatrix, B::Flux.OneHotMatrix)
    NNlib.gather(A, B.indices)
end
bs = 128;
Din = 100;
Dout = Din;
A = rand(Float32, Dout, Din);
oh = onehotbatch(rand(1:Din, bs), 1:Din);
@assert mul0(A,oh) == mul1(A,oh) == mul2(A,oh)
println("# mul0")
@btime mul0($A, $oh);
println("# mul1")
@btime mul1($A, $oh);
println("# mul2")
@btime mul2($A, $oh);
I get
# mul0
11.135 μs (3 allocations: 51.22 KiB)
# mul1
2.341 μs (2 allocations: 50.08 KiB)
# mul2
3.337 μs (2 allocations: 50.08 KiB)
Notice that your implementation (mul1) is very similar to the one for cpu gather; likely the performance difference is due to the use of @inbounds.
With those sizes I see the same timings for mul0 and mul2 on cpu and gpu, but for larger sizes you get a noticeable speedup:
bs = 512;
Din = 1000;
Dout = Din;
A = rand(Float32, Dout, Din);
oh = onehotbatch(rand(1:Din, bs), 1:Din);
@assert mul0(A,oh) == mul1(A,oh) == mul2(A,oh)
println("# mul0")
@btime mul0($A, $oh);
println("# mul1")
@btime mul1($A, $oh);
println("# mul2")
@btime mul2($A, $oh);
gA, goh = A |> gpu, oh |> gpu;
println("# gpu mul0")
@btime mul0($gA, $goh);
println("# gpu mul1")
@btime mul1($gA, $goh);
println("# gpu mul2")
@btime mul2($gA, $goh);
# mul0
404.782 μs (3 allocations: 1.96 MiB)
# mul1
138.125 μs (2 allocations: 1.95 MiB)
# mul2
144.124 μs (2 allocations: 1.95 MiB)
# gpu mul0
11.010 μs (65 allocations: 2.86 KiB)
# gpu mul1
7.243 s (3073541 allocations: 500.25 MiB)
# gpu mul2
4.128 μs (36 allocations: 1.58 KiB)
Well, mul1 isn't really meant to be run on GPUs, so testing it with GPUs is unfair, and otherwise we try to remain generic.
If the mul2 is significantly faster for both CPU and GPU, is there a good reason against adding it?
It seems quite generic, and the speedup seems quite significant to me.
I would like to have such fast multiplications in Flux.
Or is there a better place to have them?
We have a bug with the gradient of mul0 (our current implementation for onehot multiplication). With arguments on gpu it returns cpu arrays:
julia> gradient(A -> sum(mul2(A, goh)), gA)[1] |> typeof
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
julia> gradient(A -> sum(mul0(A, goh)), gA)[1] |> typeof
Matrix{Float32} (alias for Array{Float32, 2})
Also, it is very slow (on gpu)
bs = 100;
Din = 10;
Dout = Din;
A = rand(Float32, Dout, Din);
oh = onehotbatch(rand(1:Din, bs), 1:Din);
@assert mul0(A,oh) == mul1(A,oh) == mul2(A,oh)
println("# mul0")
@btime mul0($A, $oh);
println("# mul1")
@btime mul1($A, $oh);
println("# mul2")
@btime mul2($A, $oh);
gA, goh = A |> gpu, oh |> gpu;
println("# gpu mul0")
@btime mul0($gA, $goh);
println("# gpu mul1")
@btime mul1($gA, $goh);
println("# gpu mul2")
@btime mul2($gA, $goh);
grad0 = gradient(A -> sum(mul0(A, oh)), A)[1]
gradg0 = gradient(A -> sum(mul0(A, goh)), gA)[1]
@assert Array(gradg0) ≈ grad0
grad2 = gradient(A -> sum(mul2(A, oh)), A)[1]
gradg2 = gradient(A -> sum(mul2(A, goh)), gA)[1]
@assert grad2 ≈ grad0
@assert Array(gradg2) ≈ grad2
println("# grad mul0")
@btime gradient(A -> sum(mul0(A, $oh)), $A)[1]
# println("# grad mul1") # errors out since mutates
# @btime gradient(A -> sum(mul1(A, oh)), A)[1]
println("# grad mul2")
@btime gradient(A -> sum(mul2(A, $oh)), $A)[1]
println("# grad gpu mul0")
@btime gradient(A -> sum(mul0(A, $goh)), $gA)[1]
# println("# grad mul1") # errors out since mutates
# @btime gradient(A -> sum(mul1(A, oh)), A)[1]
println("# grad gpu mul2")
@btime gradient(A -> sum(mul2(A, $goh)), $gA)[1]
# mul0
1.196 μs (2 allocations: 4.94 KiB)
# mul1
623.688 ns (1 allocation: 4.06 KiB)
# mul2
1.736 μs (1 allocation: 4.06 KiB)
# gpu mul0
11.530 μs (64 allocations: 2.84 KiB)
# gpu mul1
13.958 ms (6305 allocations: 1.03 MiB)
# gpu mul2
4.422 μs (32 allocations: 1.52 KiB)
# grad mul0
20.340 μs (29 allocations: 14.73 KiB)
# grad mul2
45.649 μs (521 allocations: 43.64 KiB)
# grad gpu mul0
14.573 ms (6172 allocations: 1009.88 KiB)
# grad gpu mul2
50.587 μs (146 allocations: 7.94 KiB)
If the mul2 is significantly faster for both CPU and GPU, is there a good reason against adding it?
No, we should do it, it also fixes the gpu bug above. Would you like to file a PR?
Yes, I'll make a PR.
Yeah, this mixes the meaning of mul with that of gather. Fixing the performance of GPU adjoints should be the fix in this case.
This comment doesn't make sense. Have you seen the current definition of onehot's mul? Would you say it mixes the meaning of getindex and mul? For onehot matrices, efficient implementations are index/gather operations, and the one that is faster, correct, and supports GPU should be selected.
I agree on choosing the correct definition. And indexing and gather are different enough (one can gather over different dimensions whereas matmul is a standard) that I think it's best to keep the mul.
While the discussion thus far has been about x*y and a previous PR implemented fast paths for x'*y, the bigger gap in @racinmat's benchmarks is actually x*y'. This isn't caught by the dispatch at https://github.com/FluxML/Flux.jl/blob/master/src/onehot.jl#L222 and also can't be expressed with gather because indices are "repeated" in the transposed one-hot matrix, but it can be approximated with scatter:
julia> @btime $x*$y';
720.190 μs (8 allocations: 39.47 KiB)
julia> @btime fast_mul($x, $y');
4.242 μs (2 allocations: 39.14 KiB)
julia> @btime Flux.NNlib.scatter!(+, zeros(Float32, 100, 100), $x', $y.indices);
46.070 μs (503 allocations: 71.97 KiB)
Where fast_mul is adapted from the linked benchmark:
# Scatter-style loop: column j of A is accumulated into output column ix = indices[j].
function fast_mul(A::AbstractMatrix, B::Adjoint{Bool, <:Flux.OneHotArray})
    m = size(A, 1)
    Y = fill!(similar(A, m, size(B, 2)), zero(eltype(A)))
    for (j, ix) in enumerate(parent(B).indices)
        for i in 1:m
            @inbounds Y[i, ix] += A[i, j]
        end
    end
    Y
end
It's not altogether clear to me why fast_mul is, well, so much faster. Collecting x' into a separate variable or using (.+) instead of + doesn't seem to speed up scatter at all. Moreover, even if I remove the @inbounds from fast_mul, it's still ~4x faster:
julia> @btime fast_mul_checked($x, $y');
11.549 μs (2 allocations: 39.14 KiB)
So unless we can diagnose why scatter[!] is slower, this seems like a good candidate for inclusion into NNlib.
Ah thanks Brian! I was blanking on the exact case re gather, it was so simple in hindsight. Approximating with scatter seems like it would lead to similar corner cases. I think the best answer is to see if the current implementation (and the adjoint) is running into a generic fallback somewhere
For the current definition of scatter? Not that I can tell, unfortunately. The LLVM and native asm are different, but not substantially so, and you see many of the same paths invoked (e.g. vectorized setindex). I'm afraid I don't have a good enough sense of the performance breakdown to understand where the difference comes from.
The current definition of *, on the other hand, is 100% falling back to the generic matmul in Base. Since the one-hot array is technically sparse, I don't think we can lean on BLAS for this either. Thankfully, the hand-written implementation is pretty concise and could easily be translated into a GPU kernel as well (there is 0 conditional logic). I only wish we could merge both routines in https://github.com/racinmat/flux_benchmarks/blob/master/0_12_7_tricks/main.jl#L7-L28 somehow, but after looking at the inner loops any similarities are likely misleading.
We can immediately implement

function Base.:*(A::AbstractMatrix, B::Adjoint{Bool, <:Flux.OneHotMatrix})
    NNlib.scatter(+, A, parent(B).indices; dstsize=(size(A, 1), size(B, 2)))
end
to obtain a noticeable cpu and gpu speedup.
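As a quick sanity check that the scatter expression matches a plain dense multiplication (arbitrary sizes, sketch only):

using Flux, NNlib

A = rand(Float32, 3, 5)
y = Flux.onehotbatch([2, 1, 4, 2, 3], 1:4)    # 4×5 one-hot matrix, so y' is 5×4

# scatter adds column k of A into output column y.indices[k],
# which is what A * y' computes.
@assert A * collect(y') ≈ NNlib.scatter(+, A, y.indices; dstsize=(size(A, 1), size(y, 1)))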
Since the mul1 and fast_mul implementations here are essentially implementations of gather and scatter with very tiny differences from the ones we have in NNlib, the fact that they are a bit faster means we have room for improving the NNlib.gather and NNlib.scatter cpu implementations. So let's also add the scatter version for the onehot adjoint mul and bring the performance discussion to NNlib so that the benefits can be much more widespread.
Multiplication of OneHotMatrix by dense layer could be more optimized, e.g. by
Should I make a PR for this?