XtractOpen / Meganet.jl

A fresh approach to deep learning written in Julia
http://www.xtract.ai/

Preallocation in activations #25

Open lruthotto opened 6 years ago

lruthotto commented 6 years ago

Maybe we should write two versions of the activation functions (and other element-wise operations): one that allocates and one that operates in place. See the following example that Eran and I have put together.

Y = randn(10*768,2*512);
dY = zeros(size(Y));

# In-place version: overwrites A with tanh(A) and dA with its derivative.
function myTanhActivation!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    for k=1:length(A)
        @inbounds A[k] = tanh(A[k])
        @inbounds dA[k] = 1-A[k]^2
    end
    return A,dA
end

# Allocating version: copies the inputs, then calls the in-place method.
function myTanhActivation(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    return myTanhActivation!(copy(A),copy(dA),doDerivative)
end

function myReluActivation!(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    for k=1:length(Y)
        @inbounds Y[k] = max(Y[k],zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y,dA
end

function myReluActivation(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    return myReluActivation!(copy(Y),copy(dA),doDerivative)
end

# Warm-up calls so compilation time is not included in the timings below.
X = copy(Y);
t1 = myTanhActivation!(X,X,true);
t2 = myTanhActivation(X,X,true);
t1 = myReluActivation!(X,X,true);
t2 = myReluActivation(X,X,true);
t1 = [];
t2 = [];
GC.gc();

@time for k=1:10; t2 = myTanhActivation(Y,dY,true); end
@time for k=1:10; t2 = myTanhActivation!(Y,dY,true); end
@time for k=1:10; t2 = myReluActivation(Y,dY,true); end
@time for k=1:10; t2 = myReluActivation!(Y,dY,true); end
klensink commented 6 years ago

I really like this; there are lots of places where we can do things in place and save some GC time.

This, and the sparse kernels, are also a really good place to play around with multithreading the for loops. I'll try to whip up an example before the call today, but I tested it quickly yesterday and got great results.

klensink commented 6 years ago

Multithreading example:

using BenchmarkTools

Y = randn(10*768,2*512);
dY = zeros(size(Y));

function myTanhActivation_mt!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    Threads.@threads for k=1:length(A)
        @inbounds A[k] = tanh(A[k])
        @inbounds dA[k] = 1-A[k]^2
    end  
    return A,dA 
end

function myTanhActivation!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    for k=1:length(A)
        @inbounds A[k] = tanh(A[k])
        @inbounds dA[k] = 1-A[k]^2
    end  
    return A,dA 
end

function myTanhActivation(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    return myTanhActivation!(copy(A),copy(dA),doDerivative)
end

function myReluActivation!(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    for k=1:length(Y)
        @inbounds Y[k] = max(Y[k],zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end  
    return Y,dA 
end

function myReluActivation_mt!(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    Threads.@threads for k=1:length(Y)
        @inbounds Y[k] = max(Y[k],zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end  
    return Y,dA 
end

function myReluActivation(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    return myReluActivation!(copy(Y),copy(dA),doDerivative)
end

t1 = @benchmark myTanhActivation!($Y, $dY, true) 
t2 = @benchmark myTanhActivation_mt!($Y, $dY, true) 

r1 = @benchmark myReluActivation!($Y, $dY, true) 
r2 = @benchmark myReluActivation_mt!($Y, $dY, true) 

println("--- Tanh ---")
display(judge(median(t2), median(t1)))
println("--- RELU ---")
display(judge(median(r2), median(r1)))

Before starting Julia you need to give it more threads, e.g. export JULIA_NUM_THREADS=4 on macOS/Linux.
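To confirm the extra threads were actually picked up, you can check Threads.nthreads() in the REPL (a quick sanity check, not part of the benchmark above):

julia> Threads.nthreads()    # should match whatever JULIA_NUM_THREADS was set to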

--- Tanh ---
BenchmarkTools.TrialJudgement: 
  time:   -76.94% => improvement (5.00% tolerance)
  memory: +150.00% => regression (1.00% tolerance)
--- RELU ---
BenchmarkTools.TrialJudgement: 
  time:   -55.77% => improvement (5.00% tolerance)
  memory: +150.00% => regression (1.00% tolerance)

julia> t1
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     143.728 ms (0.00% GC)
  median time:      175.690 ms (0.00% GC)
  mean time:        163.612 ms (0.00% GC)
  maximum time:     178.066 ms (0.00% GC)
  --------------
  samples:          31
  evals/sample:     1

julia> t2
BenchmarkTools.Trial: 
  memory estimate:  80 bytes
  allocs estimate:  2
  --------------
  minimum time:     40.205 ms (0.00% GC)
  median time:      40.517 ms (0.00% GC)
  mean time:        40.570 ms (0.00% GC)
  maximum time:     43.109 ms (0.00% GC)
  --------------
  samples:          124
  evals/sample:     1
lruthotto commented 6 years ago

Sorry, I only saw this now. It looks really promising, I agree!

Question: how stable are the threads? I remember this being an experimental feature. As long as it works in a stable way for more complicated functions too, I'm fine using it.

klensink commented 6 years ago

I have the same question/concern as you. I've only used it for simple loops before, so we will need to do some testing to see how it handles more complicated ones.

It also seems to be a little finicky about hardware: I've had no problems on Macs, but I ran into some issues yesterday on my Ubuntu machine.

lruthotto commented 6 years ago

Another question is how Threads performs when it's used at different levels. Say, threading over all the examples in the forward propagation and then also using threading in the activation might cause problems, right?

klensink commented 6 years ago

Yes, you need to be careful how you nest them. If you end up using more threads than you have, they block each other and the program grinds to a halt.

The same goes for BLAS calls inside a threaded loop: the number of BLAS threads times the number of Julia threads shouldn't exceed the number of threads the machine actually has.
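For reference, a minimal sketch of the kind of guard I mean (assuming the standard LinearAlgebra/BLAS interface; Sys.CPU_THREADS is the name on recent Julia versions):

using LinearAlgebra

# Cap BLAS threads so that (Julia threads) * (BLAS threads) does not
# exceed the number of hardware threads on the machine.
nhw    = Sys.CPU_THREADS
njulia = Threads.nthreads()
BLAS.set_num_threads(max(1, nhw ÷ njulia))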

klensink commented 6 years ago

On a similar note, couldn't we also create in-place apply functions for the normalization layers? I tried adding this and the derivative tests failed, so I assume it isn't always going to be possible. But in some places, like the example below, it seems like it shouldn't be an issue because the data is being overwritten anyway.

https://github.com/XtractOpen/Meganet.jl/blob/master/src/layers/singleLayer.jl#L37

eldadHaber commented 6 years ago

Maybe generate apply!

lruthotto commented 6 years ago

I think an in-place apply function would be good to have for other Meganet elements as well, in particular those that do not change the dimension of the input features.

Why don't we rename the current method to apply!, with the understanding that it overwrites the input features? Then let's also add a small wrapper apply that only copies the input features and then calls apply!.

I'm not sure if this works, but what we could also do is have apply! accept an output argument that gets overwritten with the output features. This is similar to A_mul_B! in Julia. That version would also allow pre-allocation for Meganet elements that change the dimension of the input features. Maybe this is a better option?
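Something along these lines, as a rough sketch (DummyLayer and the tanh body are just placeholders, not the actual Meganet layer types):

# In-place version: overwrites the input features Y.
struct DummyLayer end

function apply!(L::DummyLayer, Y::Array{T}, doDerivative::Bool=false) where {T<:Real}
    Y .= tanh.(Y)    # placeholder for the real layer computation
    return Y
end

# Allocating wrapper: copy the input features, then call apply!.
apply(L::DummyLayer, Y::Array{T}, doDerivative::Bool=false) where {T<:Real} =
    apply!(L, copy(Y), doDerivative)

# A_mul_B!-style variant: write into a preallocated output array Yout,
# which also works when the layer changes the feature dimension.
function apply!(L::DummyLayer, Yout::Array{T}, Yin::Array{T}, doDerivative::Bool=false) where {T<:Real}
    Yout .= tanh.(Yin)
    return Yout
end

That way apply keeps the current non-mutating behaviour, while the forward propagation can call apply! directly and reuse its buffers.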

erantreister commented 6 years ago

Hi Guys,

I wasn't aware of the conversations here, but here is a suggestion for you to check:

function reluActivation!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    A .= max.(A,zero(T));
    if doDerivative
        dA .= sign.(A);
    else
        dA = zeros(T,0)
    end
    return A,dA
end

The .= operator does the assignment in place (no allocation), and the fused broadcast is vectorized internally. I think this is the best way to go - I've had a really good experience with it so far. I only found out about it recently.
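For completeness, the tanh activation in the same style would look like this (just a sketch following the pattern above):

function tanhActivation!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    A .= tanh.(A);
    if doDerivative
        dA .= one(T) .- A.^2;
    else
        dA = zeros(T,0)
    end
    return A,dA
end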

If you wish to work with Threads, that's an option, but in my experience they are far less efficient than OpenMP threads in C. I still think Julia workers are the best way to go (maybe not use all available workers, and leave some for the automatic internal multithreading in BLAS and the like). I can explain this further on Skype if you wish.