lruthotto opened this issue 6 years ago
I really like this; there are lots of places where we can do things in place and save some GC time.
This, and the sparse kernels, are also a really good place to play around with multithreading the for loops. I'll try to whip up an example before the call today, but I tested it quickly yesterday and got great results.
Multithreading example:
using BenchmarkTools
Y = randn(10*768,2*512);
dY = zeros(size(Y));
function myTanhActivation_mt!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    Threads.@threads for k=1:length(A)
        @inbounds A[k] = tanh(A[k])
        @inbounds dA[k] = 1-A[k]^2
    end
    return A,dA
end
function myTanhActivation!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    for k=1:length(A)
        @inbounds A[k] = tanh(A[k])
        @inbounds dA[k] = 1-A[k]^2
    end
    return A,dA
end
function myTanhActivation(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T<:Real}
    return myTanhActivation!(copy(A),copy(dA),doDerivative)
end
function myReluActivation!(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    for k=1:length(Y)
        @inbounds Y[k] = max(Y[k],0.0);
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y,dA
end
function myReluActivation_mt!(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    Threads.@threads for k=1:length(Y)
        @inbounds Y[k] = max(Y[k],0.0);
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y,dA
end
function myReluActivation(Y::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    return myReluActivation!(copy(Y),copy(dA),doDerivative)
end
t1 = @benchmark myTanhActivation!($Y, $dY, true)
t2 = @benchmark myTanhActivation_mt!($Y, $dY, true)
r1 = @benchmark myReluActivation!($Y, $dY, true)
r2 = @benchmark myReluActivation_mt!($Y, $dY, true)
println("--- Tanh ---")
display(judge(median(t2), median(t1)))
println("--- RELU ---")
display(judge(median(r2), median(r1)))
Before starting Julia you need to give it more threads, e.g. export JULIA_NUM_THREADS=4 on macOS/Linux.
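As a quick sanity check, Threads.nthreads() (in Base, nothing extra needed) reports how many threads the @threads loops above will actually use:
# run inside Julia after starting it with JULIA_NUM_THREADS=4
Threads.nthreads()   # should return 4; if it returns 1, the environment variable wasn't picked up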
--- Tanh ---
BenchmarkTools.TrialJudgement:
time: -76.94% => improvement (5.00% tolerance)
memory: +150.00% => regression (1.00% tolerance)
--- RELU ---
BenchmarkTools.TrialJudgement:
time: -55.77% => improvement (5.00% tolerance)
memory: +150.00% => regression (1.00% tolerance)
julia> t1
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 143.728 ms (0.00% GC)
median time: 175.690 ms (0.00% GC)
mean time: 163.612 ms (0.00% GC)
maximum time: 178.066 ms (0.00% GC)
--------------
samples: 31
evals/sample: 1
julia> t2
BenchmarkTools.Trial:
memory estimate: 80 bytes
allocs estimate: 2
--------------
minimum time: 40.205 ms (0.00% GC)
median time: 40.517 ms (0.00% GC)
mean time: 40.570 ms (0.00% GC)
maximum time: 43.109 ms (0.00% GC)
--------------
samples: 124
evals/sample: 1
Sorry, I only saw this now. It looks really promising, I agree!
Question: How stable are the threads? I remember this being an experimental feature. As long as it also works reliably for more complicated functions, I'm fine with using it.
I have the same question/concern as you. I've only used it before for simple loops, so we will need to do some testing to see how it handles more complicated loops.
It also seems to be a little finicky about your hardware. I've had no problems on Macs, but I had some problems yesterday on my Ubuntu machine.
Another question is how Threads performs when it's being used at different layers. Say, threading over all the examples in the forward propagation and then also using threading in the activation might cause problems, right?
Yes, you need to be careful how you layer them. If you end up using more threads than you have, they block each other and the program grinds to a halt.
The same goes for BLAS calls inside the loop: you need to make sure that the number of BLAS threads times the number of Julia threads doesn't exceed how many hardware threads you have.
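For example, something along these lines (just a sketch of the kind of guard I mean, assuming Julia 0.7+ where BLAS sits under LinearAlgebra; the sizes are arbitrary):
using LinearAlgebra

# With 4 Julia threads, pin BLAS to a single thread so the matrix products
# inside the threaded loop use 4 x 1 threads in total and don't oversubscribe.
BLAS.set_num_threads(1)

W    = randn(512,512)
cols = [randn(512) for _ in 1:64]
out  = Vector{Vector{Float64}}(undef,64)

Threads.@threads for i = 1:64
    out[i] = W*cols[i]   # single-threaded BLAS call inside a Julia-threaded loop
end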
On a similar note, couldn't we also create in-place apply functions for the normalization layers? I tried adding this and the derivative tests failed, so I assume it isn't always going to be possible. But in some places, like the example below, it seems like it shouldn't be an issue because the data is being overwritten anyway.
https://github.com/XtractOpen/Meganet.jl/blob/master/src/layers/singleLayer.jl#L37
Maybe generate apply!
I think an in-place apply function would be good to have for other Meganet elements as well. In particular, I'm talking about those that do not change the dimension of the input features.
Why don't we rename the current method to apply!, with the understanding that it'll overwrite the input features? Then let's also add a small wrapper apply that only copies the input features and then calls apply!.
I'm not sure if this works, but what we could also do is have apply! accept an argument that gets overwritten with the output features. This is similar to A_mul_B! in Julia. This version would also allow pre-allocation for Meganet elements that change the dimension of the input features. Maybe this is a better option?
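Roughly, something like this (only a sketch with a placeholder TanhLayer type, not the actual Meganet API; the real signatures would of course also carry the weights and derivative flags):
# Sketch only: TanhLayer stands in for a Meganet element that keeps the feature dimension.
struct TanhLayer end

# in-place version: writes into a pre-allocated output array, in the spirit of A_mul_B!
function apply!(Yout::Array{T},layer::TanhLayer,Yin::Array{T}) where {T<:Real}
    Yout .= tanh.(Yin)
    return Yout
end

# thin allocating wrapper: pays for the allocation only when asked to
apply(layer::TanhLayer,Yin::Array{T}) where {T<:Real} = apply!(similar(Yin),layer,Yin)
For elementwise elements like this one, apply!(Y,layer,Y) would then also give a fully overwriting variant.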
Hi Guys,
I wasn't aware of the conversations here, but here is a suggestion for you to check:
function reluActivation!(A::Array{T},dA::Array{T},doDerivative::Bool=false) where {T}
    A .= max.(A,zero(T));
    if doDerivative
        dA .= sign.(A);
    else
        dA = zeros(T,0)   # no derivative requested: rebind locally and return an empty array
    end
    return A,dA
end
The .= operator does the assignment in place (no allocation), and internally it uses vectorization and some multithreading. I think that this is the best way to go; I have had a really good experience with it so far. I only found out about it recently.
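A quick way to see the difference on arrays of the size used in the benchmark above (same BenchmarkTools setup as before):
using BenchmarkTools

Y = randn(10*768,2*512)

@btime max.($Y,0.0);        # allocates a fresh output array on every call
@btime $Y .= max.($Y,0.0);  # fused in-place broadcast: essentially no allocation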
If you wish to work with Threads, that's an option, but in my experience they are far less efficient than OpenMP threads in C. I still think that Julia workers are the best way (maybe not use all possible workers, and leave some cores for the automatic internal multithreading in BLAS and the like). I can explain this further on Skype if you wish.
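If we go the workers route, the elementwise updates would need shared memory; a rough, unbenchmarked sketch (assuming a recent Julia where this lives in the Distributed and SharedArrays stdlibs):
using Distributed
addprocs(4)                    # or start Julia with `julia -p 4`
using SharedArrays

S = SharedArray{Float64}(10*768,2*512)
S .= randn(size(S)...)

# each worker writes tanh into its own block of indices of the shared array
@sync @distributed for k = 1:length(S)
    @inbounds S[k] = tanh(S[k])
end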
Maybe we should write two versions of the activation functions (and other ones): one that allocates and one that operates in place. See the following example that Eran and I have put together.