denizyuret / Knet.jl

Koç University deep learning framework.
https://denizyuret.github.io/Knet.jl/latest

Test more cudnn functions (batchnorm, lstm, etc.) and use if faster. #177

Open denizyuret opened 7 years ago

denizyuret commented 7 years ago

At a high level, this is an interface that would be nice to have for RNNs:

(weights, state) = initrnn(input)
(output, state) = rnn(weights, input, state)
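As a usage sketch, processing a sequence one step at a time with this proposed interface might look as follows (initrnn and rnn do not exist yet; the sequence argument and the surrounding function are hypothetical):

# Hedged usage sketch for the proposed interface.
function predict(sequence)
    (weights, state) = initrnn(first(sequence))
    outputs = []
    for input in sequence
        (output, state) = rnn(weights, input, state)  # state is threaded across time steps
        push!(outputs, output)
    end
    return outputs
end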

state can be used to encapsulate various things, such as:

input dimensionality can be:

Question:

denizyuret commented 6 years ago

To implement RNNs from CUDNN we can follow the pattern in conv.jl (see rnn.jl for the evolving code, and RNN_example.cu and http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide from NVIDIA as references). This means we don't keep the cudnn descriptors around; we just create them on the fly whenever they are needed, so it is important to keep them lightweight (a sketch of such a descriptor wrapper follows below). Although we could also keep them around in the state variable. Questions:
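Here is a minimal sketch of a lightweight tensor descriptor wrapper created on the fly and cleaned up by a finalizer; the TD name and Cptr alias echo the conv.jl style, but the code is written against the public cudnn C API rather than copied from the source:

const Cptr = Ptr{Cvoid}                      # opaque cudnn descriptor pointer

# Lightweight wrapper: create the descriptor when needed, free it via a finalizer.
mutable struct TD
    ptr::Cptr
    function TD(dims::Vector{Cint}, strides::Vector{Cint}; dtype=0)  # 0 = CUDNN_DATA_FLOAT
        d = Cptr[0]
        ccall((:cudnnCreateTensorDescriptor, "libcudnn"), Cint, (Ptr{Cptr},), d)
        ccall((:cudnnSetTensorNdDescriptor, "libcudnn"), Cint,
              (Cptr, Cint, Cint, Ptr{Cint}, Ptr{Cint}),
              d[1], dtype, length(dims), dims, strides)
        td = new(d[1])
        finalizer(t -> ccall((:cudnnDestroyTensorDescriptor, "libcudnn"), Cint, (Cptr,), t.ptr), td)
        return td
    end
end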

denizyuret commented 6 years ago

Current design for the primitive rnn operation:

rnn(w,x,hx,cx,s; training=false) => (y,hy,cy,rs)
# hx,cx can be nothing
# s keeps mostly read-only info like numLayers
# rs is reserveSpace for the back functions, nothing for inference
# training determines whether inference or training is called

rnn_r=recorder(rnn)
rnn(w::Rec, x, hx, cx, s)=rnn_r(w,x,hx,cx,s; training=true)
# we assume w::Rec means we are training and call the recorder version

rnn(::Type{Grad{1}},dr,r,w,x,hx,cx,s) =
    ((y,hy,cy,rs)=r; (dy,dhy,dcy,drs)=dr; backData(); backWeights(); #= set s.dx, s.dhx, s.dcx =# return dw)
rnn(::Type{Grad{2}},dr,r,w,x,hx,cx,s)=s.dx
rnn(::Type{Grad{3}},dr,r,w,x,hx,cx,s)=s.dhx
rnn(::Type{Grad{4}},dr,r,w,x,hx,cx,s)=s.dcx
# we always need to call backData before backWeights. Here we do both in Grad{1} and record the results in s to be later retrieved by other Grad calls.
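For context, here is a minimal sketch of how this primitive would be exercised through AutoGrad during training; the loss function and variable names are hypothetical, and only rnn, grad, and the Rec-based dispatch above come from the design:

using Knet, AutoGrad

# Hypothetical loss: run the rnn primitive and sum its output y.
rnnloss(w, x, hx, cx, s) = sum(rnn(w, x, hx, cx, s)[1])

# grad returns a function computing dloss/dw.  When w is boxed as a Rec,
# the recorder version of rnn runs with training=true and the Grad{N}
# methods above supply dw, s.dx, s.dhx, s.dcx on the backward pass.
rnngrad = grad(rnnloss)
# dw = rnngrad(w, x, hx, cx, s)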

TODO:

denizyuret commented 6 years ago

Returning hy and cy (and in our case sometimes y) is optional. By not returning these we can save some time and memory. I propose defining versions of rnn which do not return them (and send C_NULL to the cuda calls). How about:

rnn3(...) => (y,hy,rs)
rnn2(...) => (y,rs)
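A minimal sketch of what these variants could look like; here they simply call the full primitive and drop the unwanted outputs, whereas the real versions would pass C_NULL for hy/cy in the cudnn calls so they are never computed at all:

rnn3(w, x, hx, cx, s; o...) = ((y, hy, cy, rs) = rnn(w, x, hx, cx, s; o...); (y, hy, rs))
rnn2(w, x, hx, cx, s; o...) = ((y, hy, cy, rs) = rnn(w, x, hx, cx, s; o...); (y, rs))
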
cangumeli commented 6 years ago

There are some cudnn datatypes (Filter and Tensor Descriptors) used in both rnn and cnn implementations. Should we refactor them into a new cudnn.jl file?

denizyuret commented 6 years ago

Sure. Sounds good.

denizyuret commented 6 years ago

There is now a working implementation under src/rnn.jl. I managed to replicate what the test script does in Julia under test/rnn.jl. Remaining tasks:

denizyuret commented 6 years ago

OK, rnn and batchnorm are done. Softmax and dropout are next to benchmark and integrate if they are worth it.

denizyuret commented 6 years ago

I integrated softmax, using it for logp. prof/softmax.jl shows about double the speed.
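For reference, a timing comparison in the spirit of prof/softmax.jl can be as simple as the sketch below; the array size, iteration count, and the logp(x,1) call over dimension 1 are assumptions, not the actual profiling script:

using Knet

x = KnetArray(randn(Float32, 1000, 100))   # 1000 classes x 100 instances (arbitrary)
logp(x, 1)                                  # warm up (first call compiles)
@time for i in 1:1000                       # rough wall-clock comparison; a careful
    logp(x, 1)                              # benchmark would also synchronize the GPU
end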

denizyuret commented 6 years ago

I think dropout and bias-add are the next likely candidates to improve speed. Could also test activation functions.
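For comparison, the hand-written baseline that a cudnn dropout call would be measured against looks roughly like the generic sketch below (not Knet's actual implementation):

using Random

# Inverted dropout: zero each element with probability p and rescale the
# survivors by 1/(1-p) so the expected activation is unchanged.
function dropout_baseline(x, p)
    p == 0 && return x
    mask = (rand!(similar(x)) .> p) ./ (1 - p)
    return x .* mask
end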