benchmarks with other dynamic frameworks

denizyuret commented 7 years ago

Let's benchmark and improve any shortcomings we find. We can track progress using this issue. Some resources:

ilkerkesen commented 7 years ago

I forked the benchmark repo and then I tried to install DyNet on our test machine but failed to run examples provided by DyNet. I opened an issue about it. Right now, I am reading Dynet and TF Fold papers.

denizyuret commented 7 years ago

Any progress? On Sun, Feb 12, 2017 at 2:46 PM İlker Kesen notifications@github.com wrote:

I forked the benchmark repo and then I tried to install DyNet on our test machine but failed to run examples provided by DyNet. I opened an issue https://github.com/clab/dynet/issues/298 about it. Right now, I am reading Dynet and TF Fold papers.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-279213010, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpgqR_wxhugQMnfeZ0_DSjnu7ufc7ks5rbvESgaJpZM4L-dFh .

ilkerkesen commented 7 years ago

I installed Chainer on our test machine successfully, still working for DyNet. I implemented first model (RNN language model) and I am able to compare it with Chainer and Theano (TensorFlow needs to be updated also in our machine). Right now, Knet completes a validation period in ~16 seconds while Chainer completes in ~13 seconds.

denizyuret commented 7 years ago

Enis'i de bu projeye assign ediyorum. ML projesi olarak benchmark'lar uzerinde birlikte calisabilirsiniz. Enis kernel implementation, Ilker Julia implementation uzerine konsantre olabilir.

ilkerkesen commented 7 years ago

I've completed all the benchmark implementations. I will continue with the profiling. Finally, I was able to install DyNet on our test machine; but I couldn't have made the benchmark examples work and I opened another issue on their benchmark repository. No response yet. Though, the official examples run without any problem.

I forked their repository, created knet branch and push Knet implementations to that branch. You can see Knet implementations from here. I did run Chainer and Knet examples on GPUs. Here is the results:

Chainer: word_per_sec=16150.1527, Knet: word_per_sec=13131.2709 (RNN language model)
Chainer: word_per_sec=157.5202, Knet: word_per_sec=271.5683 (BLSTM NER tagger)
Chainer: word_per_sec=128.5317, Knet: word_per_sec=232.5641 (BLSTM tagger with chars for rare words)
Chainer: sent_per_sec=10.1543, Knet: sent_per_sec=16.8104 (Tree-structured LSTM for sentiment classification)

I started another repository for easy benchmark profiling, but I have some small issues with profiling tools. When I solve them, it will be easy to profile those benchmarks for anyone.

Another thing, they did not perform minibatching for the implementations (except RNN language model), because minibatching is a nontrivial process for dynamic-shaped architectures. For instance, I still do not know how to construct minibatches for a tree-structured data in Knet. However, I think an experienced Knet user knows what to do in most situations. Let's think about the third benchmark example, Bidirectional LSTM Tagger with character-embeddings for rare words (link). In that example, you need to embed all the words before operating the main bidirectional LSTM network. If the word is a common word, you just multiply that word's one-hot vector with an embedding matrix. However, the word is a rare word (there's an occurrence threshold), you need to multiply the characters of that word with character embedding matrix and operate those character embeddings on another bidirectional LSTM network for encoding that word. Common and rare words in a minibatch do not have to align. So, if the programmer identifies common and rare words and creates new sub-minibatches in the embedding section, the implementation will handle minibatching on that example efficiently.

ilkerkesen commented 7 years ago

By the way, TF Fold handles minibatching in a different way. They introduced dynamic batching algorithm. They define two different terms, ops and operations. In the above example, embedding a word is an ops, propagating word/char embedding on the main bidirectional LSTM network is an operation. Ops are done individually (let's say batchsize=10, 6 common words and 4 rare words, embed 6 common words all together, embed 4 rare words with BLSTM all together) and then gathered (combine and send them to the main BLSTM network).

ilkerkesen commented 7 years ago

I switched from one-hot vectors to indexes in RNN language model and BiLSTM tagger examples. Here is the new results,

RNN lang. model: word_per_sec=15988.6631
BiLSTM tagger: word_per_sec=585.7604

Right now, our RNN language model example's timing is comparable with the others (Chainer=14.5k words second, DyNet=18.5k words per second reported in the paper) and DyNet's BiLSTM tagger is ~2 times faster than ours (they reported 1250 words per second for Dynet, 147 words per second for Chainer).

I will switch to indexing in the two other remaining examples; but first I need to think about how to take advantage of this new feature in those examples efficiently. I will also try to handle softmax matrix calculations in just a single matrix calculation by concatenating hidden states.

denizyuret commented 7 years ago

I added multiple KnetArray concatenation (multi-arg hcat and vcat) in the latest master. In my s2s example, it slowed the forward calculation 20%, improved forw+back about 10%. Did not profile yet.

On Mon, Mar 13, 2017 at 9:49 PM İlker Kesen notifications@github.com wrote:

I switched from one-hot vectors to indexes in RNN language model and BiLSTM tagger examples. Here is the new results,

RNN lang. model: word_per_sec=15988.6631

BiLSTM tagger: word_per_sec=585.7604

Right now, our RNN language model example's timing is comparable with the others (Chainer=14.5k words second, DyNet=18.5k words per second reported in the paper) and DyNet's BiLSTM tagger is ~1 times faster than ours (they reported 1250 words per second for Dynet, 147 words per second for Chainer).

I will switch to indexing in the two other remaining examples; but first I need to think about how to take advantage of this new feature in those examples efficiently. I will also try to handle softmax matrix calculations in just a single matrix calculation by concatenating hidden states.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-286206068, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpqVtWOH_M4c4JAKuVItYzqo5ToG9ks5rlY-lgaJpZM4L-dFh .

denizyuret commented 7 years ago

Ilker: about your idea for saving time for input embeddings: Given the cost of sum_outgrads I think there might be some gain with: (1) keeping the embedding matrix outside the parameter list, so autograd does not automatically generate a giant derivative matrix. (2) extracting embeddings for all words for a sequence-minibatch at the beginning of the iteration like you mentioned, and feeding this as one of the parameters. (3) feeding slices from this, to the lstm. (4) having update change this reduced matrix. (5) copying back the changes to the big matrix.

On Tue, Mar 14, 2017 at 10:38 PM Deniz Yuret denizyuret@gmail.com wrote:

I added multiple KnetArray concatenation (multi-arg hcat and vcat) in the latest master. In my s2s example, it slowed the forward calculation 20%, improved forw+back about 10%. Did not profile yet.

On Mon, Mar 13, 2017 at 9:49 PM İlker Kesen notifications@github.com wrote:

I switched from one-hot vectors to indexes in RNN language model and BiLSTM tagger examples. Here is the new results,

RNN lang. model: word_per_sec=15988.6631

BiLSTM tagger: word_per_sec=585.7604

Right now, our RNN language model example's timing is comparable with the others (Chainer=14.5k words second, DyNet=18.5k words per second reported in the paper) and DyNet's BiLSTM tagger is ~1 times faster than ours (they reported 1250 words per second for Dynet, 147 words per second for Chainer).

I will switch to indexing in the two other remaining examples; but first I need to think about how to take advantage of this new feature in those examples efficiently. I will also try to handle softmax matrix calculations in just a single matrix calculation by concatenating hidden states.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-286206068, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpqVtWOH_M4c4JAKuVItYzqo5ToG9ks5rlY-lgaJpZM4L-dFh .

ilkerkesen commented 7 years ago

We have 30.2239 sentences per second in Tree-structured LSTM example with array indexing right now. Before indexing it was 16 sentences per second. In the paper reported results are 90 sentences per second for DyNet and 7 sentencess per second for Chainer. I did get 10 sentences per second on aitest machine for Chainer treenn example.

denizyuret commented 7 years ago

Great news! Let's keep profiling and improving...

On Sat, Mar 18, 2017 at 5:02 PM İlker Kesen notifications@github.com wrote:

We have 30.2239 words per second in Tree-structured LSTM example with array indexing right now. Before indexing it was 16 words per second. In the paper reported results are 90 words per second for DyNet and 7 words per second for Chainer. I did get 10 words per second on aitest machine for Chainer treenn example.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-287548242, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpo40jMUwDdyVLoBg6wBGwl2ey2ohks5rm-PhgaJpZM4L-dFh .

denizyuret commented 7 years ago

Are we able to do cpu comparisons with dynet? On Sat, Mar 18, 2017 at 5:28 PM Deniz Yuret denizyuret@gmail.com wrote:

Great news! Let's keep profiling and improving...

On Sat, Mar 18, 2017 at 5:02 PM İlker Kesen notifications@github.com wrote:

We have 30.2239 words per second in Tree-structured LSTM example with array indexing right now. Before indexing it was 16 words per second. In the paper reported results are 90 words per second for DyNet and 7 words per second for Chainer. I did get 10 words per second on aitest machine for Chainer treenn example.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-287548242, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpo40jMUwDdyVLoBg6wBGwl2ey2ohks5rm-PhgaJpZM4L-dFh .

ilkerkesen commented 7 years ago

Yes, we have DyNet installed on aitest. Let me do it in this night.

denizyuret commented 7 years ago

Did you try your idea of feeding a subset of the embedding matrix as a parameter each iteration?

On Sat, Mar 18, 2017 at 7:56 PM İlker Kesen notifications@github.com wrote:

Yes, we have DyNet installed on aitest. Let me do it in this night.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-287559101, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpvzuMBlkIxwVKl8HJ7wE_JZv3zTAks5rnAy0gaJpZM4L-dFh .

ilkerkesen commented 7 years ago

I think I need to think about it again. Apart from Adam optimization, in the simplest case (vanilla SGD) we need to get subembedding array like that,

subembed = embed[batch_inds,:]

Then, we need to update it after update! call,

embed[batch_inds,:] = subembed

I applied that idea to Adam optimization. In that case I perform the operations above not just for embedding matrix but also history matrices. It made worse (~5k words per second while we were already doing ~16k words per second with previous setting).

I also performed CPU benchmarks for DyNet, Chainer and Knet. DyNet is really good on CPU machines. DyNet is built on top of Eigen and this might be the reason (TensorFlow also uses that).

RNN language model

DyNet, word_per_sec=3147.6938 (dense updates), (I will also share here sparse update results)
Knet, word_per_sec=1116.9082
Chainer, word_per_sec=1224.6834

BiLSTM Tagger (without chars)

DyNet, word_per_sec=492.2673 (dense updates), word_per_sec=9509.4333 (sparse updates)
Knet, word_per_sec=328.1794
Chainer, word_per_sec=23.5120

BiLSTM Tagger (with chars)

DyNet, word_per_sec=469.7510 (dense updates), word_per_sec=7420.1764
Knet, word_per_sec=43.3591 (but work in progress for indexing)
Chainer, (do not have currently because there's an error, I will fix it)

Tree-structured LSTM

DyNet, sent_per_sec=121.1918 (dense updates), sent_per_sec=205.0839 (sparse updates)
Knet, sent_per_sec=1.2532
Chainer, sent_per_sec=5.6836

My implementations really fail on CPUs. I think I need to concentrate on CPU profiling (on bilstm-char and treelstm) and also I need to check whether DyNet benchmark examples take advantage of other fancy features like multi-core processing or not.

ilkerkesen commented 7 years ago

In RNN language model I've also tried my other idea (concatenate all timesteps' indices, get embeddings from those indices and then take slices like embed[(t-1)*batchsize+1:t*batchsize-1,:] per timestep) it didn't give me any improvement (~5.5k words per second), too.

I performed a basic benchmarking on CPU array slicing by taking slices from rows and columns with several scenarios. Results are here. Obtaining slices from columns is much more efficient (but we already use it in LSTM). So, I tried also column slicing+transpose, and it is more efficient than taking slices from rows if we want to obtain many vectors. If we take just a row vector or just a column vector, it does not matter that much but obtaining column vector is a little bit faster. Therefore, I tried to take its transpose and reshape it, but those ideas made it worse.

ilkerkesen commented 7 years ago

Right now all benchmark examples use indexing and here is the result for remaining example,

BiLSTM tagger with chars, word_per_sec=456.1816 (it was word_per_sec=232.5641 previously)

I've tried several things on that example but failed. Now I will focus on profiling.

ilkerkesen commented 7 years ago

Same indexing procedure did not give good results for CPU on BiLSTM tagger with chars, now I get word_per_sec=5.3454. However, in my previous setting I was getting that cudaMemcpy 77 error again and again and I was not able to replicate that error in a small example. I will try to find a good indexing procedure which works good for both GPU and CPU and does not give cudaMemcpy error.

denizyuret commented 7 years ago

Let me know if you find a way to replicate 77. Typically once you get it you need to restart Julia or you will keep getting it. On Mon, Mar 20, 2017 at 10:55 PM İlker Kesen notifications@github.com wrote:

Same indexing procedure did not give good results for CPU on BiLSTM tagger with chars, now I get word_per_sec=5.3454. However, in my previous setting I was getting that cudaMemcpy 77 error again and again and I was not able to replicate that error in a small example. I will try to find a good indexing procedure which works good for both GPU and CPU and does not give cudaMemcpy error.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-287878569, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpum0UNqtTu6UEWODn-I_wNxfHVx3ks5rntmugaJpZM4L-dFh .

ilkerkesen commented 7 years ago

Finally I am able to replicate 77 error.

using Knet
karr = KnetArray(randn(Float32, 4,5))
ksub = karr[Int32[],:]
karr[Int32[2,3],:]

denizyuret commented 7 years ago

Eline saglik. Did you figure out why we got a 77 error instead of some divide by 0 error, or why we only sometimes got the error? Joys of cuda programming... Hopefully we won't have this problem again.

On Wed, Mar 22, 2017 at 12:17 AM İlker Kesen notifications@github.com wrote:

Finally I am able to replicate 77 error.

using Knet karr = KnetArray(randn(Float32, 4,5)) ksub = karr[Int32[],:] karr[Int32[2,3],:]

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-288221496, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpvzUs_eL6STcEocqV5H54nl0QdOCks5roD5LgaJpZM4L-dFh .

ilkerkesen commented 7 years ago

I think the behavior is not identical with C, so we don't see any divide by zero error. However, we get a nonsense value as a result. Then, we perform an illegal memory address. CUDART does not give us some error, because this happens in a kernel already called. Then, we try to do something via CUDART and we see the error eventually (my assumptions).

I will share the current results for BiLSTM tagger with chars example today as well as with profiling observations.

ilkerkesen commented 7 years ago

New results for BiLSTM Tagger with chars example,

GPU: word_per_sec=504.6929 (previously word_per_sec=456.1816)
CPU: word_per_sec=47.0973 (previously word_per_sec=43.3591)

Finally I share my profiling observations. I did implement tree-structured network in a recursive manner, so it was not easy to profile that example and I am not so confident about profiling of that. We discussed about bottlenecks in seq2seq model and yes, different architectures comes with different bottlenecks.

RNN Language Model (forward)

Only forward on GPU, 56% logp, 44% operations happen in LSTM + gemm before logp
Only forward on CPU, 75% logp
Forw+back+update on GPU, 31% forw (40% logp, %60 LSTM ops), 65% back (43% sum_outgrads, 10% logp, 35% gemm, 12% others), 4% update
Forw+back+update on CPU, 25% forw (75% logp, 25% others), 71% back (23% sum_outgrads, 26% logp, 51% other parts), 4% update

BiLSTM Tagger

Only forward on GPU, 75% gemm+eltwise+bcast+unary, 25% logp
Only forward on CPU, 10% logp, %90 gemm+eltwise+bcast+unary
Forw+back+update on GPU, 43% forw (24% logp, %76 gemm+eltwise+bcast+unary), 57% back (38% sum_outgrads, 8% logp, 18% gemm, %36 mostly broadcasting), not much update
Forw+back+update on CPU, 39% update, 20% forw, 40% back (%57 sum_outgrads)

BiLSTM Tagger with chars

Only forward on GPU, 33% logp, 67% gemm+eltwise+bcast+unary but mostly gemm
Only forward on CPU, 5% logp, 95% gemm+eltwise+bcast+unary
Forw+back+update on GPU, 51% forw (21% logp, %79 encoder), 49% back (44% sum_outgrads, 6% logp, 50% other)
Forw+back+update on CPU, 5% forw, 64% back (91% sum_outgrads), 31% update

Tree-structured LSTM

Only forward on GPU, 40% logp, 16% leaf lstm, 35% parent lstm, 9% gemm before softmax
Only forward on CPU, 57% parent lstm, 29% leaf lstm, 5% logp, 9% gemm before softmax
Forw+back+update on GPU, 42% forw (38% logp), 55% (38% sum_outgrads)
Forw+back+update on CPU, 14% update, 1% forw, 85% back (97% sum_outgrads)

denizyuret commented 7 years ago

I tested overwriting sum_outgrads with s2s but did not see a significant difference. Can you try it on your examples? You need to use the "dev" branch for AutoGrad and Knet: Pkg.checkout("AutoGrad","dev"); Pkg.checkout("Knet","dev"); Pkg.build("Knet").

On Wed, Mar 22, 2017 at 11:07 PM İlker Kesen notifications@github.com wrote:

New results for BiLSTM Tagger with chars example,

GPU: word_per_sec=504.6929 (previously word_per_sec=456.1816)

CPU: word_per_sec=47.0973 (previously word_per_sec=43.3591)

Finally I share my profiling observations. I did implement tree-structured network in a recursive manner, so it was not easy to profile that example and I am not so confident about profiling of that. We discussed about bottlenecks in seq2seq model and yes, different architectures comes with different bottlenecks. RNN Language Model (forward)

Only forward on GPU, 56% logp, 44% operations happen in LSTM + gemm before logp

Only forward on CPU, 75% logp

Forw+back+update on GPU, 31% forw (40% logp, %60 LSTM ops), 65% back (43% sum_outgrads, 10% logp, 35% gemm, 12% others), 4% update

Forw+back+update on CPU, 25% forw (75% logp, 25% others), 71% back (23% sum_outgrads, 26% logp, 51% other parts), 4% update

BiLSTM Tagger

Only forward on GPU, 75% gemm+eltwise+bcast+unary, 25% logp

Only forward on CPU, 10% logp, %90 gemm+eltwise+bcast+unary

Forw+back+update on GPU, 43% forw (24% logp, %76 gemm+eltwise+bcast+unary), 57% back (38% sum_outgrads, 8% logp, 18% gemm, %36 mostly broadcasting), not much update

Forw+back+update on CPU, 39% update, 20% forw, 40% back (%57 sum_outgrads)

BiLSTM Tagger with chars

Only forward on GPU, 33% logp, 67% gemm+eltwise+bcast+unary but mostly gemm

Only forward on CPU, 5% logp, 95% gemm+eltwise+bcast+unary

Forw+back+update on GPU, 51% forw (21% logp, %79 encoder), 49% back (44% sum_outgrads, 6% logp, 50% other)

Forw+back+update on CPU, 5% forw, 64% back (91% sum_outgrads), 31% update

Tree-structured LSTM

Only forward on GPU, 40% logp, 16% leaf lstm, 35% parent lstm, 9% gemm before softmax

Only forward on CPU, 57% parent lstm, 29% leaf lstm, 5% logp, 9% gemm before softmax

Forw+back+update on GPU, 42% forw (38% logp), 55% (38% sum_outgrads)

Forw+back+update on CPU, 14% update, 1% forw, 85% back (97% sum_outgrads)

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-288522773, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpnz8aVkbr1NlBSsTGSI6TI0cpHAsks5roX91gaJpZM4L-dFh .

denizyuret commented 7 years ago

Also, if you can figure out which methods of sumoutgrads (profile should tell you this) and which calls (can probably get that by debug printing out the sizes of inputs) that would help. On Thu, Mar 23, 2017 at 8:16 PM Deniz Yuret denizyuret@gmail.com wrote:

I tested overwriting sum_outgrads with s2s but did not see a significant difference. Can you try it on your examples? You need to use the "dev" branch for AutoGrad and Knet: Pkg.checkout("AutoGrad","dev"); Pkg.checkout("Knet","dev"); Pkg.build("Knet").

On Wed, Mar 22, 2017 at 11:07 PM İlker Kesen notifications@github.com wrote:

New results for BiLSTM Tagger with chars example,

GPU: word_per_sec=504.6929 (previously word_per_sec=456.1816)

CPU: word_per_sec=47.0973 (previously word_per_sec=43.3591)

Finally I share my profiling observations. I did implement tree-structured network in a recursive manner, so it was not easy to profile that example and I am not so confident about profiling of that. We discussed about bottlenecks in seq2seq model and yes, different architectures comes with different bottlenecks. RNN Language Model (forward)

Only forward on GPU, 56% logp, 44% operations happen in LSTM + gemm before logp

Only forward on CPU, 75% logp

Forw+back+update on GPU, 31% forw (40% logp, %60 LSTM ops), 65% back (43% sum_outgrads, 10% logp, 35% gemm, 12% others), 4% update

Forw+back+update on CPU, 25% forw (75% logp, 25% others), 71% back (23% sum_outgrads, 26% logp, 51% other parts), 4% update

BiLSTM Tagger

Only forward on GPU, 75% gemm+eltwise+bcast+unary, 25% logp

Only forward on CPU, 10% logp, %90 gemm+eltwise+bcast+unary

Forw+back+update on GPU, 43% forw (24% logp, %76 gemm+eltwise+bcast+unary), 57% back (38% sum_outgrads, 8% logp, 18% gemm, %36 mostly broadcasting), not much update

Forw+back+update on CPU, 39% update, 20% forw, 40% back (%57 sum_outgrads)

BiLSTM Tagger with chars

Only forward on GPU, 33% logp, 67% gemm+eltwise+bcast+unary but mostly gemm

Only forward on CPU, 5% logp, 95% gemm+eltwise+bcast+unary

Forw+back+update on GPU, 51% forw (21% logp, %79 encoder), 49% back (44% sum_outgrads, 6% logp, 50% other)

Forw+back+update on CPU, 5% forw, 64% back (91% sum_outgrads), 31% update

Tree-structured LSTM

Only forward on GPU, 40% logp, 16% leaf lstm, 35% parent lstm, 9% gemm before softmax

Only forward on CPU, 57% parent lstm, 29% leaf lstm, 5% logp, 9% gemm before softmax

Forw+back+update on GPU, 42% forw (38% logp), 55% (38% sum_outgrads)

Forw+back+update on CPU, 14% update, 1% forw, 85% back (97% sum_outgrads)

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-288522773, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpnz8aVkbr1NlBSsTGSI6TI0cpHAsks5roX91gaJpZM4L-dFh .

ilkerkesen commented 7 years ago

OK, I will perform a more detailed profiling analysis but the last changes did not give me any improvement on BiLSTM tagger with chars example (on CPU). I will try this on Tree model but we need to wait because one epoch training it takes ~2 hours on CPU. I also tried it on RNN language model example (GPU), I couldn't obtain any improvement on that example, too.

denizyuret commented 7 years ago

So either my code isn't working, or extra allocations do not cost us anything.... On Thu, Mar 23, 2017 at 10:07 PM İlker Kesen notifications@github.com wrote:

OK, I will perform a more detailed profiling analysis but the last changes did not give me any improvement on BiLSTM tagger with chars example (on CPU). I will try this on Tree model but we need to wait because one epoch training it takes ~2 hours on CPU. I also tried it on RNN language model example (GPU), I couldn't obtain any improvement on that example, too.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-288828809, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpi9N5oz2WaP09xR6dUT5ILWlE2OQks5rosLtgaJpZM4L-dFh .

ilkerkesen commented 7 years ago

New result for Tree-LSTM is sent_per_sec=1.9579 (previously 1.2532). But I think I need to run those examples more than once.

denizyuret commented 7 years ago

I tried column major instances and merging of all inputs and outputs and got dramatic speedup in rnnlm. I can see 40K wps on rnnlm when single iter benchmarking and 35K wps when processing 1M tokens. The biggest factor is reductions like logp are apparently faster when vertical. This more than offsets the inefficiency of vertical slicing in LSTM. I am not sure about the net effect of split/merge of all inputs and outputs, need further testing. We also need to re-profile the backward pass...

On Fri, Mar 24, 2017 at 11:20 AM İlker Kesen notifications@github.com wrote:

New result for Tree-LSTM is sent_per_sec=1.9579 (previously 1.2532). But I think I need to run those examples more than once.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-288960430, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNplKwczMJpBwfV0BvtCLacPvrpDXhks5ro3zigaJpZM4L-dFh .

denizyuret commented 7 years ago

The split/merge trick may not give a speed up when sequence length keeps changing. We keep requesting large matrices of different sizes which is not efficient. Will test to make sure.

On Sun, Mar 26, 2017, 01:59 Deniz Yuret denizyuret@gmail.com wrote:

I tried column major instances and merging of all inputs and outputs and got dramatic speedup in rnnlm. I can see 40K wps on rnnlm when single iter benchmarking and 35K wps when processing 1M tokens. The biggest factor is reductions like logp are apparently faster when vertical. This more than offsets the inefficiency of vertical slicing in LSTM. I am not sure about the net effect of split/merge of all inputs and outputs, need further testing. We also need to re-profile the backward pass...

On Fri, Mar 24, 2017 at 11:20 AM İlker Kesen notifications@github.com wrote:

New result for Tree-LSTM is sent_per_sec=1.9579 (previously 1.2532). But I think I need to run those examples more than once.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-288960430, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNplKwczMJpBwfV0BvtCLacPvrpDXhks5ro3zigaJpZM4L-dFh .

ilkerkesen commented 7 years ago

Column-major RNN LM does not show me any improvement. I still get ~15k words per second. I will try those tricks though.

ilkerkesen commented 7 years ago

I've just made my BiLSTM example column-majored and now I see 635 words per second. It was 586 words per second previously. I'll make other two remaining examples column-majored and see what happens.

ilkerkesen commented 7 years ago

Making Tree-structured LSTM example column-majored didn't give me any improvement. I applied a trick to BiLSTM tagger and now our performance is 654 words per second on that example. I haven't made BiLSTM tagger with chars example column-majored yet. I am also trying Enis's broadcast kernels. Results with recently introduced bcast kernels as follows,

RNN language model (row-major): word_per_sec=16571.9309 (previously word_per_sec=15988.6631)
BiLSTM tagger (column-major): word_per_sec=675.7658 (previously word_per_sec=~654)
Tree-structured LSTM (column-major, it does not differ much though): sent_per_sec=32.1688 (previously sent_per_sec=30.2239)

denizyuret commented 7 years ago

Did you guys go over Enis's kernels and find the bug with the for loop one? Also let's convert the new kernel calling C functions to take block/thread as args so we can optimize. On Mon, Mar 27, 2017 at 8:53 PM İlker Kesen notifications@github.com wrote:

Column-major RNN LM does not show me any improvement. I still get ~15k words per second. I will try those tricks though.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-289422364, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpmTYzhu_QYY5dmSknGkEYcUFXSUKks5rp5hBgaJpZM4L-dFh .

ilkerkesen commented 7 years ago

We detected the problematic part but couldn't have solve it, Enis will handle that. Getting 77 error. His unrolled kernel seemed OK but the other one is problematic. Maybe we can generate many unrolled kernels for N-D case up to some limit?

denizyuret commented 7 years ago

I think we should unroll up to length 5, then use the generic kernel for higher dimensions if we can get it to work.

On Mon, Mar 27, 2017 at 9:09 PM İlker Kesen notifications@github.com wrote:

We detected the problematic part but couldn't have solve it, Enis will handle that. Getting 77 error. His unrolled kernel seemed OK but the other one is problematic. Maybe we can generate many unrolled kernels for N-D case up to some limit?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-289536399, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpkRGsBfSYshpornuNZPGzDybuuwWks5rp_s_gaJpZM4L-dFh .

ilkerkesen commented 7 years ago

By the way, I've tried different blocksize/threadperblock settings (in just indexing) but couldn't achieve significant performance improvement in RNNLM. I think if array is big enough, then it does not matter that much. However, we may waste some threads in certain cases. For instance, I guess my embedding to subembedding thing didn't worked because of that (result arrays might be bigger in that case).

denizyuret commented 7 years ago

What perplexity are we supposed to get with rnnlm on 10k vocab ptb? On Mon, Mar 27, 2017 at 11:32 PM İlker Kesen notifications@github.com wrote:

By the way, I've tried different blocksize/threadperblock settings (in just indexing) but couldn't achieve significant performance improvement in RNNLM. I think if array is big enough, then it does not matter that much. However, we may waste some threads in certain cases. For instance, I guess my embedding to subembedding thing didn't worked because of that (result arrays might be bigger in that case).

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-289577046, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpsXjtCH5AtAB5Rg90_Ye3k6Yxte1ks5rqBz3gaJpZM4L-dFh .

ilkerkesen commented 7 years ago

I've just seen this project,

http://parallelacceleratorjl.readthedocs.io/en/latest/

ilkerkesen commented 7 years ago

I've looked at sum_outgrads more detailed and here's my observations,

In CPU, mostly memory allocation (zeroslike, 95%)
In GPU, 46% add_cols kernel, 13% memory allocation, 24% axpy!, 8% add_rows, remaining is not important

These observations came from column-majored implementation. Let me look at row-majored implementation also because of that difference between add_cols and add_rows kernels.

ilkerkesen commented 7 years ago

I did run PyTorch examples. Here is the results,

RNNLM: word_per_sec=57596.8718 (Knet=~16k)
BiLSTM tagger withchar: word_per_sec=1641.4528 (Knet=~655)
BiLSTM tagger: word_per_sec=1641.0227 (Knet=~570)

I think BiLSTM results are very fast but reliable. However, RNNLM implementation is super fast and I don't know why. I checked if the examples take advantage of half precision float numbers, but they do not currently.

Anyway, I checked their project page and the authors indicated that they're using cudnn and other torch libraries (thc, thcunn for CUDA). In RNNLM, there are three different main parts: input lookup table), LSTM module and output (softmax). I am not sure which part uses which library, but I did give a look to THCUNN's softmax kernel and it's a long fused version of our kernels (bcast, reduce, unary ops).

We currently don't have any issue for softmax kernel discussion. I can try two different things,

Integrate THCUNN's softmax kernel
Try to take advantage of cuDNN's softmax interface (but I am not sure whether that row/column major thing will be a problem or not)

ilkerkesen commented 7 years ago

I figured out, that PyTorch RNNLM example was cheating a little bit. Its recurrent module was Vanilla RNN instead of LSTM. I made it LSTM, but it is still very very fast (51k words per second). PyTorch takes advantage of cuDNN's RNN/LSTM kernels. However, there's no clue for usage of cuDNN softmax kernel.

jekbradbury commented 7 years ago

Hey, PyTorch and Chainer contributor here. I'm really excited about Knet and Julia deep learning in general, especially because Julia can solve many of the problems we keep running into in Python around speed and the need for C extensions. The kinds of benchmarks you're doing are useful, but I think you should also compare with PyTorch with torch.backends.cudnn.enabled = False so you aren't trying to beat cuDNN. Another thing that would be great to see is benchmarks with very low-compute networks (e.g. 100 sigmoids in a row on a tensor of size 5, then backprop) in order to isolate autograd overhead (which is ~200μs per call for Chainer because it's pure Python and ~20μs for PyTorch and DyNet because they use C++ extensions).

denizyuret commented 7 years ago

Thanks for the feedback!

We use cudnn for convolution as well but not for much else right now.

Does cudnn=false use other kernels in pytorch or no gpu? AutoGrad overhead with a shallow mlp was around 10% so we did not prioritize it but it is a good idea to isolate and compare with alternatives. Also in the list is to try other gradient packages in Julia (e.g. ReverseDiff) which may be more efficient.

Finally we still cannot replicate the Dynet benchmarks and got no response to our issue requests. Did you have any luck with it? On Fri, Apr 28, 2017 at 3:25 AM jekbradbury notifications@github.com wrote:

Hey, PyTorch and Chainer contributor here. I'm really excited about Knet and Julia deep learning in general, especially because Julia can solve many of the problems we keep running into in Python around speed and the need for C extensions. The kinds of benchmarks you're doing are useful, but I think you should also compare with PyTorch with torch.backends.cudnn.enabled = False so you aren't trying to beat cuDNN. Another thing that would be great to see is benchmarks with very low-compute networks (e.g. 100 sigmoids in a row on a tensor of size 5, then backprop) in order to isolate autograd overhead (which is ~200μs per call for Chainer because it's pure Python and ~20μs for PyTorch and DyNet because they use C++ extensions).

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-297874702, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpgsRt2wEXyeOgmzKfmcWZL1IjJPiks5r0THjgaJpZM4L-dFh .

jekbradbury commented 7 years ago

Setting cudnn.enabled to False will use other GPU kernels; you have to remove the .cuda() calls if you want to use CPU. I think your best bet for the DyNet benchmarks is to contact Graham directly (e.g. on Twitter twitter.com/gneubig).

denizyuret commented 7 years ago

TODO:

https://github.com/JuliaML/Roadmap.jl/issues : other packages, future organization...
https://github.com/JuliaGPU/GPUArrays.jl : possible alternative for KnetArray, better kernels...
https://github.com/JuliaDiff/ReverseDiff.jl : possible alternative for AutoGrad...
CUDNN v0.6 : use for more than convolution: batchnorm, lstm, etc.

denizyuret commented 6 years ago

New benchmark: https://www.reddit.com/r/MachineLearning/comments/776inl/p_rnn_in_5_different_frameworks/

ilkerkesen commented 6 years ago

I'm done with implementations. I will run them on Julia v0.6 and share the results.

ilkerkesen commented 6 years ago

Current results on cn2 (with cudnn/without cudnn),

rnnlm: 22.5k/14.5k (words per second)
bilstm tagger: 5.7k/0.6k (words per second)
bilstm tagger withchars: 1.3k/0.5k (words per second)

denizyuret commented 6 years ago

How does this compare to others? Our old results? Anything I can help improve? Is there a PR for DynetBenchmarks or Knet examples?

On Sun, Dec 3, 2017, 00:36 İlker Kesen notifications@github.com wrote:

Current results on cn2 (with cudnn/without cudnn),

rnnlm: 22.5k/14.5k (words per second)

bilstm tagger: 5.7k/0.6k (words per second)

bilstm tagger withchars: 1.3k/0.5k (words per second)

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/denizyuret/Knet.jl/issues/78#issuecomment-348721957, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvNpsKnTGrTMlWQE78MnwAKoEff7bX0ks5s8cL3gaJpZM4L-dFh .

denizyuret / Knet.jl