ducha-aiki / caffenet-benchmark

Evaluation of the CNN design choices performance on ImageNet-2012.

What I will test next #1

Open ducha-aiki opened 8 years ago

ducha-aiki commented 8 years ago

Suggestions and training logs from the community are welcome.

ghost commented 8 years ago

Is this worth testing?

"Deep Learning with S-shaped Rectified Linear Activation Units"

http://arxiv.org/abs/1512.07030

ducha-aiki commented 8 years ago

Well, I don't like activations that cannot be computed "in-place", but if SReLU gets implemented in Caffe, I will test it.

Darwin2011 commented 8 years ago

@ducha-aiki

I have learned a lot from this caffenet-benchmark of yours. Is there anything I can try to help speed up the benchmarks? Thanks

ducha-aiki commented 8 years ago

@Darwin2011 Thank you! I can see two ways you can help:

1) Do your own benchmark and make a PR, as @dereyly did.
2) In caffenet-style networks the real bottleneck, as I recently found, is not the GPU but the input data layer (https://github.com/BVLC/caffe/pull/2252), which does JPEG decompression AND real-time rescaling. A multicore implementation of it would help a lot; see the toy sketch below.
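A toy Python sketch of that idea, just to illustrate point 2 (it is not the actual patch from the linked PR; the paths, target size, and worker count are made up):

```python
import multiprocessing

import cv2  # OpenCV handles both JPEG decoding and rescaling


def decode_and_rescale(path, size=(128, 128)):
    """The per-image work the data layer does: decompress, then rescale."""
    img = cv2.imread(path)          # JPEG decompression
    return cv2.resize(img, size)    # real-time rescaling


def load_batch_parallel(paths, workers=4):
    """Spread the decode+rescale work over several CPU cores."""
    pool = multiprocessing.Pool(workers)
    try:
        return pool.map(decode_and_rescale, paths)
    finally:
        pool.close()
        pool.join()
```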

bhack commented 8 years ago
ducha-aiki commented 8 years ago

@bhack thanks, as usual ;)

ducha-aiki commented 8 years ago

@bhack yes, I have contacted that guy :) Now testing ResNet-101 (as in the paper, plus variants without scale/bias and without BN), without the last block, because the activation size is too small for it.

1adrianb commented 8 years ago

@ducha-aiki, did you have any luck with training ResNets on Caffe? Thanks

ducha-aiki commented 8 years ago

@1adrianb there is even one successful attempt here: https://github.com/ducha-aiki/caffenet-benchmark/blob/master/ResNets.md But in most of my trials they overfit a lot, which probably means that 128 px is not enough for them. And I think that after the paper "Identity Mappings in Deep Residual Networks" http://arxiv.org/abs/1603.05027 there is literally nothing left in the ResNet setup that the authors haven't checked. So I refer you to that excellent paper.

1adrianb commented 8 years ago

@ducha-aiki thanks a lot for your prompt answer and for the recommended paper. I saw in your implementation that you are not using the Scale layer to learn the parameters; is there any reason for not doing so, as in the model posted by Kaiming He?

ducha-aiki commented 8 years ago

@1adrianb I use it... sometimes. Here are the batchnorm tests; some of them use scale/bias, while others do not. The reason: test them all :) https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md
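For readers unfamiliar with the Caffe idiom under discussion: Caffe's BatchNorm layer only normalizes, and the learned gamma/beta come from a separate Scale layer with bias_term set. A minimal NetSpec sketch of the two variants (layer names and shapes are made up, not taken from batchnorm.md):

```python
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[64, 3, 128, 128]))
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3)

# Variant A: normalization only, no learned parameters.
n.bn1 = L.BatchNorm(n.conv1, in_place=True)

# Variant B: add a Scale layer to learn gamma, plus beta via bias_term.
n.scale1 = L.Scale(n.bn1, bias_term=True, in_place=True)
n.relu1 = L.ReLU(n.scale1, in_place=True)

print(n.to_proto())  # emit the prototxt for inspection
```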

1adrianb commented 8 years ago

@ducha-aiki makes sense :) Thanks again for your help!

ducha-aiki commented 8 years ago

Now VGG16 with all the tricks is training (to check whether they help weaker models only).

wangxianliang commented 8 years ago

Hi, have you tried the pre-activation ResNet described in http://arxiv.org/abs/1603.05027?

And what about Inception-ResNet-v2, described in http://arxiv.org/abs/1602.07261?

ducha-aiki commented 8 years ago

@wangxianliang Well, no and no, and I am not planning to in the near future. Reasons:

1) This benchmark is originally a test of "we propose a cool thing and test it on CIFAR-100" papers, or an evaluation of missing design choices. Both papers you cite already contain a good evaluation on full ImageNet.
2) They are both very time-consuming. Yes, I have tested VGGNet and GoogLeNet for reference, but mostly I test cheaper networks.
3) They have a complex structure... so if you could write a prototxt, it would greatly increase the chances that I run them in the meantime :) I would be grateful for the help.

ibmua commented 8 years ago

It would be a good idea to test the influence of the LSUV init batch size on large networks with highly variant data. In the paper you seem to have tested this only on a tiny tanh CIFAR-10 net with relatively tiny batches, and even there a positive correlation between batch size and performance could be seen. It seems obvious that the influence is much greater for large networks and highly variable data with, say, thousands of classification outputs. For example, I'm currently training a net to classify 33.5k mouse-painted symbols; that CIFAR-10 data tells me nothing.

Testing 6 different batches on my data:

poolconv_2 variance = 0.0515101 mean = 0.206088
poolconv_2 variance = 0.0517261 mean = 0.206058
poolconv_2 variance = 0.0521883 mean = 0.206072
poolconv_2 variance = 0.989368 mean = 0.22629
poolconv_2 variance = 0.995676 mean = 0.226181
poolconv_2 variance = 0.998411 mean = 0.22653

One of the reasons why testing batch influence is so important is that if it really improves things, we can switch to computing the LSUV variance not over a single mini-batch but over several iterations, and use, say, a 100k batch rather than 1000. Also, people don't really use tanh much nowadays, so personally I don't get why you used it for the batch-size benchmarking; it tells us very little.
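For reference, a minimal pycaffe sketch of the check above: push several different mini-batches through a net and print the variance/mean of one blob ('train_val.prototxt' and 'poolconv_2' are placeholders for your own setup):

```python
import caffe

# Large spread between batches suggests the LSUV batch is too small
# or the data is poorly shuffled.
net = caffe.Net('train_val.prototxt', caffe.TRAIN)
for _ in range(6):
    net.forward()  # each call pulls the next mini-batch from the data layer
    blob = net.blobs['poolconv_2'].data
    print('poolconv_2 variance = %g mean = %g' % (blob.var(), blob.mean()))
```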

ducha-aiki commented 8 years ago

I used tanh because LSUV worked the worst for it. My experience with ImageNet confirms that any batch size > 32 is OK for LSUV, if the data is shuffled. Your data looks imbalanced or not shuffled. If you run such a test on your dataset, I will definitely add it to this benchmark :)

ibmua commented 8 years ago

Yes, it actually is not very shuffled, which means that I have to use a larger batch here to get something like what I would get with a smaller one if the data were shuffled.

The problem with benchmarking is that I don't have the required stuff set up on my computer, and it would take me several hours to do that.

ibmua commented 8 years ago

It would be great if you noted the time it took for the different setups to converge. Both epochs and actual wall-clock time.

ducha-aiki commented 8 years ago

@ibmua for epochs it is very easy: 320K iterations everywhere :) As for times, it is hard (I am lazy), because some of the trainings consist of lots of save-load cycles with pauses between them. But I am going to do it... once :)

ibmua commented 8 years ago

Yeah, I guess that would just take

```python
import time
start = time.time()
# ... run the training segment ...
end = time.time()
```

plus logging the spent time to a separate file during saves and reading it back during loads. Not too hard. It would give a better sense of the difference between methods. For example, training with momentum is great, but it takes more time.
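A small sketch of that bookkeeping, assuming training happens in save/load segments (the file name and `run_segment` are hypothetical):

```python
import time

TIME_FILE = 'elapsed.txt'  # hypothetical side file holding total seconds


def read_elapsed(path=TIME_FILE):
    """Read the accumulated training time; zero if this is the first segment."""
    try:
        with open(path) as f:
            return float(f.read())
    except IOError:
        return 0.0


def time_segment(run_segment, path=TIME_FILE):
    """Time one save-to-save training segment and update the running total."""
    total = read_elapsed(path)
    start = time.time()
    run_segment()  # e.g. launch `caffe train` for this segment
    total += time.time() - start
    with open(path, 'w') as f:
        f.write(str(total))
    return total
```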

ducha-aiki commented 8 years ago

@ibmua

> import time
> start = time.time()
> end = time.time()

Do you really think I do it in Python? You are underestimating my laziness ;) Everything I do is edit *.prototxt, then run `caffe train --solver=current_net.prototxt > logfile.log`. Then I call https://github.com/BVLC/caffe/blob/master/tools/extra/plot_training_log.py.example with the default keys to get the graphs :)

ibmua commented 8 years ago

Wow, okay. Guess that would take editing plot_training_log.py.example then.

ibmua commented 8 years ago

Actually, easier: just measure a single forward-backward pass on the side and then multiply by the number of iterations from the log. Having info on whether things actually converged, or you just cut training off because it would have taken >320k iterations, would also help. And graphs would be a pleasant addition.
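A rough pycaffe sketch of that measurement ('train_val.prototxt' is a placeholder; Caffe also ships a `caffe time` tool that benchmarks forward/backward passes more carefully):

```python
import time

import caffe

caffe.set_mode_gpu()
net = caffe.Net('train_val.prototxt', caffe.TRAIN)
net.forward()   # warm-up pass
net.backward()

n = 50
start = time.time()
for _ in range(n):
    net.forward()
    net.backward()
per_iter = (time.time() - start) / n

# Extrapolate to the 320K iterations used throughout this benchmark.
print('~%.1f hours for 320K iterations' % (per_iter * 320000 / 3600.0))
```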

ibmua commented 8 years ago

It would be interesting to find out how ResNets perform with different activations. PReLU was co-authored by Kaiming He, a co-author of ResNets, but for some reason he used plain ReLUs in ResNets.

ducha-aiki commented 7 years ago
wangg12 commented 7 years ago

How about SELU, @ducha-aiki?

simbaforrest commented 7 years ago

@ducha-aiki In the SELU paper they seem to suggest an MSRA-like initialization for the weights, while your prototxt seems to use a fixed Gaussian initialization? Could this affect the result?

ducha-aiki commented 7 years ago

@simbaforrest I don't use fixed initialization, I use LSUV init https://arxiv.org/abs/1511.06422, which is stated on each page of this repo :) What is in the prototxt does not matter, because LSUV is run as a separate script. LSUV uses data to adjust the weight values, so if MSRA is optimal, then LSUV will output something similar to MSRA.
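For context, the core of LSUV is a data-driven rescaling loop; here is a minimal sketch (layer names, the prototxt path, and the tolerance are placeholders; the real procedure in the paper also starts from orthonormal pre-initialization):

```python
import numpy as np

import caffe

net = caffe.Net('train_val.prototxt', caffe.TRAIN)

TOL, MAX_TRIES = 0.02, 10
for name in ['conv1', 'conv2']:  # layers whose output blob shares their name
    for _ in range(MAX_TRIES):
        net.forward()                        # one mini-batch of real data
        var = net.blobs[name].data.var()
        if abs(var - 1.0) < TOL:
            break
        # Rescale the weights so the output variance moves toward 1.
        net.params[name][0].data[...] /= np.sqrt(var)
```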

simbaforrest commented 7 years ago

Got it. So I guess SELU might only be good for FNNs rather than CNNs?...


ducha-aiki commented 7 years ago

Well, it is still possible that SELU is very architecture-dependent.