Open ducha-aiki opened 8 years ago
Is this worth testing?
"Deep Learning with S-shaped Rectified Linear Activation Units"
Well, I don't like activations that can't be done "in-place", but if SReLU gets implemented in caffe, I will test it.
@ducha-aiki
I have learned a lot from your caffenet-benchmark. Is there anything I can do to help speed up the benchmarks? Thanks
@Darwin2011 Thank you! I can see two ways you can help: 1) Do your own benchmark and make a PR, as @dereyly has done. 2) In caffenet-style networks the real bottleneck, as I recently found, is not the GPU but the input data layer https://github.com/BVLC/caffe/pull/2252, which does jpeg decompression AND real-time rescaling. A multicore implementation of it would help a lot.
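The idea behind that PR can be sketched roughly like this (a hypothetical stand-in, not the actual Caffe code; `decode_and_resize` here just fakes the cv2.imread + cv2.resize work, which in real OpenCV releases the GIL, so threads give an actual multicore speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_and_resize(path):
    # placeholder for jpeg decompression + real-time rescaling to 128 px;
    # a real version would return a decoded, resized image array
    return (path, 128, 128)

def load_batch(paths, workers=4):
    # spread decoding over several workers so the GPU is not starved
    # by the input layer; map() preserves the input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_and_resize, paths))

batch = load_batch(["img_%d.jpg" % i for i in range(8)])
```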
Take a look at https://github.com/KaimingHe/deep-residual-networks/
@bhack thanks, as usual ;)
@bhack yes, I have contacted that guy :) Now testing ResNet101 (as in paper, + without scale/bias, + without BN) without last block, because activation size is too small for it.
@ducha-aiki, did you have any luck with training ResNets in caffe? Thanks
@1adrianb there is even one successful attempt here https://github.com/ducha-aiki/caffenet-benchmark/blob/master/ResNets.md But in most of my trials they overfit a lot, which probably means that 128 px is not enough for them. And I think that after the paper "Identity Mappings in Deep Residual Networks" http://arxiv.org/abs/1603.05027 there is literally nothing left in the ResNet setup that the authors haven't checked. So I refer you to this excellent paper.
@ducha-aiki thanks a lot for your prompt answer and for the recommended paper. I saw in your implementation that you are not using the Scale layer to learn the parameters, is there any reason for not doing so, like in the model posted by Kaiming He?
@1adrianb I use it...sometimes. Here are tests of the batchnorm and some of them use scale bias, while others not. Reason - test them all :) https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md
@ducha-aiki makes sense :) Thanks again for your help!
Now vgg16 with all the tricks is in training (to check whether they help weaker models only).
Hi, have you tried pre-activation resnet described in paper http://arxiv.org/abs/1603.05027?
And what about Inception-ResNet-v2 described in paper http://arxiv.org/abs/1602.07261?
@wangxianliang Well, no and no, and I am not planning to in the nearest future. Reasons: 1) This benchmark is originally a test of "we propose a cool thing and test it on CIFAR-100" papers, or of missing design choices. Both papers you mention already contain a good evaluation on full ImageNet. 2) They are both very time-consuming. Yes, I have tested vggnet and googlenet for reference, but mostly cheaper networks. 3) They have a complex structure... so if you could write a prototxt, this would greatly increase the chances that I run them in the meantime :) Will be grateful for help.
It's a good idea to test the influence of LSUV init-time batch sizes on large networks with highly variant data. It seems in the paper that you've only tested this for some tiny tanh Cifar-10 net with relatively tiny batches, and even then a positive correlation between batch size and performance could be seen. It seems obvious that the influence is much greater for large networks and highly variable data with, say, 1000s of classification outputs. For example, I'm currently training a net for classifying 33.5k mouse-painted symbols. That Cifar-10 data tells me nothing.
Testing 6 different batches on my data:

```
poolconv_2 variance = 0.0515101 mean = 0.206088
poolconv_2 variance = 0.0517261 mean = 0.206058
poolconv_2 variance = 0.0521883 mean = 0.206072
poolconv_2 variance = 0.989368  mean = 0.22629
poolconv_2 variance = 0.995676  mean = 0.226181
poolconv_2 variance = 0.998411  mean = 0.22653
```
One of the reasons why testing batch influence is so important is that, if it really improves things, we can switch from computing LSUV variance via mini-batches to computing it via iterations, and use, say, a 100k batch rather than 1000. And people don't really use tanh much nowadays, so... personally I don't get why you used that one for batch size benchmarking. It tells me nothing, really.
I used tanh because LSUV worked worst for it. My experience with ImageNet confirms that any batch size > 32 is OK for LSUV, if the data is shuffled. Your data looks imbalanced or not shuffled. If you do such a test on your dataset, I will definitely add it to this benchmark :)
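A tiny self-contained illustration of this point (synthetic data, not the actual poolconv_2 activations): with two differently scaled sub-populations, in-order mini-batches give wildly different variance estimates, while shuffling makes them agree:

```python
import numpy as np

rng = np.random.default_rng(0)
# two populations with different scales, mimicking imbalanced / unshuffled classes
data = np.concatenate([rng.normal(0, 0.2, (3000, 64)),
                       rng.normal(0, 1.0, (3000, 64))])

def batch_variances(x, n_batches=6):
    # per-mini-batch variance estimate, as LSUV would see it
    return [float(np.var(b)) for b in np.array_split(x, n_batches)]

unshuffled = batch_variances(data)                  # estimates jump between populations
shuffled = batch_variances(rng.permutation(data))   # estimates stay close together

spread_unshuffled = max(unshuffled) - min(unshuffled)
spread_shuffled = max(shuffled) - min(shuffled)
print(spread_unshuffled, spread_shuffled)
```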
Yes, it actually is not very shuffled. Which means that I have to use a larger batch here to get something like what I would get with a smaller one if it were shuffled.
The problem with benchmarking is that I don't have the required stuff setup on my computer and it would take me several hours to do that.
It would be great if you noted the time it took different setups to converge. Both epochs and actual time.
@ibmua for epochs it is very easy - 320K everywhere :) As for times - it is hard (I am lazy), because some of the trainings consist of lots of save-load cycles with pauses between them. But I am going to do this... once :)
Yeah, I guess that would just take

```python
import time

start = time.time()
# ... training ...
end = time.time()
elapsed = end - start
```

, logging the time spent to a separate file during saves and reading it back during loads. Not too hard. It would give a better sense of the difference between methods. For example, training with momentum is great, but it takes more time.
@ibmua
```python
import time

start = time.time()
end = time.time()
```
Are you really thinking I do it in python? You are underestimating my laziness ;) All I do is edit *.prototxt, then run `caffe train --solver=current_net.prototxt > logfile.log`, then call the https://github.com/BVLC/caffe/blob/master/tools/extra/plot_training_log.py.example script with default keys to get graphs :)
Wow, okay. Guess that would take editing plot_training_log.py.example then.
Actually, easier. Just measure a single forward-backward pass on the side and then multiply by the number of iterations from the log. Having info on whether things converged or you just cut training off because it would take >320k iterations would help too. Also, graphs would be a pleasant addition.
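That back-of-the-envelope estimate would look like this (the per-iteration time here is made up; only the 320K iteration count comes from the thread):

```python
# assumed: seconds for one timed forward-backward pass (e.g. from `caffe time`)
per_iter_s = 0.35
# iteration budget used throughout this benchmark
max_iter = 320_000

# total wall-clock estimate, ignoring save/load pauses
total_h = per_iter_s * max_iter / 3600.0
print(f"estimated wall-clock time: {total_h:.1f} h")
```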
It's interesting to find out how ResNets perform with different activations. PReLU was coauthored by Kaiming He - a coauthor of ResNets - but for some reason he used ReLUs in ResNets.
CReLU is on the way https://arxiv.org/pdf/1603.05201.pdf
How about SELU @ducha-aiki ?
@wangg12 It is already there https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Activations.md - worse than ELU. But maybe I have made a mistake; you can check me at https://github.com/ducha-aiki/caffenet-benchmark/blob/master/prototxt/activations/caffenet128_lsuv_SELU.prototxt#L92
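For anyone checking that prototxt: SELU from the original paper https://arxiv.org/abs/1706.02515 is an ELU with alpha ≈ 1.6733, scaled by ≈ 1.0507. A small NumPy reference to compare against (illustrative, not the repo's code):

```python
import numpy as np

# constants from the Self-Normalizing Neural Networks paper
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    # SCALE * x for x > 0, SCALE * ALPHA * (exp(x) - 1) otherwise
    return SCALE * np.where(x > 0, x, ALPHA * np.expm1(x))

print(selu(np.array([-1.0, 0.0, 1.0])))
```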
@ducha-aiki In the SELU paper they seem to suggest an MSRA-like initialization for the weights, while your prototxt seems to use a fixed Gaussian initialization? Could this affect the result?
@simbaforrest I don't use fixed initialization, I use LSUV init https://arxiv.org/abs/1511.06422, which is stated on each page of this repo :) What is in the prototxt does not matter, because LSUV is run as a separate script. LSUV uses data to adjust the weight values, so if MSRA is optimal, LSUV will output something similar to MSRA.
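For a single linear layer, the LSUV rescaling step can be sketched in plain NumPy like this (shapes and tolerance are illustrative; for a linear layer it converges in one step, and real nets need the iteration only because of the nonlinearities between layers):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))   # pre-initialized weights (LSUV starts from orthonormal)
x = rng.normal(size=(1000, 128))  # one data batch

for _ in range(10):               # iterate until the layer output has unit variance
    var = float(np.var(x @ W.T))
    if abs(var - 1.0) < 1e-3:
        break
    W /= np.sqrt(var)             # rescale weights by the sqrt of the output variance

print(float(np.var(x @ W.T)))
```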
Got it, so I guess SELU might only be good for FNNs rather than CNNs?
Well, it is still possible that SELU is very architecture-dependent.
- Pooling: AVG-pooling caffenet; "Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree" http://arxiv.org/abs/1509.08985 -- all three from the paper, thanks to the authors for the code.
- Weight decay values, L1\L2 weight decay, dropout rates.
- Freeze conv structure and play with the fc6-fc8 classifier. Maxout? More layers? Convolution? Inspired by http://arxiv.org/abs/1504.06066, but in end-to-end style.
- Solvers: default caffenet + ADAM\RMSProp\Nesterov\"poly" policy.
- BatchNorm for blocks of layers, not for each layer.
- For fully convolutional nets, what is better - avg pool on features, then classifier, or the other way round?
- SqueezeNet https://github.com/DeepScale/SqueezeNet

Suggestions and training logs from the community are welcome.