flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Training models faster (?) with adam / faster recipes #264

Closed. snakers4 closed this issue 5 years ago.

snakers4 commented 5 years ago

Hi,

Many thanks for your amazing library.

We are building open-source STT for the Russian language. So far we have assembled 350+ hours of annotated speech of varying quality, which we hope to open-source along with our pre-trained models and code.

But it looks like fitting larger models (GLU) is a chore (I saw several tickets here reporting 2-14 days on 4x1080Ti for various levels of convergence). I also assume, given the momentum flag, that you mostly use SGD in your recipes, which is also a common theme in the papers.

We are also trying wav2letter++, as well as writing our own models in plain PyTorch. But now we are facing the problem of fitting larger models, which is largely framework-agnostic. You obviously publish highly optimized hyper-parameters in the recipes section, and understandably they may not fit other languages.

I saw a mention of Adam in some C++ config file in the repo. In my and my friends' experience fitting networks on ImageNet (a somewhat comparable task), Adam always converged 2-4x faster at the price of a slight reduction in performance (3-5 pp).

Many practical "hacks" have also been important in my experience fitting NLP / semantic segmentation models on complex domains, e.g. a good runtime augmentation strategy sometimes cut training time 10x. It is definitely possible to do this with your library (just generate augmentations in advance or change sym-links), but such dynamic pipelines shine best in PyTorch, as sketched below.
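As an illustration, here is a minimal sketch of what such a dynamic pipeline could look like in PyTorch. The dataset wrapper, gain range and noise level are purely hypothetical examples of runtime augmentation, not anything taken from the wav2letter recipes:

```python
# Minimal sketch of on-the-fly waveform augmentation in PyTorch.
# The dataset wrapper and parameter values are hypothetical examples.
import random
import torch
from torch.utils.data import Dataset

class AugmentedSpeechDataset(Dataset):
    def __init__(self, utterances, noise_std=0.005, gain_db=6.0):
        # utterances: list of (waveform_tensor, transcript) pairs, assumed preloaded
        self.utterances = utterances
        self.noise_std = noise_std
        self.gain_db = gain_db

    def __len__(self):
        return len(self.utterances)

    def __getitem__(self, idx):
        wav, transcript = self.utterances[idx]
        # random gain in [-gain_db, +gain_db] dB, applied every epoch anew
        gain = 10 ** (random.uniform(-self.gain_db, self.gain_db) / 20.0)
        wav = wav * gain
        # additive Gaussian noise with 50% probability
        if random.random() < 0.5:
            wav = wav + self.noise_std * torch.randn_like(wav)
        return wav, transcript
```

The point is simply that each epoch sees a different perturbation of the same audio, without regenerating files or touching sym-links on disk.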

Anyway, my question is the following. Since we are pursuing the most practical solution (speed vs. good-enough performance) for a minority language, which is usually more complex than English (at least in NLP) but has close to zero public datasets, it would be very helpful if you could share your experience / recipes for using Adam / augmentations / similar ways of training networks faster in this domain.

We could share our dataset / code / findings as well, if you find it useful.

an918tw commented 5 years ago

@snakers4 You can train with Adam by setting --netoptim=adam --critoptim=adam (the default is sgd). Other flags that you will want to play with are --adambeta1, --adambeta2 and --optimepsilon (and of course --lr and --lrcrit). The default values of those flags are set following the original Adam paper (--adambeta1=0.9 --adambeta2=0.999 --optimepsilon=1e-8). Note that another useful trick for training with Adam is learning rate warmup, which we don't currently support, but is on our todo list.
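For concreteness, the relevant part of a flags file for such a run might look roughly like this. Only the flags named above are taken from the library; the learning-rate values are placeholders, not tuned recommendations:

```
# hypothetical excerpt of a training flags file; lr values are illustrative only
--netoptim=adam
--critoptim=adam
--adambeta1=0.9
--adambeta2=0.999
--optimepsilon=1e-8
--lr=0.0001
--lrcrit=0.0001
```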

snakers4 commented 5 years ago

@an918tw Many thanks for the flags. Have you tried fitting any of the networks from the recipes with Adam? Any tips on their convergence / convergence time?

an918tw commented 5 years ago

@snakers4 No, I have not tried Adam on the networks we provide in the recipes. Past experience using Adam on seq2seq models with a CNN+RNN encoder shows that it can sometimes make the model converge faster. We didn't tune the lr schedule much, and since SGD consistently gives good generalization error, we are sticking with SGD in our experiments for the moment.