A corollary of this rule is that DL is compute-intensive. Be prepared to train many models when starting a project.
That's a nice way to put it! Basically, we want to highlight that when using DL (as opposed to, e.g., a deterministic KNN algorithm, decision tree, SVM, etc.), the hyperparameter tuning is substantially more expensive, because the "best" value of one hyperparameter typically depends on the architecture and the other hyperparameter choices (see the learning rate example below).
We might want to add something about effective but simple hyperparameter and architecture modifications to experiment with (e.g., dropout, batch normalization). It may also be worth mentioning that adaptive learning rate methods (e.g., Adam) can help save time when determining a viable model architecture (i.e., before even optimizing hyperparameters).
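Just to make that suggestion concrete, here is a minimal sketch of those modifications in PyTorch (the framework choice is an assumption; the thread doesn't prescribe one, and the layer widths are arbitrary):

```python
# Small MLP illustrating the "simple but effective" modifications mentioned
# above: dropout, batch normalization, and the Adam optimizer.
import torch
import torch.nn as nn

class SmallMLP(nn.Module):
    def __init__(self, n_features, n_classes, dropout_p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.BatchNorm1d(128),    # batch normalization
            nn.ReLU(),
            nn.Dropout(dropout_p),  # dropout
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = SmallMLP(n_features=100, n_classes=10)
# Adam adapts per-parameter learning rates, which often makes the first
# "does this architecture train at all?" experiments less sensitive to the
# initial learning rate choice.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```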
Sure :) Maybe we should also add skip connections to the list.
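And a tiny sketch of what a skip connection could look like, continuing the hypothetical PyTorch example above (block width and activation are arbitrary choices):

```python
# Simple residual (skip-connection) block for fully connected layers.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        # The identity path (x) is added back to the transformed path,
        # which helps gradients flow through deeper stacks of layers.
        return self.act(x + self.fc2(self.act(self.fc1(x))))
```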
Everything mentioned seems covered between Tip 5 (WIP, #134) and Tip 3 (#124).
Have you checked the list of proposed rules to see if the rule has already been proposed?
In general, getting DL models to work on even simple, structured datasets requires much more extensive hyperparameter tuning (Koutsoukas et al. 2017) [Could also add a rule saying that a 2-layer multi-layer perceptron is not deep learning ;)] compared to "traditional" machine learning. Hence, it is important to be patient and try many different hyperparameter settings and combinations.
For instance, the previously "best" learning rate might become useless if we add another layer or change the activation function. Hence, extensive tuning and a near-exhaustive search are recommended.
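To make the "don't reuse the old best learning rate" point concrete, here is a hedged sketch of searching over learning rate, depth, and activation jointly (`train_and_evaluate` is a hypothetical helper standing in for "train one model and return a validation score"):

```python
# Near-exhaustive grid search: the learning rate is re-tuned for every
# depth/activation combination instead of being fixed up front.
import itertools

grid = {
    "lr": [1e-2, 1e-3, 1e-4],
    "n_hidden_layers": [1, 2, 3],
    "activation": ["relu", "tanh"],
}

results = []
for lr, n_layers, act in itertools.product(*grid.values()):
    # train_and_evaluate is a hypothetical helper: train one model with this
    # configuration and return its validation score.
    score = train_and_evaluate(lr=lr, n_hidden_layers=n_layers, activation=act)
    results.append(((lr, n_layers, act), score))

best_config, best_score = max(results, key=lambda r: r[1])
```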
Furthermore, it should be stressed that even the same architecture and hyperparameter configuration should be tested multiple times with different random seeds, since random weight initialization can make the difference between convergence and non-convergence. For model selection and evaluation, it is recommended to compare models based on their average performance over at least the top 3 out of 5 runs for a given hyperparameter setting.
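A hedged sketch of that "top 3 out of 5 seeds" comparison, reusing the hypothetical `train_and_evaluate` helper from above (here assumed to also accept a `seed` argument):

```python
# Re-run one hyperparameter configuration with several seeds and average the
# best 3 of 5 validation scores for model selection.
def top3_of_5_score(config, seeds=(0, 1, 2, 3, 4)):
    scores = [train_and_evaluate(seed=s, **config) for s in seeds]
    top3 = sorted(scores, reverse=True)[:3]
    return sum(top3) / len(top3)

config = {"lr": 1e-3, "n_hidden_layers": 2, "activation": "relu"}
avg_score = top3_of_5_score(config)
```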
Any citations for the rule? (peer-reviewed literature preferred but not required)