Prohibitive computational cost: these methods take longer to converge and do not improve on existing techniques.
2. Dropout
Dropout approximately integrates over the model's weights. Applying dropout before every weight layer is mathematically equivalent to an approximation of a probabilistic deep Gaussian process (marginalized over its covariance function parameters): the dropout objective minimizes the KL divergence between an approximate distribution and the posterior of a deep Gaussian process.
Since this posterior is intractable, it is approximated by q(w), a distribution over weight matrices whose columns are randomly set to zero.
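A minimal MC-dropout sketch of this idea, assuming a small PyTorch regression model; the architecture, dropout rate, and the mc_dropout_predict helper are illustrative placeholders, not from the source. Keeping dropout active at test time and averaging several stochastic forward passes corresponds to drawing samples from q(w), with columns of the weight matrices randomly zeroed on each pass.

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    def __init__(self, in_dim=10, hidden=64, out_dim=1, p=0.1):
        super().__init__()
        # Dropout placed before every weight layer, as in the approximation above.
        self.net = nn.Sequential(
            nn.Dropout(p), nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(p), nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=50):
    """Average over stochastic forward passes; each pass is a sample from q(w)."""
    model.train()  # keep the dropout layers stochastic at test time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)  # predictive mean and variance
```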
Pros
Simple implementation
Cons
The dropout rate has to be tuned on the training data, since any sensible approximation to the true Bayesian posterior must depend on the training data.
Why can an ensemble of NNs be expected to produce good uncertainty estimates?
BMA (Bayesian model averaging) amounts to finding the single best model within the hypothesis class, whereas an ensemble performs model combination: it combines several models to obtain a more powerful one.
"Breiman" showed that the generalization error of random forests can be upper bounded by a function of the strength and correlation between individual trees(random forest) hence it is desirable to use a randomization scheme that de-correlates the predictions of the individual models as well as ensures that the individual models are strong(e.g. high accuracy)
Uncertainty estimation