Dropout's sampling interferes with second-order batch methods like L-BFGS-B, but dropout is probably important.
Sida Wang and Christopher Manning's "Fast Dropout Training" might be optimal. In the meantime, the variant of an L2 penalty they suggest between Eqn 9 and Eqn 10 is easy: instead of the penalty for weight ij being proportional to its square alone, multiply that square by a factor c, which is roughly the variance of the input variable multiplied by that weight.
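A minimal sketch of that variance-scaled penalty, assuming c for weight ij is taken as the empirical variance of input feature i and that the dropout-rate factor has the usual p/(1-p) form (both assumptions here; Wang and Manning derive the exact scaling in the paper):

```python
import numpy as np

def dropout_l2_penalty(W, X, dropout_p=0.5):
    """Variance-scaled L2 penalty approximating dropout regularization.

    Instead of a uniform lambda * W**2, each squared weight W[i, j] is
    scaled by c[i] -- here taken to be the empirical variance of input
    feature i (an assumption; see Wang & Manning, between Eqn 9 and 10).
    """
    c = X.var(axis=0)                      # per-feature variance, shape (n_features,)
    scale = dropout_p / (1.0 - dropout_p)  # assumed dropout-rate factor
    return scale * np.sum(c[:, None] * W**2)

# Usage sketch: features with higher variance get a larger penalty
# on their outgoing weights.
X = np.random.randn(100, 5)  # 100 examples, 5 input features
W = np.random.randn(5, 3)    # weights mapping 5 features to 3 outputs
penalty = dropout_l2_penalty(W, X)
```

Unlike a plain L2 penalty, this one adapts per feature, which is the point: dropout penalizes weights on high-variance inputs more heavily.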
Also, batch normalization might solve some of the same problems?