[Open] lichuanx opened this issue 5 years ago
This is a very good point! It uses three samples for each hyper-parameter set in order to average the final performance. One idea for combating overfitting without causing variation in the final performance is to use Batch Normalization instead of Dropout. I should try it and see if it's better. Do you have any thoughts on that?
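For what it's worth, here is a minimal Keras sketch of a conv block regularized by Batch Normalization alone, with the Dropout layer removed. The layer sizes are illustrative, not taken from this repo:

```python
from tensorflow.keras import layers, models

# Conv block regularized by BatchNorm only: Dropout's random masks perturb
# training differently on every run, while BatchNorm applies no such mask,
# so final performance should vary less across runs with the same
# hyper-parameters.
model = models.Sequential([
    layers.Conv2D(16, 3, padding="same", input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```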
I am using both batch norm and dropout on my custom dataset.
Training images = 226, validation images = 40.
The model trains on 1356 images and validates on 40 images. However, the validation score stays at 0.1 on every epoch. Is this normal?
This is my model:
_________________________________________________________________________
Layer (type)                                 Output Shape        Param #
=========================================================================
conv2d_1 (Conv2D)                            (None, 32, 32, 16)     1216
batch_normalization_1 (BatchNormalization)   (None, 32, 32, 16)       64
activation_1 (Activation)                    (None, 32, 32, 16)        0
conv2d_2 (Conv2D)                            (None, 32, 32, 32)    12832
batch_normalization_2 (BatchNormalization)   (None, 32, 32, 32)      128
activation_2 (Activation)                    (None, 32, 32, 32)        0
max_pooling2d_1 (MaxPooling2D)               (None, 16, 16, 32)        0
dropout_1 (Dropout)                          (None, 16, 16, 32)        0
conv2d_3 (Conv2D)                            (None, 16, 16, 64)    18496
batch_normalization_3 (BatchNormalization)   (None, 16, 16, 64)      256
activation_3 (Activation)                    (None, 16, 16, 64)        0
conv2d_4 (Conv2D)                            (None, 16, 16, 64)    36928
batch_normalization_4 (BatchNormalization)   (None, 16, 16, 64)      256
activation_4 (Activation)                    (None, 16, 16, 64)        0
max_pooling2d_2 (MaxPooling2D)               (None, 8, 8, 64)          0
dropout_2 (Dropout)                          (None, 8, 8, 64)          0
flatten_1 (Flatten)                          (None, 4096)              0
dense_1 (Dense)                              (None, 256)         1048832
activation_5 (Activation)                    (None, 256)               0
dropout_3 (Dropout)                          (None, 256)               0
dense_2 (Dense)                              (None, 32)             8224
activation_6 (Activation)                    (None, 32)                0
dropout_4 (Dropout)                          (None, 32)                0
dense_3 (Dense)                              (None, 5)               165
activation_7 (Activation)                    (None, 5)                 0
=========================================================================
Total params: 1,127,397
Trainable params: 1,127,045
Non-trainable params: 352
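For anyone trying to reproduce this: the kernel sizes can be inferred from the param counts (1216 = 5*5*3*16 + 16, so the first two convs are 5x5 on a 32x32x3 input; 18496 = 3*3*32*64 + 64, so the last two are 3x3). Here is a hedged Keras reconstruction, with activations and dropout rates guessed since the summary omits them:

```python
from tensorflow.keras import layers, models

# Reconstruction of the summary above. Kernel sizes are inferred from the
# param counts; activations and dropout rates are assumptions.
model = models.Sequential([
    layers.Conv2D(16, 5, padding="same", input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Conv2D(32, 5, padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D(),        # -> (16, 16, 32)
    layers.Dropout(0.25),
    layers.Conv2D(64, 3, padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Conv2D(64, 3, padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D(),        # -> (8, 8, 64)
    layers.Dropout(0.25),
    layers.Flatten(),             # -> 4096
    layers.Dense(256),
    layers.Activation("relu"),
    layers.Dropout(0.5),
    layers.Dense(32),
    layers.Activation("relu"),
    layers.Dropout(0.5),
    layers.Dense(5),
    layers.Activation("softmax"),
])
model.summary()  # shapes and param counts should match the table above
```

Incidentally, if that 0.1 val score is accuracy, note that random guessing over 5 classes would already give about 0.2, so something other than overfitting may be wrong (labels or preprocessing, for example).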
Well, in my case, I would use SGD instead of an adaptive optimizer, because SGD tends to converge to a flat minimum, which generalizes better. SGD is much slower than adaptive optimizers, so I would also switch the learning-rate schedule to a cosine/cyclical learning rate, which gives a steadier outcome. After all, we only seek relatively better hyper-parameters, not the "best" ones.
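A minimal sketch of that optimizer setup, assuming TF 2.x Keras and reusing the model from above; initial_learning_rate and first_decay_steps are placeholders to tune:

```python
import tensorflow as tf

# SGD with a cyclical cosine-decay learning-rate schedule.
# CosineDecayRestarts gives the cyclical variant; plain
# tf.keras.optimizers.schedules.CosineDecay would be a single cycle.
schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.1,  # placeholder; tune per dataset
    first_decay_steps=1000,     # steps in the first cosine cycle
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```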
Using Dropout in the child_model works great for preventing overfitting; however, it also causes the model's final performance to vary significantly between training runs with the same hyper-parameters. It is so random that we need more sampling runs to estimate the final performance for a single hyper-parameter set, which is very time consuming. Any ideas for solving this problem?
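To put that cost in concrete terms, here is a sketch of the repeated-sampling loop being described; `build_model` and the data arguments are hypothetical placeholders, not from this repo:

```python
import numpy as np
import tensorflow as tf

def evaluate_hparams(build_model, x_train, y_train, x_val, y_val, n_runs=3):
    """Train n_runs times with identical hyper-params and average the final
    validation accuracy. build_model is a hypothetical factory returning a
    compiled Keras model; with Dropout in the model, the std across runs is
    exactly the run-to-run variance discussed above."""
    scores = []
    for seed in range(n_runs):
        tf.random.set_seed(seed)  # vary only the random seed between runs
        model = build_model()
        model.fit(x_train, y_train, validation_data=(x_val, y_val),
                  epochs=20, verbose=0)
        # index 1 assumes the model was compiled with metrics=["accuracy"]
        scores.append(model.evaluate(x_val, y_val, verbose=0)[1])
    return float(np.mean(scores)), float(np.std(scores))
```

Each extra run multiplies the wall-clock cost of evaluating one hyper-parameter set, which is why reducing the variance (and hence n_runs) matters.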