barisozmen / deepaugment

Discover augmentation strategies tailored for your dataset
MIT License
244 stars 41 forks

Dropout causes significant performance change between training runs #24

Open lichuanx opened 5 years ago

lichuanx commented 5 years ago

Using Dropout in child_model works well for preventing overfitting. However, it also causes the model's final performance to change significantly between training runs with the same hyper-params. The results are so random that we need more samples to estimate the final performance for one set of hyper-params, which is very time-consuming. Any ideas for solving this problem?

barisozmen commented 5 years ago

This is a very good point! It uses three samples for each different hyper-parameter set, in order to average the final performance. One idea to combat overfitting without causing variation in the final performance is using Batch Normalization instead of Dropout. I should try it and see if it's better. Do you have any idea on that?
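
To make the averaging idea concrete, here is a minimal sketch of scoring one hyper-parameter set over several training runs and reporting the mean together with its standard error. `train_and_evaluate` is a hypothetical stand-in that only simulates a noisy score; it is not DeepAugment's actual training code, and the base score and noise level are made-up numbers.

```python
import random
import statistics


def train_and_evaluate(policy, seed):
    """Hypothetical stand-in for one child-model training run.

    It only simulates a noisy validation score; the 0.80 base value and
    the 0.03 noise level are illustrative assumptions.
    """
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0.0, 0.03)


def estimate_performance(policy, n_samples=3):
    """Average the score of one hyper-parameter set over n_samples runs."""
    scores = [train_and_evaluate(policy, seed) for seed in range(n_samples)]
    mean = statistics.mean(scores)
    # The standard error of the mean shrinks roughly with sqrt(n_samples).
    stderr = statistics.stdev(scores) / n_samples ** 0.5
    return mean, stderr, scores


mean, stderr, scores = estimate_performance(policy=None, n_samples=3)
print(f"mean={mean:.4f} stderr={stderr:.4f} over {len(scores)} runs")
```

If the standard error is large compared with the score gap between two hyper-parameter sets, three samples are probably not enough to rank them reliably.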

yrg23 commented 5 years ago

I am using both batch norm and dropout on my custom dataset.

Training images = 226, validation images = 40

The model trains on 1356 images and validates on 40 images. However, it produces a val score of 0.1 on every epoch. Is this normal?

This is my model:

Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 32, 32, 16)        1216      
batch_normalization_1 (Batch (None, 32, 32, 16)        64        
activation_1 (Activation)    (None, 32, 32, 16)        0         
conv2d_2 (Conv2D)            (None, 32, 32, 32)        12832     
batch_normalization_2 (Batch (None, 32, 32, 32)        128       
activation_2 (Activation)    (None, 32, 32, 32)        0         
max_pooling2d_1 (MaxPooling2 (None, 16, 16, 32)        0         
dropout_1 (Dropout)          (None, 16, 16, 32)        0         
conv2d_3 (Conv2D)            (None, 16, 16, 64)        18496     
batch_normalization_3 (Batch (None, 16, 16, 64)        256       
activation_3 (Activation)    (None, 16, 16, 64)        0         
conv2d_4 (Conv2D)            (None, 16, 16, 64)        36928     
batch_normalization_4 (Batch (None, 16, 16, 64)        256       
activation_4 (Activation)    (None, 16, 16, 64)        0         
max_pooling2d_2 (MaxPooling2 (None, 8, 8, 64)          0         
dropout_2 (Dropout)          (None, 8, 8, 64)          0         
flatten_1 (Flatten)          (None, 4096)              0         
dense_1 (Dense)              (None, 256)               1048832   
activation_5 (Activation)    (None, 256)               0         
dropout_3 (Dropout)          (None, 256)               0         
dense_2 (Dense)              (None, 32)                8224      
activation_6 (Activation)    (None, 32)                0         
dropout_4 (Dropout)          (None, 32)                0         
dense_3 (Dense)              (None, 5)                 165       
activation_7 (Activation)    (None, 5)                 0         
Total params: 1,127,397
Trainable params: 1,127,045
Non-trainable params: 352
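
As a sanity check on the summary above, the parameter counts can be reproduced by hand. The sketch below assumes 5x5 kernels on a 3-channel input for the first two convolutions and 3x3 kernels for the last two. These kernel sizes are not stated in the summary; they are inferred because they are the values that reproduce the listed counts.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    # One kernel_size x kernel_size x in_channels filter plus one bias
    # per output channel.
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def dense_params(in_units, out_units):
    # One weight per input plus one bias, for every output unit.
    return (in_units + 1) * out_units


def batch_norm_params(channels):
    # gamma and beta (trainable) plus moving mean and moving variance
    # (non-trainable) for every channel.
    return 4 * channels


bn_channels = [16, 32, 64, 64]

param_counts = [
    conv2d_params(5, 3, 16),    # conv2d_1 (kernel size and input channels assumed)
    conv2d_params(5, 16, 32),   # conv2d_2 (kernel size assumed)
    conv2d_params(3, 32, 64),   # conv2d_3 (kernel size assumed)
    conv2d_params(3, 64, 64),   # conv2d_4 (kernel size assumed)
    dense_params(8 * 8 * 64, 256),  # dense_1, fed by the 4096-unit flatten
    dense_params(256, 32),      # dense_2
    dense_params(32, 5),        # dense_3
] + [batch_norm_params(c) for c in bn_channels]

total = sum(param_counts)
non_trainable = sum(2 * c for c in bn_channels)  # moving mean and variance
trainable = total - non_trainable

print(total, trainable, non_trainable)
```

Almost all of the weights sit in dense_1 (1,048,832 of the 1,127,397 parameters).
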
lichuanx commented 5 years ago

This is a very good point! It uses three samples for each different hyper-parameter set, in order to average the final performance. One idea to combat overfitting without causing variation in the final performance is using Batch Normalization instead of Dropout. I should try it and see if it's better. Do you have any idea on that?

Well, in my case, I would use SGD instead of an adaptive optimizer, because SGD tends to converge to a flat minimum, which generalizes better. SGD is much slower than adaptive optimizers, so I would change the learning rate schedule to a cosine cyclical learning rate, which should give a steadier outcome. After all, we are only looking for relatively better hyper-params, not the "best" ones.
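
For reference, a cosine cyclical learning rate could look roughly like the sketch below. The function name, the cycle length, and the `lr_max`/`lr_min` values are illustrative assumptions, not settings taken from DeepAugment.

```python
import math


def cosine_cyclical_lr(step, cycle_length, lr_max=0.1, lr_min=0.001):
    """Cosine schedule that restarts at lr_max every cycle_length steps.

    All default values are illustrative assumptions.
    """
    # Position inside the current cycle, in [0, 1).
    position = (step % cycle_length) / cycle_length
    # Decay from lr_max towards lr_min along half a cosine wave.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * position))


for step in range(0, 21, 5):
    print(step, round(cosine_cyclical_lr(step, cycle_length=10), 5))
```

The rate starts each cycle at `lr_max`, decays towards `lr_min`, and jumps back up at the next cycle boundary. The returned value can be passed to an SGD optimizer once per step or per epoch.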