How to apply data augmentation?

adriangb / scikeras

Scikit-Learn API wrapper for Keras.

https://www.adriangb.com/scikeras/

MIT License

239 stars, 47 forks

How to apply data augmentation? #327

Closed: clstaudt closed this issue 1 month ago

clstaudt commented 1 month ago

It is important that augmentation is applied only to the training data, so that the validation and test data do not contain samples that are augmented copies of training samples.

I did not find a way to apply augmentation during KerasClassifier.fit.

adriangb commented 1 month ago

I'm not sure what you mean by data augmentation. Could you point me to the scikit-learn or Keras docs? Thanks.

clstaudt commented 1 month ago

@adriangb https://www.tensorflow.org/tutorials/images/data_augmentation

In my case, data augmentation would be used to artificially increase the number of training samples by flipping each image, for example.
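To make "flipping each image" concrete, here is a minimal NumPy sketch of this kind of offline augmentation. The array names, shapes, and labels are made up for illustration and are not taken from my actual data.

```python
import numpy as np

# Hypothetical batch: 3 grayscale images of 2x2 pixels, shape (samples, height, width).
X_train = np.arange(12).reshape(3, 2, 2)
y_train = np.array([0, 1, 0])

# Flip each image left-to-right by reversing the width axis.
X_flipped = X_train[:, :, ::-1]

# Append the flipped copies; each copy keeps the label of its original.
X_augmented = np.concatenate([X_train, X_flipped])
y_augmented = np.concatenate([y_train, y_train])

print(X_augmented.shape)  # (6, 2, 2)
```

This doubles the number of training samples without collecting new data.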

The Keras ImageDataGenerator class also supports augmentation, but I assume it is not compatible with scikeras. See https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/

clstaudt commented 1 month ago

You may ask: why not apply augmentation to X_train before passing it to fit?

Because this leads to a form of leakage into the validation split: Suppose an image is in the training split and its flipped version is in the validation split. Then the latter is too easy to predict, making the validation performance metrics look too good.
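A small sketch of the effect, using NumPy only. It assumes the validation split is taken from the last samples of the array, as Keras documents for validation_split. The helper names tail_split and count_leaks are hypothetical, and the "images" are random arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
# Ten hypothetical 4x4 grayscale "images" with random pixel values.
X = rng.random((10, 4, 4))


def tail_split(data, validation_fraction):
    """Hold out the last fraction of samples (assumed to mimic validation_split)."""
    n_val = int(len(data) * validation_fraction)
    return data[:-n_val], data[-n_val:]


def count_leaks(train, val):
    """Count validation images whose vertical flip appears in the training split."""
    train_keys = {img.tobytes() for img in train}
    return sum(np.flipud(img).tobytes() in train_keys for img in val)


# Augment first, then split: flipped copies land in the validation tail.
X_aug = np.concatenate([X, X[:, ::-1, :]])
train_a, val_a = tail_split(X_aug, 0.2)
leaks_augment_first = count_leaks(train_a, val_a)

# Split first, then augment only the training part.
train_b, val_b = tail_split(X, 0.2)
train_b = np.concatenate([train_b, train_b[:, ::-1, :]])
leaks_split_first = count_leaks(train_b, val_b)

print(leaks_augment_first, leaks_split_first)  # 4 0
```

In the first case every validation image is a flipped copy of a training image. In the second case none is.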

adriangb commented 1 month ago

Keras does the validation split internally; it's not something that sklearn is aware of. Would making the preprocessing layers part of the model itself work, as suggested by the first link you included (see "Option 1: Make the preprocessing layers part of your model")? Since the tutorial recommends it, I'd imagine it doesn't lead to any significant data leakage.

clstaudt commented 1 month ago

@adriangb Perhaps it would, I'll have to look at it.

Alternatively, should it be possible to pass the validation dataset to the KerasClassifier constructor?

clstaudt commented 1 month ago

@adriangb Adding the augmentation as layers to the network is indeed a working solution.

from keras import Input, Sequential
from keras.layers import (
    Conv2D, Dense, Dropout, Flatten, MaxPooling2D,
    RandomBrightness, RandomFlip, RandomRotation, RandomTranslation,
)

# input_shape is the (height, width, channels) tuple of the images, defined elsewhere.
cnn_augment = Sequential(
    name="cnn_augment",
    layers=[
        Input(input_shape),

        # augmentation (random layers; Keras applies them only during training)
        RandomFlip(mode="vertical"),
        RandomRotation(factor=0.005, fill_mode="constant", fill_value=0),
        RandomBrightness(factor=0.001),
        RandomTranslation(height_factor=0.00, width_factor=0.02, fill_mode="nearest"),

        # convolution
        Conv2D(32, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Conv2D(128, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Conv2D(256, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Conv2D(512, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),

        # classification head (single sigmoid unit for binary output)
        Flatten(),
        Dense(256, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ]
)
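As far as I understand, this avoids the leakage because the random augmentation layers are only active in training mode. Validation and prediction run the model in inference mode, so those samples pass through unchanged. A minimal check of that behaviour on a single layer, assuming a working Keras installation; the tiny input batch is made up for illustration:

```python
import numpy as np
from keras import layers

# Hypothetical batch of two 4x4 single-channel images.
images = np.arange(32, dtype="float32").reshape(2, 4, 4, 1)

flip = layers.RandomFlip(mode="vertical")

# Inference mode: the layer passes the data through unchanged.
unchanged = np.asarray(flip(images, training=False))

# Training mode: images may be flipped at random, so the result varies between calls.
maybe_flipped = np.asarray(flip(images, training=True))

print(np.array_equal(unchanged, images))  # True
```

The same training flag is what Keras sets when fit runs its training steps, so the augmentation should not touch the validation split that Keras carves out internally.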

adriangb commented 1 month ago

Thank you for coming back and sharing a solution!