aleju / imgaug

Image augmentation for machine learning experiments.
http://imgaug.readthedocs.io
MIT License

model training loss and validation loss negatively affected when adding imgaug augmentations. Why? #453

Open cpoptic opened 4 years ago

cpoptic commented 4 years ago

I'm training an instance segmentation model on ~1700 images, 512x512 grayscale using Matterport's Mask RCNN.

When I don't use image augmentation, the training loss decreases nicely from ~3.0 to ~0.2 after 150 epochs, using a fairly reasonable LEARNING_RATE = 0.001 and LEARNING_MOMENTUM = 0.9.

The validation loss decreases to about 1.8 after ~50 epochs, then begins to increase steadily (classic sign of overfitting).

When I add the following sequence of image transformations to the training (see the code below), the validation loss follows a similar trajectory, but the training loss only decreases to ~1.6.

Augmentation is supposed to help prevent overfitting, so one would expect the validation loss to decrease more than it does without augmentation. Furthermore, why would augmentation negatively affect the training loss?

1) Could any of these transformations be preventing the training loss from decreasing further, and if so, which ones?

2) Is there a generally accepted heuristic for choosing specific image transformations when training an instance segmentation model? Obviously the choice of transforms depends a bit on your specific image data, but I was surprised to find no general guidelines anywhere on the interwebs for configuring which image transformations to apply and by how much.

For example, use random scaling, but over what range? (0.9, 1.1)? Or broaden the zooms to (0.8, 1.2)? Use random crops, but with what strength, e.g. percent=(0, 0.1)? Perhaps random horizontal flipping is applicable to your specific image data, but what probability should be passed to imgaug.augmenters.Fliplr()?

3) I would hypothesize that not all image transformations have an equal impact on your model's validation loss. Is there a ranking of which specific transformations have the "biggest bang for the buck"? If so, which ones?

Thanks all.

```python
import imgaug.augmenters as iaa

seq_of_aug = iaa.Sequential([
    # random crops
    iaa.Crop(percent=(0, 0.1)),

    # horizontally flip 50% of the images
    # iaa.Fliplr(0.5),  # Does not make sense for signs

    # Apply Gaussian blur to 40% of the images,
    # with random sigma between 0 and 0.5.
    iaa.Sometimes(0.4,
        iaa.GaussianBlur(sigma=(0, 0.5))
    ),

    # Strengthen or weaken the contrast in each image.
    # (Deprecated in newer imgaug versions in favor of iaa.LinearContrast.)
    iaa.ContrastNormalization((0.75, 1.5)),

    # Add gaussian noise.
    # For 50% of all images, we sample the noise once per pixel.
    # For the other 50% of all images, we sample the noise per pixel AND
    # channel. This can change the color (not only brightness) of the
    # pixels.
    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05 * 255), per_channel=0.5),

    # Make some images brighter and some darker.
    # In 20% of all cases, we sample the multiplier once per channel,
    # which can end up changing the color of the images.
    iaa.Multiply((0.8, 1.2), per_channel=0.2),

    # Apply affine transformations to each image:
    # scale/zoom them from 90% to 110%,
    # translate/move them, rotate them and
    # shear them slightly (-2 to 2 degrees).
    iaa.Affine(
        scale={"x": (0.9, 1.1), "y": (0.9, 1.1)},
        translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)},
        rotate=(-5, 5),
        shear=(-2, 2)
    )
], random_order=True)  # apply augmenters in random order
```

Any suggestions?

aleju commented 4 years ago

Augmentations not only help against memorization, they also make the model more invariant towards specific features in the image. E.g. applying gaussian blur might make the model more invariant (aka robust) towards gaussian blur -- and thereby hopefully also towards other forms of blurring. This can however become a problem if your validation set doesn't contain any blurred images. Then that robustness is pointless and will usually hurt your KPIs. Possible reasons for that could be that the model has to reserve parameters towards achieving such invariance or that the statistics of the validation input data do not reflect the statistics of the training input data (this might especially affect batch normalization). Such a seemingly worse model might still sometimes be preferred as it is hopefully overall more robust (i.e. it has an improved lower bound on the outputs at the price of a slightly worsened upper bound). In academia however this is usually not desired as only the results on the provided validation or test sets matter.

I'm not aware of empirical research that systematically evaluates which augmentations are useful on many datasets. Such research might also not be too useful as the exact choice depends on the characteristics of the training and validation sets. It is probably best to run multiple trainings on your specific dataset, with each training using only exactly one augmentation technique. Then see which ones seem to improve the results.
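
A minimal sketch of that per-augmenter ablation, assuming a hypothetical `train_and_evaluate(augmenter)` helper that runs one training with the given imgaug augmenter (or none) and returns the best validation loss:

```python
import imgaug.augmenters as iaa

# One candidate augmenter per training run; "none" is the unaugmented baseline.
candidate_augmenters = {
    "none": None,
    "fliplr": iaa.Fliplr(0.5),
    "crop": iaa.Crop(percent=(0, 0.1)),
    "blur": iaa.Sometimes(0.4, iaa.GaussianBlur(sigma=(0, 0.5))),
    "noise": iaa.AdditiveGaussianNoise(scale=(0.0, 0.05 * 255), per_channel=0.5),
    "affine": iaa.Affine(scale=(0.9, 1.1), rotate=(-5, 5), shear=(-2, 2)),
}

# train_and_evaluate() is a placeholder for your own training/evaluation code.
results = {name: train_and_evaluate(aug) for name, aug in candidate_augmenters.items()}

# Rank the augmenters by best validation loss (lower is better).
for name, val_loss in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: best validation loss = {val_loss:.3f}")
```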

From experience, there is a high probability that horizontal flips (Fliplr) and random crops help. These also tend to be used most often in papers. The probability of Fliplr can usually be set to 50%. Note, though, that even these augmentations can negatively impact the validation score. E.g. Fliplr will mirror text in an unrealistic way or make it appear that cars are driving on the wrong side of the road.
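
For concreteness, a minimal starting sequence along those lines: horizontal flips at 50% probability plus small random crops. Whether Fliplr is appropriate depends on the data, as noted above; the crop range is taken from the sequence in the question.

```python
import imgaug.augmenters as iaa

baseline_aug = iaa.Sequential([
    iaa.Fliplr(0.5),             # flip 50% of the images horizontally
    iaa.Crop(percent=(0, 0.1)),  # crop away up to 10% on each side
])
```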