albumentations-team / albumentations

Fast and flexible image augmentation library. Paper about the library: https://www.mdpi.com/2078-2489/11/2/125
https://albumentations.ai
MIT License
14.2k stars 1.65k forks

[TensorFlow] Failed to get reproducible trainings with albumentations included to the data pipeline #906

Closed roma-glushko closed 3 years ago

roma-glushko commented 3 years ago

🐛 Bug

I could not get my training to work reproducibly when albumentations is added to the data pipeline. I followed this thread https://github.com/albumentations-team/albumentations/issues/93 and fixed all possible seeds, so overall the snippet that should have enabled reproducible experiments looks like this:

import os
import random

import numpy as np
import tensorflow as tf

def set_random_seed(seed: int = 42):
    """
    Globally fix all possible sources of randomness to keep experiment reproducible 
    """
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'

Unfortunately, this does not give me reproducible results. I have executed the training process 6 times and got different results every time. You can also see the whole picture in W&B:

Screenshot 2021-05-23 at 12 29 29

Also, I tried to set random.seed() right before passing my batch into a.Compose() pipeline. That did not really help.

However, when I comment out albumentations from my data pipeline or replace it with some pure TF augmentations, my training becomes reproducible.
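
For reference, by "pure TF augmentations" I mean something along these lines (a rough sketch; the exact ops are just stand-ins for the albumentations transforms):

import tensorflow as tf

def augment_image_tf(inputs, labels):
    # TF-native stand-ins for the albumentations transforms; with
    # tf.random.set_seed() fixed globally, these were reproducible
    # between runs in my experiments
    inputs = tf.image.random_flip_left_right(inputs)
    inputs = tf.image.random_flip_up_down(inputs)
    inputs = tf.image.random_brightness(inputs, max_delta=0.2)
    return inputs, labels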

Any clues what's wrong here?

To Reproduce

Steps to reproduce the behavior:

  1. Clone the project state at 0.1.0-bugrep tag:

    git clone --depth 1 --branch 0.1.0-bugrep https://github.com/roma-glushko/rock-paper-scissor
  2. Pull dataset:

    cd data
    kaggle datasets download --unzip frtgnn/rock-paper-scissor
  3. Install project deps:

    poetry install
  4. Uncomment any of the reported augmentations in the config file (they are all commented out in the repo): https://github.com/roma-glushko/rock-paper-scissor/blob/master/configs/basic_config.py

  5. Run training a couple of times; the results differ by a lot:

python train.py

Expected behavior

In order to run experiments that analyze the impact of different ideas and changes, I need my training process to be reproducible.

Environment

  • Albumentations version (e.g., 0.1.8): 0.5.2
  • Python version (e.g., 3.7): 3.8.6
  • OS (e.g., Linux): Ubuntu 20.10
  • How you installed albumentations (conda, pip, source): poetry (pip-like)
  • tensorflow-gpu: 2.5.0 (for the sake of compatibility with RTX 3070 (Ampere arch.))

Additional context

This report is reproduced in a project that is also mentioned in https://github.com/albumentations-team/albumentations/issues/905

The data pipeline is the same for both issues:

def augment_image(inputs, labels, augmentation_pipeline: a.Compose):
    def apply_augmentation(images):
        # tf.numpy_function delivers the batch as a plain numpy array here
        aug_data = augmentation_pipeline(image=images.astype('uint8'))
        return aug_data['image']

    # wrap the numpy-based albumentations call so it can run inside tf.data
    inputs = tf.numpy_function(func=apply_augmentation, inp=[inputs], Tout=tf.uint8)

    return inputs, labels

def get_dataset(
        dataset_path: str,
        subset_type: str,
        augmentation_pipeline: a.Compose,
        validation_fraction: float = 0.2,
        batch_size: int = 32,
        image_size: Tuple[int, int] = (300, 300),
        seed: int = 42
) -> tf.data.Dataset:
    augmentation_func = partial(
        augment_image,
        augmentation_pipeline=augmentation_pipeline,
    )

    dataset = image_dataset_from_directory(
        dataset_path,
        subset=subset_type,
        class_names=class_names,
        validation_split=validation_fraction,
        image_size=image_size,
        batch_size=batch_size,
        seed=seed,
    )

    return dataset \
        .map(augmentation_func, num_parallel_calls=AUTOTUNE) \
        .prefetch(AUTOTUNE)
BloodAxe commented 3 years ago

Do you observe the same behavior when not using any augmentations?

PS: usually you don't want to apply augmentations at the validation stage.

PPS: Pytorch is better.


roma-glushko commented 3 years ago

@BloodAxe thank you for the reply!

Do you observe same behavior when not using any augmentations?

No, when I disable augmentations, the pipeline becomes deterministic in my experiments. It overfits and does so in the same way every time I rerun it (each epoch's stats look the same).

PS: usually you don’t want to apply augmentations at validation stage

I will try to disable augmentation for the validation step and let you know how it goes.

PPS: Pytorch is better

I know, I know 😌 In this particular project I use TF because of TF.js: I want to deploy my model as a serverless web app.

roma-glushko commented 3 years ago

@BloodAxe, back to your suggestions: here are 6 training runs with validation augmentation disabled:

Screenshot 2021-05-23 at 15 42 05

I don't see much difference compared to the previous runs with validation augmentation enabled.

Here is how training looks when I disable albumentations completely (7 runs are shown):

Screenshot 2021-05-23 at 16 25 19

On the last plot, this is what I consider to be a reproducible pipeline: all metrics are the same/close at all epochs.

BloodAxe commented 3 years ago

@roma-glushko may I ask for another trial? What if you fix the seed inside the apply_augmentation function? That is to ensure tf.numpy_function does not introduce any unexpected issues with the pseudorandom generator. This test will apply exactly the same set of augmentations, so the results should be identical.

roma-glushko commented 3 years ago

@BloodAxe I saw this usage in the TF-related examples, so I had tried it even before creating the ticket. However, I have just double-checked, and I still see the same non-deterministic picture in W&B:

Screenshot 2021-05-27 at 11 15 09

Just for the record, the function was modified this way:

def augment_image(inputs, labels, augmentation_pipeline: a.Compose, seed: int = 42):
    def apply_augmentation(images):
        random.seed(seed)  # fixing seed
        aug_data = augmentation_pipeline(image=images.astype('uint8'))
        return aug_data['image']

    inputs = tf.numpy_function(func=apply_augmentation, inp=[inputs], Tout=tf.uint8)

    return inputs, labels
Dipet commented 3 years ago

Also try to set numpy.random.seed(seed).

roma-glushko commented 3 years ago

@Dipet It has already been set in the entry point from the very beginning, as I mentioned in the ticket, so I tried adding the line to the augment_image() function as well (which I had not checked before):

def augment_image(inputs, labels, augmentation_pipeline: a.Compose, seed: int = 42):
    def apply_augmentation(images):
        random.seed(seed)
        np.random.seed(seed)

        aug_data = augmentation_pipeline(image=images.astype('uint8'))
        return aug_data['image']

    inputs = tf.numpy_function(func=apply_augmentation, inp=[inputs], Tout=tf.uint8)

    return inputs, labels

Unfortunately, 5 additional runs show that the picture has not changed much:

Screenshot 2021-05-27 at 11 40 48
Dipet commented 3 years ago

Very strange. There are only 2 things in the library that control randomness. Could you describe which transforms you use?

roma-glushko commented 3 years ago

@Dipet sure, all tests were performed with the following configuration of augmentation pipeline:

args['train_augmentation'] = a.Compose([
    a.VerticalFlip(),
    a.HorizontalFlip(),
    a.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.1, brightness_by_max=False),
    a.CoarseDropout(max_holes=20, max_height=8, max_width=8, min_holes=10, min_height=8, min_width=8),
    a.GaussNoise(p=1.0, var_limit=(10.0, 50.0)),
])

args['validation_augmentation'] = a.Compose([])

I kept validation step augmentation-free as @BloodAxe suggested above.
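
As a side note, a standalone sanity check of this pipeline outside of TF could look something like this (just a sketch, not code from the project):

import random

import albumentations as a
import numpy as np

def pipeline_is_deterministic(pipeline: a.Compose, seed: int = 42) -> bool:
    # apply the pipeline twice to the same image with the same seeds
    # and check that the outputs match
    image = np.random.RandomState(0).randint(0, 255, (300, 300, 3), dtype=np.uint8)

    outputs = []
    for _ in range(2):
        random.seed(seed)
        np.random.seed(seed)
        outputs.append(pipeline(image=image)['image'])

    return np.array_equal(outputs[0], outputs[1])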

BloodAxe commented 3 years ago

Hmm. All of a sudden, this issue starts looking more interesting than at the beginning.


Dipet commented 3 years ago

As another check, you could use ReplayCompose and serialize all applied arguments. After that, you could rerun and check whether all the arguments are the same or not.
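
To illustrate what I mean: ReplayCompose stores the applied transforms and their parameters under the 'replay' key, and they can be re-applied to reproduce exactly the same augmentation (a minimal sketch):

import albumentations as a
import numpy as np

pipeline = a.ReplayCompose([
    a.VerticalFlip(),
    a.HorizontalFlip(),
    a.RandomBrightnessContrast(),
])

image = np.random.randint(0, 255, (300, 300, 3), dtype=np.uint8)
first = pipeline(image=image)

# re-apply the recorded transforms to the same image
replayed = a.ReplayCompose.replay(first['replay'], image=image)
assert np.array_equal(first['image'], replayed['image'])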

roma-glushko commented 3 years ago

@Dipet I have noticed an interesting thing. After I switched to ReplayCompose, I started to see less variance in the training loss/accuracy, but the validation metrics still vary by a lot:

Screenshot 2021-05-27 at 12 19 36

In addition, there were new warnings related to GaussNoise and CoarseDropout augmentations:

Epoch 1/10
UserWarning: albumentations.augmentations.transforms.GaussNoise could work incorrectly in ReplayMode for other input data because its' params depend on targets.
  warn(
46/63 [====================>.........] - ETA: 1s - loss: 1.2924 - accuracy: 0.3268
UserWarning: albumentations.augmentations.transforms.CoarseDropout could work incorrectly in ReplayMode for other input data because its' params depend on targets.
63/63 [==============================] - 9s 86ms/step - loss: 1.2859 - accuracy: 0.3304 - val_loss: 1.1152 - val_accuracy: 0.3417
...

So the only change I made was:

args['train_augmentation'] = a.ReplayCompose([  # ReplayCompose() replaced Compose() method
    a.VerticalFlip(),
    a.HorizontalFlip(),
    a.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.1, brightness_by_max=False),
    a.CoarseDropout(max_holes=20, max_height=8, max_width=8, min_holes=10, min_height=8, min_width=8),
    a.GaussNoise(p=1.0, var_limit=(10.0, 50.0)),
])
Dipet commented 3 years ago

Warnings are OK. I was talking about saving the applied arguments. Something like this:

import pickle
import random

import numpy as np

applied_transforms = []

def augment_image(inputs, labels, augmentation_pipeline: a.Compose, seed: int = 42):
    def apply_augmentation(images):
        random.seed(seed)
        np.random.seed(seed)

        # augmentation_pipeline has to be a ReplayCompose so that the applied
        # transforms and their parameters end up under the 'replay' key
        aug_data = augmentation_pipeline(image=images.astype('uint8'))
        applied_transforms.append(aug_data['replay'])
        return aug_data['image']

    inputs = tf.numpy_function(func=apply_augmentation, inp=[inputs], Tout=tf.uint8)

    return inputs, labels

# train
....

# save after train
with open('data.pickle', 'wb') as f:
    pickle.dump(applied_transforms, f)

And after that we could compare applied arguments and transforms.
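
The comparison itself could then be as simple as this (a rough sketch; the file names are placeholders for the pickles from two runs):

import pickle

with open('run_a.pickle', 'rb') as f:
    run_a = pickle.load(f)
with open('run_b.pickle', 'rb') as f:
    run_b = pickle.load(f)

# byte-level comparison as a rough equality check, since the replay
# records may contain numpy arrays
for step, (first, second) in enumerate(zip(run_a, run_b)):
    if pickle.dumps(first) != pickle.dumps(second):
        print(f'step {step}: applied transforms or parameters differ')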

roma-glushko commented 3 years ago

@Dipet yeah, I was just in the process of collecting that information. However, I decided to store each augmentation run in a separate pkl file:

def augment_image(inputs, labels, augmentation_pipeline: a.Compose, seed: int = 42):
    def apply_augmentation(images):
        random.seed(seed)
        np.random.seed(seed)

        aug_data = augmentation_pipeline(image=images.astype('uint8'))

        with open(f'logs/debug/replay-{datetime.datetime.now().timestamp()}.pkl', 'wb') as outfile:
            pickle.dump(aug_data['replay'], outfile)

        return aug_data['image']

    inputs = tf.numpy_function(func=apply_augmentation, inp=[inputs], Tout=tf.uint8)

    return inputs, labels

I hope you are okay with that.

Here is a zip archive with a few files generated by snippet above:

https://drive.google.com/file/d/1lH-YuY4abcVYk12cCwXJm5PAAUd1kXS5/view?usp=sharing

roma-glushko commented 3 years ago

@Dipet have you had a chance to look into the albumentations replay "black box" I shared with you? 😄

Dipet commented 3 years ago

Oh, sorry. It looks like some of the files are corrupted (they have 0 size). And if we are talking about reproducibility, it would be great to have 2 groups of files from 2 independent runs. It also looks like you had a problem with trying to process batches inside the albumentations pipeline. Have you tried to reproduce the results after fixing that issue?

roma-glushko commented 3 years ago

@Dipet glad you wrote back 🙌

I think the fix from #911 greatly mitigated the variance of the metrics. Here is what I can see now:

Screenshot 2021-05-28 at 18 26 40

Currently, the losses and accuracies vary by roughly ±0.01. Is this something we should expect to see?
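
For future readers: if I understand the batch-processing point correctly, the problem was that tf.numpy_function hands the whole (batch, height, width, channels) array to albumentations as a single image. A per-image version of apply_augmentation would look roughly like this (my sketch, not necessarily the exact change behind #911):

def augment_image(inputs, labels, augmentation_pipeline: a.Compose):
    def apply_augmentation(images):
        # run the pipeline on each image of the batch separately
        return np.stack([
            augmentation_pipeline(image=image.astype('uint8'))['image']
            for image in images
        ])

    inputs = tf.numpy_function(func=apply_augmentation, inp=[inputs], Tout=tf.uint8)

    return inputs, labels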

Dipet commented 3 years ago

Looks good. I think the remaining differences are associated with non-determinism in the algorithms and hardware.

roma-glushko commented 3 years ago

@Dipet I believe so. At the very least, I have no augmentations on the validation step, so it seems to have nothing to do with albumentations. In any case, thank you for the support! I appreciate your help ❤️