Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler
MIT License

Rewrite augmentation pipeline #81

Open ebgoldstein opened 1 year ago

ebgoldstein commented 1 year ago

In the TF docs from 2.9 on, tf.keras.preprocessing has a deprecation warning: https://www.tensorflow.org/versions/r2.10/api_docs/python/tf/keras/preprocessing

This will impact the make_data script, which relies on this suite of tools (i.e., tf.keras.preprocessing.image.ImageDataGenerator) to make the augmented imagery. See here: https://github.com/Doodleverse/segmentation_gym/blob/c1669a0147236176442df65cb1cf9776a63e49fe/make_nd_dataset.py#L578-L800

In light of this, it seems wise to plan for the moment when we need to convert the augmentation routines to the recommended workflow using tf.keras.utils. The relevant links in the TF documentation can be found at the link above.

ebgoldstein commented 1 year ago

note that this has been discussed: https://github.com/Doodleverse/segmentation_gym/discussions/60

dbuscombe-usgs commented 1 year ago

https://albumentations.ai/docs/api_reference/augmentations/ seems best, especially because we are concerned with environmental imagery, and the functional augs include sun glint, snow, and fog https://albumentations.ai/docs/api_reference/augmentations/functional/

dbuscombe-usgs commented 6 months ago

2024 and this is still a christmas wish

I think I could take this on this year and would base it around:

dataset = tf.keras.utils.image_dataset_from_directory(
    folder,
    labels='inferred',
    label_mode='int',
    class_names=None,
    batch_size=32,
    image_size=TARGET_SIZE,
    shuffle=False,
    seed=None,
    validation_split=None,
    subset=None,
    interpolation="bilinear"
)

mlundine commented 4 months ago

Question: so I am guessing these augmentations get done at the time of training, and new images are not actually saved? I think it would be easier (at least for me) to integrate albumentations by actually saving the augmented images with the rest of the dataset.

dbuscombe-usgs commented 4 months ago

Correct. Gym works by preparing your dataset for you and making batched tensors of augmented data. This is deliberately done so you always know what data is used for training and what for validation. Importantly only the training data is augmented.

I would recommend we eventually modify the make_dataset.py script to use an albumentations-based workflow. But yes, for now you could trial model training by augmenting the imagery first. Note that this would be suboptimal in the long term because it needlessly duplicates image files. So let's put a basic workflow together and then ideally wrap that into the existing Gym workflow.

ebgoldstein commented 4 months ago

Just so we are all on the same page - make_datasets actually creates the augmented images, which are saved as npz files. Then train_model uses those (augmented) npz images to train the model. So images are not augmented 'on the fly' like in many workflows (i.e., preprocessing layers in the model, data generators, etc.), but rather pre-augmented. I recall the biggest reason we did this was efficiency (GPU utilization is always near 100% for me, compared with many 'on the fly' augmentation strategies where GPU utilization is lower, at the expense of more CPU).

@mlundine - I agree that albumentations is the correct way to go. @dbuscombe-usgs - I agree that we don't want to duplicate/save augmented images

dbuscombe-usgs commented 4 months ago

Yes, that's a good summary. Pre-augmentation (as opposed to on the fly) has reproducibility benefits too, in the sense that the augmented data are saved in the "GPU ready" npz format, so it would in theory be possible to assess the distributions of augmented data post hoc, which on-the-fly augmentation does not allow in a reproducible way.
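As a rough illustration of the post-hoc assessment mentioned above, the sketch below walks a folder of npz files and reports a mean and standard deviation per file. It is only a sketch under stated assumptions: the function name `summarize_npz_folder` is hypothetical, and the key `arr_0` is assumed to hold the image array, so check the keys in your own npz files before relying on it.

```python
from pathlib import Path

import numpy as np


def summarize_npz_folder(folder, image_key="arr_0"):
    """Return {filename: (mean, std)} for the image array in each .npz file.

    `image_key` is an assumption about how the arrays were saved;
    inspect `np.load(path).files` to find the real key names.
    """
    stats = {}
    for path in sorted(Path(folder).glob("*.npz")):
        # np.load on an .npz returns a lazy archive; close it when done
        with np.load(path) as data:
            image = data[image_key].astype("float64")
        stats[path.name] = (float(image.mean()), float(image.std()))
    return stats
```

Calling `summarize_npz_folder("path/to/npz_folder")` on the augmented output and again on the non-augmented files would give two sets of statistics that can be compared directly.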

I think we're all interested in albumentations and I'm keen to get it at least as an option in the gym workflow

ebgoldstein commented 4 months ago

@mlundine - just looping back to getting Albumentations working w/o rewriting the augmentation pipeline:

Since we use the deprecated/old-style keras generators, the easiest method is to add a preprocessing function (https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) in 3 easy steps:

  1. adding an import:

    #import albumentations
    import albumentations as A
  2. defining a preprocessing function with your chosen albumentation augs:

    #preprocessing function with albumentations.. example with channel shuffle
    def albumentize(image):
        aug = A.Compose([
            A.ChannelShuffle(),
        ])
        AugI = aug(image=image)['image']

        return AugI
  3. adding a call to the preprocessing function on lines 719-739 of make_dataset.py

so add

preprocessing_function = albumentize, under fill_mode='reflect',

for both generators

hope this helps as a quick way to get Albumentations working! https://github.com/Doodleverse/segmentation_gym/blob/cb13c70d98bc9fe91b51ee5937d2b5cd3c516e6c/make_dataset.py#L719-L749

mlundine commented 4 months ago

Yes I understand the way you guys were doing this now and why.

For just the training set, we have a set of augmentations we can perform. We randomize which augmentation to perform and on which image from the training set, correct?

mlundine commented 4 months ago

Clarifying that more: we don't want duplicates (original image and augmented) in the training set? Or do we want a big training set with all original images plus each augmentation?

Just from experimenting a bit with albumentations, I think the ones we want (at least for satellite imagery) are the color-space alterations and the snow transform (just adding white pixels). The haze transform is kind of dumb, it's just circular blobs of haze. The other one that might be useful is the elastic transform (see attached images for original, color swapping, elastic, and snow). These would be in addition to the more standard augmentations you guys already have (rotations, flips, zooms, etc.). [image: 2022-07-09-22-25-01_RGB_L9.jpg] [image: 2022-07-09-22-25-01_RGB_L9augment2.jpg][image: 2022-07-09-22-25-01_RGB_L9augment99.jpg][image: 2022-07-09-22-25-01_RGB_L9snow.jpg]

mlundine commented 4 months ago

2022-07-09-22-25-01_RGB_L9 2022-07-09-22-25-01_RGB_L9augment2 2022-07-09-22-25-01_RGB_L9augment99 2022-07-09-22-25-01_RGB_L9snow

ebgoldstein commented 4 months ago

Clarifying that more: we don't want duplicates (original image and augmented) in the training set? Or do we want a big training set with all original images plus each augmentation?

The way we wrote it, the training split will be all augmentations, and the val split is all non-augmented images. That being said, all the augmentations are random, so there is a possibility of getting non-augmented (or weakly augmented) images in the training split.

note also that in the config, AUG_COPIES will oversample your training split, so you can give it a bunch of different augmented copies of the training data...
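A minimal sketch of the oversampling idea, for readers who want to see it spelled out. This is not Gym's actual code: `build_training_set` and `random_flip` are hypothetical names, and a random flip stands in for the real augmentation suite. It shows two points from the comment above: each pass over the training data produces a different random copy, and because the augmentation is random some copies come out unchanged.

```python
import numpy as np

AUG_COPIES = 3  # same name as the config setting; the value here is arbitrary


def random_flip(image, label, rng):
    """Stand-in augmentation: apply the same random flips to image and label."""
    if rng.random() < 0.5:  # left-right flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:  # up-down flip
        image, label = image[::-1], label[::-1]
    # With probability 1/4 neither flip fires and the pair is returned unchanged
    return image, label


def build_training_set(images, labels, aug_copies, seed=0):
    """Return aug_copies randomly augmented copies of every training pair."""
    rng = np.random.default_rng(seed)
    out_images, out_labels = [], []
    for _ in range(aug_copies):
        for image, label in zip(images, labels):
            aug_image, aug_label = random_flip(image, label, rng)
            out_images.append(aug_image)
            out_labels.append(aug_label)
    return np.stack(out_images), np.stack(out_labels)
```

With 2 input pairs and `AUG_COPIES = 3`, the result holds 6 image/label pairs. The validation split would bypass this function entirely.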

ebgoldstein commented 4 months ago

I suggest if you want an albumentation version of Gym, feel free to create a branch (locally or on GH)... you could hard code it all in for your personal needs, but it would be awesome if you added variables to the config so that they can be turned on/off globally for everyone eventually

dbuscombe-usgs commented 4 months ago

I agree with Evan. It seems the change he is suggesting here https://github.com/Doodleverse/segmentation_gym/issues/81#issuecomment-2075946108 is simple enough that it could be incorporated into the existing workflow easily (on a new branch). Doodleverse is definitely designed with a broad range of users and use-cases in mind. Perhaps it could be passed a list of the albumentations-style augmentations you'd like. And if the list is empty (the default), it just falls back to the status quo.
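One possible shape for that list-driven option is sketched below. Everything about the interface is an assumption made for illustration: the config key `ALBUMENTATIONS` and the function `build_augmenter` do not exist in Gym. Each transform is built with its default arguments, and albumentations is imported only when the list is non-empty, so it stays an optional dependency.

```python
def build_augmenter(config):
    """Build an albumentations pipeline from a list of transform names.

    Returns None when the list is missing or empty, which the caller
    should treat as "keep the existing augmentation behaviour".
    """
    names = config.get("ALBUMENTATIONS", [])  # hypothetical config key
    if not names:
        return None  # status quo

    import albumentations as A  # lazy import: only needed when requested

    transforms = []
    for name in names:
        if not hasattr(A, name):
            raise ValueError(f"Unknown albumentations transform: {name}")
        transforms.append(getattr(A, name)())  # default arguments only
    return A.Compose(transforms)
```

A config containing `"ALBUMENTATIONS": ["ChannelShuffle"]` would then yield a pipeline whose output could be used inside a preprocessing function like the `albumentize` example earlier in the thread.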

And yes, I have noticed that models tend to train better when presented with original plus augmented training data. There is no data leakage because the validation files are stored in a separate folder and are not augmented. If you wish to test this yourself,

  1. run make_datasets.py, then train_model.py to train a model
  2. delete all the non-augmented data (the files say 'noaug' in the name), then train_model.py again
  3. compare the 2 models

If you wish, you could add a config file parameter that suppresses the use of original imagery in training, but I recommend keeping original+augmentation by default
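A sketch of how such a parameter might work without deleting anything from disk. The function name `select_training_files` and the flag `use_original` are hypothetical; the only detail taken from the thread is that non-augmented files have 'noaug' in their names.

```python
from pathlib import Path


def select_training_files(folder, use_original=True):
    """List the .npz training files, optionally skipping non-augmented ones.

    use_original=True (the recommended default) keeps original plus
    augmented data; False drops every file with 'noaug' in its name.
    """
    files = sorted(Path(folder).glob("*.npz"))
    if use_original:
        return files
    return [f for f in files if "noaug" not in f.name]
```

Filtering the file list this way leaves the dataset folder intact, so the two-model comparison described above can be repeated without re-running make_datasets.py.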