Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler
MIT License

FILTER_VALUE>1 produces issues in make_dataset.py #132

Closed: CameronBodine closed this issue 1 year ago

CameronBodine commented 1 year ago

Describe the bug: When I specify FILTER_VALUE != 1 (e.g. FILTER_VALUE=2) while running make_dataset.py (current version), the examples saved in Output/train_data/train_npzs/noaug_sample show only one class when there should be up to 8, as shown below.

EGN_Substrate_inclShadownoaug_ex0

To Reproduce Steps to reproduce the behavior:

  1. Set FILTER_VALUE=2 in config.json
    Complete config:
{
    "TARGET_SIZE": [
        512,
        512
    ],
    "MODEL": "segformer",
    "NCLASSES": 8,
    "BATCH_SIZE": 30,
    "N_DATA_BANDS": 1,
    "DO_TRAIN": true,
    "PATIENCE": 10,
    "MAX_EPOCHS": 10,
    "VALIDATION_SPLIT": 0.6,
    "FILTERS": 6,
    "KERNEL": 7,
    "STRIDE": 2,
    "LOSS": "dice",
    "DROPOUT": 0.1,
    "DROPOUT_CHANGE_PER_LAYER": 0.0,
    "DROPOUT_TYPE": "standard",
    "USE_DROPOUT_ON_UPSAMPLING": false,
    "ROOT_STRING": "EGN_Substrate_inclShadow",
    "FILTER_VALUE": 2,
    "DOPLOT": true,
    "USEMASK": false,
    "RAMPUP_EPOCHS": 10,
    "SUSTAIN_EPOCHS": 0.0,
    "EXP_DECAY": 0.9,
    "START_LR": 1e-07,
    "MIN_LR": 1e-07,
    "MAX_LR": 0.0001,
    "AUG_ROT": 0,
    "AUG_ZOOM": 0.05,
    "AUG_WIDTHSHIFT": 0.05,
    "AUG_HEIGHTSHIFT": 0.05,
    "AUG_HFLIP": true,
    "AUG_VFLIP": false,
    "AUG_LOOPS": 3,
    "AUG_COPIES": 3,
    "TESTTIMEAUG": false,
    "SET_GPU": "0",
    "DO_CRF": false,
    "SET_PCI_BUS_ID": true,
    "WRITE_MODELMETADATA": true,
    "OTSU_THRESHOLD": true,
    "REMAP_CLASSES": {
        "0": 0,
        "1": 1,
        "2": 2,
        "3": 3,
        "4": 4,
        "5": 5,
        "6": 6,
        "7": 7,
        "8": 0
    }
}

  2. Run make_dataset.py
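
For readers unfamiliar with the REMAP_CLASSES entry in the config above, here is a minimal sketch of what such a mapping implies: source value 8 is folded into class 0, leaving 8 distinct classes (0 to 7), which matches NCLASSES=8. The helper name remap_label and the toy label grid are hypothetical, and this is not the code make_dataset.py actually runs.

```python
# Illustrative only: a pure-Python remap of integer label values.
# REMAP_CLASSES is copied from the config above; everything else is hypothetical.
REMAP_CLASSES = {
    "0": 0, "1": 1, "2": 2, "3": 3, "4": 4,
    "5": 5, "6": 6, "7": 7, "8": 0,
}


def remap_label(label_rows, remap):
    """Return a copy of a 2D label grid with each value remapped.

    Values missing from the mapping are left unchanged.
    """
    lookup = {int(k): int(v) for k, v in remap.items()}
    return [[lookup.get(v, v) for v in row] for row in label_rows]


# Toy 2x3 label grid containing the source value 8.
label = [[0, 3, 8], [8, 7, 1]]
remapped = remap_label(label, REMAP_CLASSES)

# Number of distinct classes that can remain after remapping.
n_classes_after = len(set(REMAP_CLASSES.values()))
print(remapped, n_classes_after)
```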

Expected behavior: I expect to see all the classes, as I do when I specify FILTER_VALUE=1, as shown below.

EGN_Substrate_inclShadownoaug_ex0


I will investigate this further this morning.

CameronBodine commented 1 year ago

I identified the source of the issue:

https://github.com/Doodleverse/segmentation_gym/blob/43edaaf445c21727d2fc4a7658521d655733c3c5/make_dataset.py#L455-L465

In my case, if I change: https://github.com/Doodleverse/segmentation_gym/blob/43edaaf445c21727d2fc4a7658521d655733c3c5/make_dataset.py#L552

to:

final_sum = initial_sum

in order to bypass the check

if final_sum < initial_sum: ### this ambiguity can happen in 0/1 masks (NCLASSES=2)

then the files export as expected.

EGN_Substrate_inclShadownoaug_ex0

Potential fix (just a recommendation): change the if statement to:

if (final_sum < initial_sum) and (NCLASSES==2):
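
A minimal sketch of the proposed guard, isolated as a standalone function so its behavior for binary and multiclass cases can be checked. The function name needs_binary_correction is hypothetical, and whatever make_dataset.py does inside the branch is not reproduced here.

```python
def needs_binary_correction(initial_sum, final_sum, nclasses):
    """Hypothetical helper mirroring the proposed condition.

    The branch is taken only when the sum dropped AND the dataset is
    binary (NCLASSES == 2), so multiclass labels skip it entirely.
    """
    return (final_sum < initial_sum) and (nclasses == 2)


# Binary masks: a drop in the sum still triggers the branch.
print(needs_binary_correction(100, 80, 2))   # True
# Multiclass labels (e.g. NCLASSES=8): the branch is skipped.
print(needs_binary_correction(100, 80, 8))   # False
# Binary masks with no drop: nothing to correct.
print(needs_binary_correction(100, 120, 2))  # False
```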

dbuscombe-usgs commented 1 year ago

Cam, thanks for looking into this function, which is designed for NCLASSES=2. I recently updated it and I don't think the docs have changed for quite a while (apologies).

Please trial your proposed extension to NCLASSES >2 and submit a PR, thanks!

dbuscombe-usgs commented 1 year ago

Thanks for the PR, @CameronBodine. The new version does not work for me on a 5-band dataset, so I will need to spend some time fixing it.

dbuscombe-usgs commented 1 year ago

Reopening because a) I did not update make_dataset.py for N_DATA_BANDS>3 in my latest revision, and b) Cam's PR only applies to N_DATA_BANDS<=3, and the changes break the other cases. For example, [f[0] for f in files] breaks everything because it gets rid of all the additional sets of images.
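
To illustrate why that comprehension is a problem, here is a small sketch. It assumes, purely for illustration, that each entry of files groups one path per image set for a single sample; the real structure in make_dataset.py may differ, and the file names are made up.

```python
# Hypothetical structure: one tuple per sample, one path per image set.
files = [
    ("set1/img_001.tif", "set2/img_001.tif"),
    ("set1/img_002.tif", "set2/img_002.tif"),
]

# Keeps only the first image set for every sample.
first_only = [f[0] for f in files]

# Keeps every image set for every sample.
all_sets = [list(f) for f in files]

# Number of paths silently discarded by the first comprehension.
dropped = sum(len(f) for f in files) - len(first_only)

print(first_only)
print(all_sets)
print(dropped)
```

With only one image set per sample the two forms carry the same paths, which would be consistent with the change working for N_DATA_BANDS<=3 and failing beyond it.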

Therefore, with hindsight, I probably should not have asked Cam to modify. Sorry, Cam.

This really is a very tricky script to modify, and hopefully it will not need too many more changes (!), but let's all get better at testing the N_DATA_BANDS>3 case before making suggestions for improving and using make_dataset.py. (Also, if anyone wants to just rewrite this whole insane workflow so it doesn't use Keras' augmentation options, which are super hard to use, please feel free!!!)

Also, while I'm here, it is NO LONGER the case that label files have to have _label suffix. Ideally, your images and labels have IDENTICAL file names. Otherwise, ensure they natsort the same ... that's the only requirement, other than there being equal numbers of labels and images.
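
A quick way to check those two requirements before running make_dataset.py might look like the sketch below. It uses a small standard-library natural-sort key as a rough stand-in for the natsort package (an assumption, and the two may order unusual names differently), and the function names and file names are hypothetical.

```python
import re


def natural_key(name):
    """Split digit runs from text so 'img10' sorts after 'img2'.

    Rough stand-in for natsort; not guaranteed to match it on every name.
    """
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def pair_images_and_labels(image_names, label_names):
    """Pair images with labels by naturally sorted position.

    Raises ValueError if the two lists differ in length.
    """
    if len(image_names) != len(label_names):
        raise ValueError(
            f"{len(image_names)} images but {len(label_names)} labels"
        )
    images = sorted(image_names, key=natural_key)
    labels = sorted(label_names, key=natural_key)
    return list(zip(images, labels))


# Hypothetical file names, deliberately listed out of order.
images = ["img10.png", "img2.png", "img1.png"]
labels = ["img2.png", "img10.png", "img1.png"]

pairs = pair_images_and_labels(images, labels)
for image, label in pairs:
    print(image, "<->", label)
```

Printing the pairs and looking over them is a cheap sanity check that each image lines up with its intended label.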

Here's an example dataset

I'm now modifying the script so it works with all cases.

dbuscombe-usgs commented 1 year ago

In my revision, I have attempted to simplify things by moving some repurposed code into functions. So, we're still under 1000 lines of code! https://github.com/Doodleverse/segmentation_gym/commit/cf47e63b8565648fca6e33052bfed6b0a1754869

I also managed to simplify the workflow by offloading any move commands to the resize functions. These changes required changes to doodleverse-utils (the much-maligned 1-star repo at the heart of the doodleverse):

https://pypi.org/project/doodleverse-utils/0.0.31/

pip install -U doodleverse-utils
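
After upgrading, the installed version can be confirmed from Python. This is a minimal sketch that assumes 0.0.31, the release linked above, is the first one containing these changes; the helper names are hypothetical and the version parsing only handles plain dotted numbers.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn a plain dotted version such as '0.0.31' into (0, 0, 31)."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


REQUIRED = "0.0.31"  # assumption: the release linked above

found = installed_version("doodleverse-utils")
if found is None:
    print("doodleverse-utils is not installed")
elif version_tuple(found) < version_tuple(REQUIRED):
    print(f"doodleverse-utils {found} is older than {REQUIRED}; upgrade it")
else:
    print(f"doodleverse-utils {found} is recent enough")
```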

CameronBodine commented 1 year ago

Thanks Dan, will give this a test. doodleverse-utils now has 2 stars ;-)