Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler

Tensor shape error in train_model.py #111

Closed by jmdelvecchio 1 year ago

jmdelvecchio commented 1 year ago

Data input: RGB JPEGs created from Sentinel-2 imagery downloaded from Google Earth Engine and converted to JPEG via GDAL. Labeled in Doodler and exported with gen_images_and_labels.py. Binary classes.

Running on: Linux cluster in an interactive session, K80 GPUs. The Gym repo was recently updated.

Error: Like in issue #103, I get "Cannot batch tensors with different shapes in component 0. First element had shape [512,512,1] and element 1 had shape [512,512,3]." After removing 12 images from the dataset the model trains normally. However, the "offending" images are all three-band JPEGs.
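
(A quick way to double-check the band counts, for anyone trying to reproduce this; the folder path and extension below are placeholders, not the actual dataset layout:)

```python
# Sketch only: print the shape and band count of each image file
# ("images/*.jpg" is a placeholder path)
import glob
import numpy as np
from PIL import Image

for f in sorted(glob.glob("images/*.jpg")):
    arr = np.array(Image.open(f))
    bands = 1 if arr.ndim == 2 else arr.shape[-1]
    print(f, arr.shape, f"{bands} band(s)")
```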

Config file: { "TARGET_SIZE": [512,512], "MODEL": "resunet", "NCLASSES": 2, "KERNEL":7, "STRIDE":2, "BATCH_SIZE": 16, "FILTERS":6, "N_DATA_BANDS": 3, "DROPOUT":0.1, "DROPOUT_CHANGE_PER_LAYER":0.0, "DROPOUT_TYPE":"standard", "USE_DROPOUT_ON_UPSAMPLING":false, "DO_TRAIN": true, "LOSS":"cat", "PATIENCE": 10, "MAX_EPOCHS": 100, "VALIDATION_SPLIT": 0.6, "RAMPUP_EPOCHS": 20, "SUSTAIN_EPOCHS": 0.0, "EXP_DECAY": 0.9, "START_LR": 1e-7, "MIN_LR": 1e-7, "MAX_LR": 1e-4, "FILTER_VALUE": 0, "DOPLOT": true, "ROOT_STRING": "batch_size_16", "USEMASK": false, "AUG_ROT": 0.05, "AUG_ZOOM": 0.05, "AUG_WIDTHSHIFT": 0.05, "AUG_HEIGHTSHIFT": 0.05, "AUG_HFLIP": true, "AUG_VFLIP": false, "AUG_LOOPS": 10, "AUG_COPIES": 5, "TESTTIMEAUG": false, "SET_GPU": "0,1,2,3", "DO_CRF": true, "SET_PCI_BUS_ID": true, "TESTTIMEAUG": true, "WRITE_MODELMETADATA": true, "OTSU_THRESHOLD": true }

Dataset: delvecchio_images_and_labels.zip

FWIW: Even when I do manage to train a model on this data the results are trash šŸ™ƒ

Hypotheses: Does Gym hate black "NoData" in my images? Doodler will try to segment the black NoData space around my images and I wonder if that's confusing the model too.

jmdelvecchio commented 1 year ago

Oh, sorry: the files with 8100197360 in the name are the ones I removed, and after that training worked.

ebgoldstein commented 1 year ago

OK, one of us can take a look... (I can't get to it right now... might take me a few days, sorry)

dbuscombe-usgs commented 1 year ago

I think I have seen this error before, when some of my images had only one class labeled.

Therefore, in a given model batch, it is possible to have a mix of W x H x 3 and W x H x 1 tensors, because the label stack only gets a band for each class that is actually present in that tile.

Could that be the issue for you @jmdelvecchio ?

If I'm right, we will need to think of a workaround.
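
A quick way to flag such files would be something like this sketch (the label folder, extension, and NCLASSES value here are placeholders, not Gym internals):

```python
# Sketch: flag label files that contain fewer classes than NCLASSES
# (paths, extension, and NCLASSES are placeholders)
import glob
import numpy as np
from PIL import Image

NCLASSES = 2

for f in sorted(glob.glob("labels/*.png")):
    lab = np.array(Image.open(f))
    present = np.unique(lab)
    if len(present) < NCLASSES:
        print(f"{f}: only class value(s) {present.tolist()} present")
```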

jmdelvecchio commented 1 year ago

@dbuscombe-usgs that is definitely the case. When I chop up my watersheds, it's easy to have one of the tiles covered by a single class, especially since I'm doing binary labels.

dbuscombe-usgs commented 1 year ago

I'll download your images and labels and see if I can come up with a workaround. It may require a modification to make_datasets.

dbuscombe-usgs commented 1 year ago

Ok, I made some progress I think ...

First, I made a class-balanced subset (new_images.zip, new_labels.zip) using this script: balanced_labels.zip

There are 60 images with all 3 classes present. I used gen_overlays_from_images_and_labels.py from doodler/utils to generate some overlays, and they look OK (I think): new_overlays.zip

I ran make_datasets, then do_train with your config file, on a 2080 Ti with 12 GB of memory. However, the loss went immediately to 'nan'. Looking at the config file, I realized that 'cat' loss was specified with a learning rate that is too low for 'cat'. I switched to 'dice'; dice works best for most problems, but if you want to use cat, use a larger LR. This time I get finite losses, phew! Here's the modified config file: config.zip
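
(For anyone wondering what the 'dice' option refers to, a Dice loss looks roughly like the sketch below. This is an illustration, not necessarily Gym's exact implementation; it assumes one-hot y_true and softmax y_pred of shape (batch, H, W, NCLASSES).)

```python
# Rough sketch of a Dice loss (illustration only, not necessarily Gym's implementation)
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    # y_true: one-hot labels, y_pred: softmax probabilities, both (batch, H, W, NCLASSES)
    y_true = tf.cast(y_true, y_pred.dtype)
    intersection = tf.reduce_sum(y_true * y_pred, axis=(1, 2))
    denom = tf.reduce_sum(y_true + y_pred, axis=(1, 2))
    dice = (2.0 * intersection + smooth) / (denom + smooth)
    return 1.0 - tf.reduce_mean(dice)
```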

The model did converge, and the result isn't horrible, but there's definitely room for improvement...

modelOut_trainhist_24

first 20 validation sample outputs modelOutval.zip

modelOut_val_466 modelOut_val_465

Model weights and training history weights_n_hist.zip

While training, it was using only 10.5 GB with a batch size of 24.

(Also, just curious why you are using 4 GPUs. It's a small dataset, target size, and batch size, so I would have assumed it would easily fit on one K80, which has 24 GB of memory, right?)

jmdelvecchio commented 1 year ago

First off, šŸ™ thank you šŸ™

Second, wow, those validations look awesome! I think I copied a config file from one of the Zoo models in an embodiment of šŸ¤·ā€ā™€ļø, but I definitely have a lot to learn about what the different loss functions mean and when they are appropriate.

So if there's a chance that an image only has a single class, we won't want to include it in the training data? And is it best practice to run balanced_labels.py before training? (For the purposes of model accuracy, does that bias the model too much? My gut says it's important to tell the model "look at all these things that aren't water tracks." Maybe a solution is an additional "not water tracks" class?)

Re: 4 GPUs: in my defense, I had initially labeled 260 images and tried to train a model on them all at once, which is when I hit the shape error and cut back. But really the answer is that I just learned how to request multiple GPUs and that there were four of them; my understanding of all this is still kind of "gpu go brrrr".

dbuscombe-usgs commented 1 year ago

Cool, happy to help out :shamrock:. Yeah, we only recently found out about the 'nan' losses when using categorical cross-entropy, and I think we always use dice for everything.

There is a utility called make_class_balanced_subset.py as an example of creating a more class-balanced dataset, but I think we should leave it up to the user how to prepare imagery, because it varies case by case.
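
The gist of that kind of utility is roughly the sketch below (illustration only, with placeholder paths and extensions; not the actual make_class_balanced_subset.py):

```python
# Sketch: copy only image/label pairs whose label contains all classes
# (placeholder folder layout and extensions)
import glob, os, shutil
import numpy as np
from PIL import Image

NCLASSES = 2
os.makedirs("balanced/images", exist_ok=True)
os.makedirs("balanced/labels", exist_ok=True)

for lab_file in sorted(glob.glob("labels/*.png")):
    lab = np.array(Image.open(lab_file))
    if len(np.unique(lab)) >= NCLASSES:  # keep only tiles with every class present
        img_file = os.path.join("images", os.path.basename(lab_file))
        shutil.copy(lab_file, "balanced/labels/")
        if os.path.exists(img_file):
            shutil.copy(img_file, "balanced/images/")
```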

I agree that we should be able to handle batches of label tensors of different depths, i.e., containing bands associated with differing numbers of classes within the batch.

I think the best fix would perhaps be to modify https://github.com/Doodleverse/segmentation_gym/blob/main/make_nd_dataset.py#L339 so it forces the labels to have a depth of NCLASSES. We would need to figure out which classes are present and missing, and pad a pre-allocated label array accordingly. I should be able to play with that in the next couple of days and hopefully post a solution.
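
Roughly something like this sketch (illustration only, not the actual change to make_nd_dataset.py; it assumes the label is a 2D array of integer class indices):

```python
# Sketch: pad one-hot labels to a fixed depth of NCLASSES
# (illustration only, not the actual make_nd_dataset.py change)
import numpy as np

def onehot_full_depth(label, nclasses):
    """label: 2D array of integer class indices in 0..nclasses-1.
    Returns H x W x nclasses, with all-zero bands for any missing class."""
    h, w = label.shape
    out = np.zeros((h, w, nclasses), dtype=np.uint8)
    for k in range(nclasses):
        out[..., k] = (label == k).astype(np.uint8)
    return out
```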

Cool, no worries about the GPUs. I was mostly concerned that if you were distributing a relatively small memory ask across 4 GPUs, the performance would very likely go down because of the overhead of keeping all those GPUs fed with data. But I mostly just think that I know this kind of stuff, and everything I know I learned from the keras manual or @ebgoldstein, so I'm in the "gpu go brrrr" club too. They actually act as handwarmers in my office because I'm basically making Gym models full time right now :smile:

dbuscombe-usgs commented 1 year ago

Hi @jmdelvecchio, please let me know when you've had a chance to test this, so we can close this issue. Thanks!

jmdelvecchio commented 1 year ago

Yes, apologies: your balanced_labels script did in fact fix this issue, and I ran it on my larger dataset with no problems.

dbuscombe-usgs commented 1 year ago

Cool, thanks @jmdelvecchio