Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler
MIT License

Tensor Shape error when mixing Greyscale + Color images #103

Closed sbosse12 closed 2 years ago

sbosse12 commented 2 years ago

Hi all, when I attempt to train a model classifying oblique coastline imagery into three classes (water, land, sky), I receive the error below:

Epoch 00001: LearningRateScheduler setting learning rate to 1e-07.
Traceback (most recent call last):
  File "X:\Imagery\CamerasOfOpportunity\2022_Ian_Doodleverse\segmentation_gym\train_model.py", line 760, in <module>
    history = model.fit(train_ds, steps_per_epoch=steps_per_epoch, epochs=MAX_EPOCHS,
  File "C:\ProgramData\Anaconda3\envs\gym\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1193, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\ProgramData\Anaconda3\envs\gym\lib\site-packages\tensorflow\python\eager\def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "C:\ProgramData\Anaconda3\envs\gym\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\ProgramData\Anaconda3\envs\gym\lib\site-packages\tensorflow\python\eager\function.py", line 3039, in __call__
    return graph_function._call_flat(
  File "C:\ProgramData\Anaconda3\envs\gym\lib\site-packages\tensorflow\python\eager\function.py", line 1963, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\ProgramData\Anaconda3\envs\gym\lib\site-packages\tensorflow\python\eager\function.py", line 591, in call
    outputs = execute.execute(
  File "C:\ProgramData\Anaconda3\envs\gym\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Cannot batch tensors with different shapes in component 0. First element had shape [768,1024,1] and element 1 had shape [768,1024,3].
     [[node IteratorGetNext (defined at X:\Imagery\CamerasOfOpportunity\2022_Ian_Doodleverse\segmentation_gym\train_model.py:760) ]]
  (1) Invalid argument: Cannot batch tensors with different shapes in component 0. First element had shape [768,1024,1] and element 1 had shape [768,1024,3].
     [[node IteratorGetNext (defined at X:\Imagery\CamerasOfOpportunity\2022_Ian_Doodleverse\segmentation_gym\train_model.py:760) ]]
     [[Shape/_6]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_13417]

Function call stack: train_function -> train_function

Are y'all familiar with this?

Here is my config file for reference, as well as some test images/labels: Ian_3class_test.zip

"TARGET_SIZE": [768, 1024],
"MODEL": "resunet",
"NCLASSES": 3,
"BATCH_SIZE": 7,
"N_DATA_BANDS": 3,
"DO_TRAIN": true,
"PATIENCE": 10,
"MAX_EPOCHS": 100,
"VALIDATION_SPLIT": 0.6,
"FILTERS": 6,
"KERNEL": 9,
"STRIDE": 2,
"LOSS": "dice",
"DROPOUT": 0.1,
"DROPOUT_CHANGE_PER_LAYER": 0.0,
"DROPOUT_TYPE": "standard",
"USE_DROPOUT_ON_UPSAMPLING": false,
"ROOT_STRING": "Hurr_Ian_water_mask",
"FILTER_VALUE": 0,
"DOPLOT": true,
"USEMASK": false,
"RAMPUP_EPOCHS": 20,
"SUSTAIN_EPOCHS": 0.0,
"EXP_DECAY": 0.9,
"START_LR": 1e-7,
"MIN_LR": 1e-7,
"MAX_LR": 1e-4,
"AUG_ROT": 5,
"AUG_ZOOM": 0.05,
"AUG_WIDTHSHIFT": 0.05,
"AUG_HEIGHTSHIFT": 0.05,
"AUG_HFLIP": true,
"AUG_VFLIP": false,
"AUG_LOOPS": 10,
"AUG_COPIES": 5,
"TESTTIMEAUG": false,
"SET_GPU": "0",
"do_crf": true,
"SET_PCI_BUS_ID": true

ebgoldstein commented 2 years ago

this is a Tensor shape error:

Cannot batch tensors with different shapes in component 0. First element had shape [768,1024,1] and element 1 had shape [768,1024,3].

It seems like your images might be a mix of greyscale and color?

Note that the config asks for "N_DATA_BANDS" and expects that to hold for all the images.
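One way to check this before training is to count the bands of every image and compare against "N_DATA_BANDS". The snippet below is a minimal sketch, not part of Gym: the function names and the folder layout are hypothetical, and it assumes Pillow is installed for reading the files.

```python
from pathlib import Path


def band_count(path):
    """Return the number of bands in an image file (1 for greyscale, 3 for RGB)."""
    from PIL import Image  # Pillow; imported lazily so the pure helper below has no dependency

    with Image.open(path) as im:
        return len(im.getbands())


def find_band_mismatches(band_counts, expected_bands):
    """Given {filename: bands}, return the sorted filenames that do not match expected_bands."""
    return sorted(name for name, n in band_counts.items() if n != expected_bands)


def scan_folder(folder, expected_bands, exts=(".jpg", ".jpeg", ".png")):
    """Hypothetical helper: scan one folder of images and list the files with the wrong band count."""
    counts = {
        str(p): band_count(p)
        for p in sorted(Path(folder).iterdir())
        if p.suffix.lower() in exts
    }
    return find_band_mismatches(counts, expected_bands)
```

Calling `scan_folder("path/to/images", 3)` (the path is a placeholder) would list any file that is not 3-band, including palette or RGBA images that look like ordinary color images in a viewer.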

The easiest thing to do here is to make a model using either all greyscale or all color images (you could convert all greyscale -> color, or vice versa), then run makedatasets again and train again.
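The greyscale -> color conversion could be sketched as follows. This is an illustration under assumptions, not Gym code: `convert_folder_to_rgb` is a hypothetical name, Pillow is assumed to be installed, and the function writes to a separate output folder so the originals are untouched. It should only be pointed at the image folder, not the label folder.

```python
import shutil
from pathlib import Path


def convert_folder_to_rgb(src_dir, dst_dir, exts=(".jpg", ".jpeg", ".png")):
    """Write an RGB version of every image in src_dir to dst_dir.

    Images that are already RGB are copied byte-for-byte; everything else
    (greyscale, palette, RGBA, ...) is converted. Returns the converted filenames.
    """
    from PIL import Image  # Pillow; imported lazily

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.suffix.lower() not in exts:
            continue
        with Image.open(path) as im:
            # convert() loads the pixel data, so the result is usable after the file closes
            rgb = None if im.mode == "RGB" else im.convert("RGB")
        out = dst / path.name
        if rgb is None:
            shutil.copy2(path, out)  # avoid re-encoding files that are already RGB
        else:
            rgb.save(out)
            converted.append(path.name)
    return converted
```

Note that saving a converted JPEG re-encodes it, so those files will not be bit-identical to the originals. After converting, makedatasets needs to be rerun on the new folder before training.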

does this make sense?

sbosse12 commented 2 years ago

yes, that does make sense! Good to know. I was thinking it was an error that occurred during the make_nd_datasets phase. I'll convert, try again and report back.

sbosse12 commented 2 years ago

Thanks Evan, we're running now!

ebgoldstein commented 2 years ago

nice!

jmdelvecchio commented 1 year ago

Hey folks, I get the same error message but I do not have grayscale images in my datasets. All RGB JPEGs. I first noticed this issue and with trial and error I found that it would work on a dataset of 100 images, but not >100 images. Then I returned to this a month later and my 100-image dataset (literally the same files) no longer worked (but I had adjusted the batch size in the config file, the only change!). I brought the image number down to 88 and it worked.

This happened both before and after a fresh update of Gym (early November and as we speak).

My only thought is that it's not the size of the dataset at all, but that I'm removing offending images as I shrink it. But like I said, there's no mixing of image types.

ebgoldstein commented 1 year ago

@jmdelvecchio - i think you are right that there are offending images. You could add them in groups to find the problems. Also just keep in mind that if you adjust batch size, you need to rerun makedatasets...
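Adding images in groups can be sketched as a small loop. This is an illustration only: `find_offending_groups` and the `works` callable are hypothetical, and `works` stands in for whatever check is used on a candidate set of files (a cheap shape comparison, or a full rerun of makedatasets plus a short training run).

```python
def find_offending_groups(paths, group_size, works):
    """Add files to a known-good set one group at a time.

    `works(files)` must return True if the given list of files trains (or
    batches) without error. A group is kept only if the good set still
    works after adding it; otherwise the group is set aside as suspect.
    Returns (good_files, suspect_groups).
    """
    good = []
    suspect_groups = []
    for start in range(0, len(paths), group_size):
        group = list(paths[start:start + group_size])
        if works(good + group):
            good.extend(group)
        else:
            suspect_groups.append(group)
    return good, suspect_groups
```

Each suspect group can then be rerun with a smaller `group_size` to narrow it down to individual files. One caveat of this approach: if the very first group happens to contain the odd images, every later group will be flagged instead, so it helps to start from a set that is already known to train.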

Also, do you want to keep discussing this? or reopen this issue or make a new one?

feel free to drop a config file in here, and even send us a link (via email) of the zipped images & labels...

jmdelvecchio commented 1 year ago

I re-ran makedatasets; also just looked over potential "offending" images (I removed 12 images from a single AOI) and nothing strikes me as wrong so perhaps a new issue since it's not greyscale. I'll go and make one now.