HenriquesLab / ZeroCostDL4Mic

ZeroCostDL4Mic: A Google Colab based no-cost toolbox to explore Deep-Learning in Microscopy
Crash when trying to run RetinaNet #140

I tried running the beta RetinaNet notebook, but ran into an error in Cell 4. The same data set works fine for YOLOv2, so I think that this should not be an issue.

Note that in Cell 3.3, I changed the location of checkpoints_path, as the checkpoint was not actually stored in model_path, but directly in the content folder:

  # checkpoints_path = os.path.join(model_path,'checkpoint')
  checkpoints_path = '/content/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint'

Help would be very much appreciated!

This is the error message I get:

InternalError                             Traceback (most recent call last)
InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
Hi, thanks for reaching out. The error message itself is not that easy to troubleshoot.

Does the notebook give you an error when using the provided test dataset? Does the notebook give you the error when you do not change "checkpoints_path"?



Does the notebook give you the error when you do not change "checkpoints_path"?

If I do not change checkpoints_path, then Cell 3.3 does not crash, but displays "Checkpoint's path does not exist.", after which Cell 4.1 crashes with an error in line 18: "NameError: name 'configs' is not defined." This makes sense, because configs is only defined in Cell 3.3 if checkpoints_path exists. If I do change the path, then in Cell 3.3 I instead get (in green): "checkpoints loaded correctly."

Does the notebook give you an error when using the provided test dataset?

It seems to run fine on the test data set you've provided (at least it did for the first 6 epochs and then I stopped it).
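Regarding the NameError described above: one way to make the failure show up where it originates would be to stop as soon as the checkpoint folder is missing, rather than printing a message and continuing. This is only a minimal sketch, not the notebook's actual code; the helper name require_checkpoint_dir is hypothetical, and it assumes the checkpoint is expected in a 'checkpoint' subfolder of model_path, as in the commented-out line from Cell 3.3.

```python
import os


def require_checkpoint_dir(model_path):
    """Return the checkpoint folder inside model_path, or raise if it is missing.

    Hypothetical helper (not part of the notebook): raising here stops
    execution in Cell 3.3 instead of letting Cell 4.1 fail later with
    "NameError: name 'configs' is not defined".
    """
    checkpoints_path = os.path.join(model_path, "checkpoint")
    if not os.path.isdir(checkpoints_path):
        raise FileNotFoundError(
            f"Checkpoint folder not found: {checkpoints_path!r}"
        )
    return checkpoints_path
```

With something like this in place, a missing folder would be reported together with the exact path that was checked, which also makes a wrongly placed download easier to spot.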


I have now re-tested it on my own data set and get a ResourceExhaustedError (as opposed to the InternalError I got before):

INFO:tensorflow:Writing pipeline config file to /content/gdrive/MyDrive/Colab Notebooks/Models/RetinaNet_Errors_210907/saved_model/config/pipeline.config
Done training data preprocessing.
Done validation data preprocessing.
ResourceExhaustedError                    Traceback (most recent call last)
ResourceExhaustedError: OOM when allocating tensor with shape[8,640,640,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat
Hi, thanks, that's very helpful! Regarding your "ResourceExhaustedError", it looks like you are overloading the GPU. Try decreasing the "batch_size" parameter to fix this.

Are you using RGB images?
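To put the batch_size suggestion in numbers: the tensor named in the OOM message has shape [8,640,640,3] and type float, which in TensorFlow means 32-bit (4 bytes per element). The arithmetic below is a rough sketch; the function name tensor_bytes is hypothetical and it only counts this one input batch, not the model weights or activations.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element for float32."""
    return math.prod(shape) * bytes_per_element


# Shape taken from the ResourceExhaustedError message (batch of 8).
failed_batch = tensor_bytes((8, 640, 640, 3))
print(f"{failed_batch / 2**20:.1f} MiB")  # 37.5 MiB

# The same input batch at batch_size 2.
smaller_batch = tensor_bytes((2, 640, 640, 3))
print(f"{smaller_batch / 2**20:.3f} MiB")  # 9.375 MiB
```

A 37.5 MiB allocation is small on its own, so the failure probably means the GPU memory was already almost full when this tensor was requested. If that is the case, the result could depend on what else is occupying the GPU at that moment, and not only on batch_size.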



Are you using RGB images?

I'm using 8-bit grayscale PNGs, about 1000x1000 pixels in size. For YOLO I noticed that if I export these from ImageJ as RGBs with the Fire LUT, I get better results than using grayscale. In RetinaNet, I tried using these as RGB, but that gave me a different error, which I recognized at the time was likely due to them not being grayscale.

Regarding your "ResourceExhaustedError", it looks like you are overloading the GPU. Try decreasing the "batch_size" parameter to fix this.

I am testing different batch sizes and am getting mixed, inconsistent results. It's currently ok using batch size 2, but earlier it was not (also not at 4). It also seems to depend on exactly which dataset I use (augmented, not augmented, what I annotate). I will get back to you on this when I start making sense of what's going wrong.

It might also be related to this:

Note that in Cell 3.3, I changed the location of checkpoints_path, as this was not actually stored in model_path, but directly in the content folder

I discovered why this happens. The default folder where the notebooks are copied to on Drive contains a space: /content/gdrive/MyDrive/Colab Notebooks/. The download_weights function defined in Cell 1 tries to move the checkpoint folder here, but the mv command fails because the shell reads the space as a separator and treats the rest of the path as an extra argument. I now replace %mv $checkpoint_current_path $model_path with:

  mv_target = '\"' + model_path + '\"'
  %mv $checkpoint_current_path $mv_target

and this seems to work without replacing checkpoints_path = os.path.join(model_path,'checkpoint') as in my original post. I will retry the different things I tried previously with this in place and let you know whether it works.
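An alternative to quoting the path for %mv would be to do the move in Python, where the destination is passed as a single string and a space in "Colab Notebooks" is never split into separate arguments. This is a sketch only and not what download_weights currently does; the helper name move_checkpoint is hypothetical, while the two parameter names are taken from the %mv line above.

```python
import os
import shutil


def move_checkpoint(checkpoint_current_path, model_path):
    """Move the checkpoint folder into model_path and return its new location.

    Hypothetical replacement for `%mv $checkpoint_current_path $model_path`.
    No shell is involved, so spaces in either path need no quoting.
    """
    # Create the destination first so the folder is moved *into* it,
    # which matches what mv does when the target directory exists.
    os.makedirs(model_path, exist_ok=True)
    return shutil.move(checkpoint_current_path, model_path)
```

shutil.move also works when source and destination are on different file systems (it falls back to copying and then deleting the source), which should cover a move from /content to the mounted Drive.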

Hello @DaniBodor,

Would it be possible to share with us some of your images so we can try to reproduce the error, please?

Thanks in advance!

Hi @iarganda, sorry for not getting back to you. Have been busy with something else for the last couple of weeks.

Here are links to the data I used for training:
images: https://drive.google.com/drive/folders/1Pk1swik0rJ_HvSYQ-sNlrv3fB0AbDg9P?usp=sharing
annotations: https://drive.google.com/drive/folders/1k_KIkTbkOmPaYAzTa5oPLsD0DYyS2mZi?usp=sharing

Hi @DaniBodor, it looks like the annotations link is not working anymore. If I could access it, I could try to solve the issue 😁

Thank you!