Open DaniBodor opened 3 years ago
Hi,
I tried running the beta RetinaNet notebook, but ran into an error in Cell 4. The same dataset works fine for YOLOv2, so I don't think the data itself is the problem.
Note that in Cell 3.3, I changed the location of checkpoints_path, as this was not actually stored in model_path, but directly in the content folder. Help would be very much appreciated!
This is the error message I get:
Hi, thanks for reaching out. The error message itself is not that easy to troubleshoot.
Does the notebook give you an error when using the provided test dataset? Does the notebook give you the error when you do not change "checkpoints_path" ?
Cheers
Guillaume
Does the notebook give you the error when you do not change "checkpoints_path" ?
If I do not change checkpoints_path, Cell 3.3 does not crash, but displays "Checkpoint's path does not exist.", after which Cell 4.1 crashes with an error at line 18: "NameError: name 'configs' is not defined." This makes sense, because configs is only defined in Cell 3.3 if checkpoints_path exists. If I do change the path, Cell 3.3 instead reports (in green): "checkpoints loaded correctly."
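(Editor's sketch, not from the thread: the control flow DaniBodor describes would look roughly like this; the example path and the config-loading line are placeholders, not the notebook's actual code.)

import os

model_path = '/content/gdrive/MyDrive/model'               # hypothetical path
checkpoints_path = os.path.join(model_path, 'checkpoint')  # as set in Cell 3.3

if os.path.exists(checkpoints_path):
    configs = {'checkpoint_dir': checkpoints_path}  # stands in for the real config loading
    print('checkpoints loaded correctly.')
else:
    # 'configs' is never bound on this branch, so any later cell that
    # references it (e.g. Cell 4.1) fails with:
    # NameError: name 'configs' is not defined
    print("Checkpoint's path does not exist.")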
Does the notebook give you an error when using the provided test dataset?
It seems to run fine on the test dataset you've provided (at least it did for the first 6 epochs, after which I stopped it).
I have now re-tested it on my own dataset and get a ResourceExhaustedError (as opposed to the InternalError I got before):
INFO:tensorflow:Writing pipeline config file to /content/gdrive/MyDrive/Colab Notebooks/Models/RetinaNet_Errors_210907/saved_model/config/pipeline.config
Done training data preprocessing.
Done validation data preprocessing.
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-8-aa8a4659be7c> in <module>()
26 start = time.time()
27
---> 28 train(pretrained_model, verbose = Verbose)
29
30 # Displaying the time elapsed for training
<ipython-input-1-0daf2db5b026> in train(model, verbose)
876
877 # Training step (forward pass + backwards pass)
--> 878 total_loss, localization_loss, classification_loss = train_step_fn(image_tensors, gt_boxes_list, gt_classes_list)
879
880
<ipython-input-1-0daf2db5b026> in train_step_fn(image_tensors, groundtruth_boxes_list, groundtruth_classes_list)
828 preprocessed_images = tf.concat(
829 [model.preprocess(image_tensor)[0]
--> 830 for image_tensor in image_tensors], axis=0)
831 prediction_dict = model.predict(preprocessed_images, shapes)
832
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py in wrapper(*args, **kwargs)
204 """Call target, and fall back on dispatchers if there is a TypeError."""
205 try:
--> 206 return target(*args, **kwargs)
207 except (TypeError, ValueError):
208 # Note: convert_to_eager_tensor currently raises a ValueError, not a
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py in concat(values, axis, name)
1767 dtype=dtypes.int32).get_shape().assert_has_rank(0)
1768 return identity(values[0], name=name)
-> 1769 return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
1770
1771
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py in concat_v2(values, axis, name)
1211 return _result
1212 except _core._NotOkStatusException as e:
-> 1213 _ops.raise_from_not_ok_status(e, name)
1214 except _core._FallbackException:
1215 pass
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
6939 message = e.message + (" name: " + name if name is not None else "")
6940 # pylint: disable=protected-access
-> 6941 six.raise_from(core._status_to_exception(e.code, message), None)
6942 # pylint: enable=protected-access
6943
/usr/local/lib/python3.7/dist-packages/six.py in raise_from(value, from_value)
ResourceExhaustedError: OOM when allocating tensor with shape[8,640,640,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat
Hi, thanks, that's very helpful! Regarding your "ResourceExhaustedError", it looks like you are overloading the GPU. Try decreasing the "batch_size" parameter to fix this.
Are you using RGB images?
Cheers
Guillaume
Are you using RGB images?
I'm using 8-bit grayscale PNGs, ~1000x1000 pixels. For YOLO, I noticed that if I export these from ImageJ as RGB with the Fire LUT, I get better results than with grayscale. In RetinaNet, I tried using these RGB versions, but that gave me a different error, which at the time I took to be due to them not being grayscale.
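(For reference, not from the thread: one way to pseudo-color 8-bit grayscale PNGs into RGB in Python, similar in spirit to exporting with a LUT from ImageJ. The filename is a placeholder, and matplotlib's 'hot' colormap only approximates the Fire LUT.)

import numpy as np
from PIL import Image
from matplotlib import cm

# Load an 8-bit grayscale image, scale to [0, 1], and map it through a colormap.
gray = np.asarray(Image.open('cell.png').convert('L'), dtype=np.float32) / 255.0
rgb = (cm.hot(gray)[..., :3] * 255).astype(np.uint8)  # drop the alpha channel
Image.fromarray(rgb).save('cell_rgb.png')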
Regarding your "ResourceExhaustedError", looks like you are overloading the GPU. Try decreasing the "batch_size" parameter to fix this.
I am testing different batch sizes and am getting mixed, inconsistent results. It currently works with batch size 2, but earlier it did not (nor with 4). It also seems to depend on exactly which dataset I use (augmented or not, what I annotate). I will get back to you on this once I can make sense of what's going wrong.
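(Side note, not suggested in the thread: the OOM tensor shape [8,640,640,3] in the traceback corresponds to a batch of eight 640x640 RGB images, so memory use scales with batch_size. Beyond lowering it, a generic TensorFlow 2.x setting that can make intermittent OOMs less dependent on what ran earlier in the session is GPU memory growth.)

import tensorflow as tf

# Must run before the first GPU operation in the runtime: allocate GPU
# memory on demand instead of reserving it all up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)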
It might also be related to this:
Note that in Cell 3.3, I changed the location of checkpoints_path, as this was not actually stored in model_path, but directly in the content folder
I discovered why this happens. The default folder the notebooks are copied to on Drive contains a space: /content/gdrive/MyDrive/Colab Notebooks/. The download_weights function defined in Cell 1 tries to move the checkpoint folder here, but the mv command fails because it treats the space as a separator between two arguments.
I now replace %mv $checkpoint_current_path $model_path
by:
mv_target = '\"' + model_path + '\"'
%mv $checkpoint_current_path $mv_target
and this seems to work without replacing checkpoints_path = os.path.join(model_path,'checkpoint')
as in my original post. I will retry the different things I tried previously with this fix in place and get back to you on whether it's working or not.
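(An alternative workaround, not tested in the thread: do the move in Python instead of via the %mv shell magic, which sidesteps shell quoting entirely. This assumes the same checkpoint_current_path and model_path variables defined in Cell 1.)

import shutil

# shutil.move receives each path as a single Python string, so the space in
# 'Colab Notebooks' needs no quoting; if model_path is an existing directory,
# the checkpoint folder is moved into it, just like mv.
shutil.move(checkpoint_current_path, model_path)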
Hello @DaniBodor,
Would it be possible to share with us some of your images so we can try to reproduce the error, please?
Thanks in advance!
Hi @iarganda, sorry for not getting back to you; I have been busy with something else for the last couple of weeks.
Here are links to the data I used for training:
images: https://drive.google.com/drive/folders/1Pk1swik0rJ_HvSYQ-sNlrv3fB0AbDg9P?usp=sharing
annotations: https://drive.google.com/drive/folders/1k_KIkTbkOmPaYAzTa5oPLsD0DYyS2mZi?usp=sharing
Hi @DaniBodor, it looks like the annotations link is not working anymore. If I could access it, I could try to solve the issue 😁
Thank you!