HenriquesLab / ZeroCostDL4Mic

ZeroCostDL4Mic: A Google Colab based no-cost toolbox to explore Deep-Learning in Microscopy

Crash when trying to run RetinaNet #140

Open · DaniBodor opened this issue 3 years ago

DaniBodor commented 3 years ago

Hi,

I tried running the beta RetinaNet notebook, but ran into an error in Cell 4. The same data set works fine for YOLOv2, so I don't think the data itself is the issue.

Note that in Cell 3.3, I changed the location of checkpoints_path, as the checkpoint was not actually stored in model_path, but directly in the content folder:

#checkpoints_path = os.path.join(model_path,'checkpoint')
  checkpoints_path = '/content/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint'
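
For completeness, a sketch of a more general version of this workaround (it assumes model_path is defined as in Cell 3; the hard-coded path is simply where the weights ended up in my runtime):

  import os

  # Sketch: use whichever candidate checkpoint location actually exists; the
  # second path is where download_weights() left the weights in my runtime.
  candidate_paths = [
      os.path.join(model_path, 'checkpoint'),
      '/content/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint',
  ]
  checkpoints_path = next((p for p in candidate_paths if os.path.exists(p)), None)
  print('Using checkpoints_path:', checkpoints_path)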

Help would be very much appreciated!

This is the error message I get:

InternalError                             Traceback (most recent call last)
<ipython-input-7-a129353f211e> in <module>()
     22 start = time.time()
     23 
---> 24 train(pretrained_model, verbose = Verbose)
     25 
     26 # Displaying the time elapsed for training

11 frames
<ipython-input-1-0daf2db5b026> in train(model, verbose)
    742 
    743 def train(model, verbose = True):
--> 744   train_image_tensors, gt_classes_one_hot_tensors, gt_box_tensors = prepare_data_to_train(augmented_training_source, df_anno, gt_boxes, gt_classes)
    745   print('Done training data preprocessing.')
    746 

<ipython-input-1-0daf2db5b026> in prepare_data_to_train(training_img_path, df, data_gt_boxes, data_gt_classes)
    729     img = cv2.cvtColor(img,cv2.COLOR_GRAY2RGB)
    730     train_image_tensors.append(tf.expand_dims(tf.convert_to_tensor(
--> 731       img, dtype=tf.float32), axis=0))
    732 
    733     predicted_classes = np.zeros(shape=[data_gt_boxes[index].shape[0]], dtype=np.int32)

/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py in wrapper(*args, **kwargs)
    204     """Call target, and fall back on dispatchers if there is a TypeError."""
    205     try:
--> 206       return target(*args, **kwargs)
    207     except (TypeError, ValueError):
    208       # Note: convert_to_eager_tensor currently raises a ValueError, not a

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2_with_dispatch(value, dtype, dtype_hint, name)
   1429   """
   1430   return convert_to_tensor_v2(
-> 1431       value, dtype=dtype, dtype_hint=dtype_hint, name=name)
   1432 
   1433 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2(value, dtype, dtype_hint, name)
   1439       name=name,
   1440       preferred_dtype=dtype_hint,
-> 1441       as_ref=False)
   1442 
   1443 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/profiler/trace.py in wrapped(*args, **kwargs)
    161         with Trace(trace_name, **trace_kwargs):
    162           return func(*args, **kwargs)
--> 163       return func(*args, **kwargs)
    164 
    165     return wrapped

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1564 
   1565     if ret is None:
-> 1566       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1567 
   1568     if ret is NotImplemented:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py in _default_conversion_function(***failed resolving arguments***)
     50 def _default_conversion_function(value, dtype, name, as_ref):
     51   del as_ref  # Unused.
---> 52   return constant_op.constant(value, dtype, name=name)
     53 
     54 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    270   """
    271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 272                         allow_broadcast=True)
    273 
    274 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    281       with trace.Trace("tf.constant"):
    282         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 283     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    284 
    285   g = ops.get_default_graph()

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    306 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    307   """Creates a constant on the current device."""
--> 308   t = convert_to_eager_tensor(value, ctx, dtype)
    309   if shape is None:
    310     return t

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
    104       dtype = dtypes.as_dtype(dtype).as_datatype_enum
    105   ctx.ensure_initialized()
--> 106   return ops.EagerTensor(value, ctx.device_name, dtype)
    107 
    108 

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
guijacquemet commented 3 years ago

Hi, thanks for reaching out. The error message itself is not that easy to troubleshoot.

Does the notebook give you an error when using the provided test dataset? Does the notebook give you the error when you do not change "checkpoints_path"?

Cheers

Guillaume

DaniBodor commented 3 years ago

Does the notebook give you the error when you do not change "checkpoints_path"?

If I do not change checkpoints_path, Cell 3.3 does not crash but displays "Checkpoint's path does not exist.", after which Cell 4.1 crashes at line 18 with "NameError: name 'configs' is not defined." This makes sense, because configs is only defined in 3.3 if checkpoints_path exists. If I do change the path, Cell 3.3 instead reports (in green): "checkpoints loaded correctly."
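
For what it's worth, a small guard like the sketch below (my own addition, not part of the notebook) would turn that later NameError into an immediate, clearer failure in Cell 3.3:

  # Sketch: fail early in Cell 3.3 with a clear message instead of letting
  # Cell 4.1 crash later with "NameError: name 'configs' is not defined".
  if not os.path.exists(checkpoints_path):
      raise FileNotFoundError('Checkpoint path not found: ' + checkpoints_path)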

Does the notebook give you an error when using the provided test dataset?

It seems to run fine on the test data set you've provided (at least it did for the first 6 epochs, after which I stopped it).

I have now re-tested it on my own data set and get a ResourceExhaustedError (as opposed to the InternalError I got before):

INFO:tensorflow:Writing pipeline config file to /content/gdrive/MyDrive/Colab Notebooks/Models/RetinaNet_Errors_210907/saved_model/config/pipeline.config
Done training data preprocessing.
Done validation data preprocessing.
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-8-aa8a4659be7c> in <module>()
     26 start = time.time()
     27 
---> 28 train(pretrained_model, verbose = Verbose)
     29 
     30 # Displaying the time elapsed for training

6 frames
<ipython-input-1-0daf2db5b026> in train(model, verbose)
    876 
    877       # Training step (forward pass + backwards pass)
--> 878       total_loss, localization_loss, classification_loss = train_step_fn(image_tensors, gt_boxes_list, gt_classes_list)
    879 
    880 

<ipython-input-1-0daf2db5b026> in train_step_fn(image_tensors, groundtruth_boxes_list, groundtruth_classes_list)
    828           preprocessed_images = tf.concat(
    829               [model.preprocess(image_tensor)[0]
--> 830               for image_tensor in image_tensors], axis=0)
    831           prediction_dict = model.predict(preprocessed_images, shapes)
    832 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py in wrapper(*args, **kwargs)
    204     """Call target, and fall back on dispatchers if there is a TypeError."""
    205     try:
--> 206       return target(*args, **kwargs)
    207     except (TypeError, ValueError):
    208       # Note: convert_to_eager_tensor currently raises a ValueError, not a

/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py in concat(values, axis, name)
   1767           dtype=dtypes.int32).get_shape().assert_has_rank(0)
   1768       return identity(values[0], name=name)
-> 1769   return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
   1770 
   1771 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py in concat_v2(values, axis, name)
   1211       return _result
   1212     except _core._NotOkStatusException as e:
-> 1213       _ops.raise_from_not_ok_status(e, name)
   1214     except _core._FallbackException:
   1215       pass

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   6939   message = e.message + (" name: " + name if name is not None else "")
   6940   # pylint: disable=protected-access
-> 6941   six.raise_from(core._status_to_exception(e.code, message), None)
   6942   # pylint: enable=protected-access
   6943 

/usr/local/lib/python3.7/dist-packages/six.py in raise_from(value, from_value)

ResourceExhaustedError: OOM when allocating tensor with shape[8,640,640,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat
guijacquemet commented 3 years ago

Hi, thanks, that's very helpful! Regarding your "ResourceExhaustedError": it looks like you are overloading the GPU. Try decreasing the "batch_size" parameter to fix this.
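
If a smaller "batch_size" is not enough, here is a minimal sketch (assuming the TF2 eager setup shown in your traceback) that makes GPU memory allocation less greedy; it needs to run before the model is built:

  import tensorflow as tf

  # Sketch: allocate GPU memory on demand instead of reserving it all up front;
  # this must be set before any model or tensor is placed on the GPU.
  for gpu in tf.config.list_physical_devices('GPU'):
      tf.config.experimental.set_memory_growth(gpu, True)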

Are you using RGB images?

Cheers

Guillaume

DaniBodor commented 3 years ago

Are you using RGB images?

I'm using 8-bit grayscale PNGs, roughly 1000x1000 pixels. For YOLO I noticed that if I export these from ImageJ as RGB with the Fire LUT, I get better results than with grayscale. In RetinaNet I tried feeding in these RGB versions, but that gave me a different error, which at the time I recognized was likely due to them not being grayscale.
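
In case it helps, a quick sketch to check what the notebook receives for one training image (the file path is just a placeholder):

  import cv2

  # Sketch: inspect one training image; prepare_data_to_train() in the notebook
  # applies cv2.COLOR_GRAY2RGB, so it expects single-channel grayscale input.
  img = cv2.imread('/content/example_training_image.png', cv2.IMREAD_UNCHANGED)
  print(img.shape, img.dtype)  # 8-bit grayscale -> (H, W) uint8; RGB -> (H, W, 3)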

Regarding your "ResourceExhaustedError": it looks like you are overloading the GPU. Try decreasing the "batch_size" parameter to fix this.

I am testing different batch sizes and am getting mixed, inconsistent results. It currently works with a batch size of 2, but earlier it did not (nor at 4). It also seems to depend on exactly which data set I use (augmented or not, and what I annotate). I will get back to you once I start making sense of what's going wrong.
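
One thing worth checking between attempts is whether memory from a previous run is still held on the GPU (a sketch, assuming the standard Colab GPU runtime):

  # Sketch: report GPU memory use from inside the Colab notebook; leftovers from
  # a half-freed previous run can make the same batch_size succeed once and OOM
  # the next time.
  !nvidia-smi --query-gpu=memory.used,memory.total --format=csv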

It might also be related to this:

Note that in Cell 3.3, I changed the location of checkpoints_path, as the checkpoint was not actually stored in model_path, but directly in the content folder

I discovered why this happens. The default folder that the notebooks are copied to on Drive contains a space: /content/gdrive/MyDrive/Colab Notebooks/. The download_weights function defined in Cell 1 tries to move the checkpoint folder there, but the mv command fails because it reads the space as an extra argument. I now replace %mv $checkpoint_current_path $model_path with:

  mv_target = '\"' + model_path + '\"'
  %mv $checkpoint_current_path $mv_target

and this seems to work without replacing checkpoints_path = os.path.join(model_path,'checkpoint') as in my original post. I will retry the things I tried previously with this fix in place and let you know whether it works.
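
An alternative that avoids shell quoting entirely would be to move the folder in Python instead (a sketch, not what the notebook currently does):

  import shutil

  # Sketch: shutil.move() handles destination paths containing spaces without
  # any quoting, unlike the %mv shell magic.
  shutil.move(checkpoint_current_path, model_path)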

iarganda commented 3 years ago

Hello @DaniBodor,

Would it be possible to share with us some of your images so we can try to reproduce the error, please?

Thanks in advance!

DaniBodor commented 3 years ago

Hi @iarganda, sorry for not getting back to you; I have been busy with something else for the last couple of weeks.

Here are links to the data I used for training:

images: https://drive.google.com/drive/folders/1Pk1swik0rJ_HvSYQ-sNlrv3fB0AbDg9P?usp=sharing
annotations: https://drive.google.com/drive/folders/1k_KIkTbkOmPaYAzTa5oPLsD0DYyS2mZi?usp=sharing

ErlantzCalvo commented 2 years ago

Hi @DaniBodor, it looks like the annotations link is not working anymore. If I could access it, I could try to solve the issue 😁

Thank you!