Viredery / tf-eager-fasterrcnn

A Faster R-CNN R-101-FPN model implemented with TensorFlow 2.0 eager execution.
MIT License

KernelDef: 'op: "ConcatV2" device_type: "CPU" constraint #9

Closed moulicm111 closed 4 years ago

moulicm111 commented 4 years ago

Can you explain this error? (TensorFlow 2.1.0)

```
loading annotations into memory...
Done (t=0.08s)
creating index...
index created!
Traceback (most recent call last):
  File "train_model.py", line 50, in <module>
    _ = model((batch_imgs, batch_metas), training=False)
  File "/home/advancedtf/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 822, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/home/16-fasterRCNN/detection/models/detectors/faster_rcnn.py", line 157, in call
    rcnn_probs_list, rcnn_deltas_list, rois_list, img_metas)
  File "/home/16-fasterRCNN/detection/models/bbox_heads/bbox_head.py", line 121, in get_bboxes
    for i in range(img_metas.shape[0])
  File "/home/16-fasterRCNN/detection/models/bbox_heads/bbox_head.py", line 121, in <listcomp>
    for i in range(img_metas.shape[0])
  File "/home/16-fasterRCNN/detection/models/bbox_heads/bbox_head.py", line 188, in _get_bboxes_single
    nms_keep = tf.concat(nms_keep, axis=0)
  File "/home/advancedtf/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/advancedtf/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 1517, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/home/advancedtf/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1118, in concat_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/home/advancedtf/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT64 } } } host_memory_arg: "axis"' [Op:ConcatV2] name: concat
```

Viredery commented 4 years ago

Hi @moulicm111, this is an old version of tf-eager-fasterrcnn and the bug was fixed some time ago. You can check the corresponding code in the latest codebase to see the fix. The crash happens because the input of the ConcatV2 op is empty: note the `N=0` in the NodeDef, meaning `tf.concat` in `_get_bboxes_single` received an empty `nms_keep` list.
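As a minimal sketch of the kind of guard that avoids this crash, using NumPy in place of TensorFlow to stay self-contained (the `nms_keep` name follows the traceback; the helper function and dtype choice are illustrative): concatenation cannot infer an output dtype from zero inputs, so the empty case must return an explicitly typed empty array.

```python
import numpy as np

def concat_keep_indices(nms_keep, dtype=np.int64):
    """Concatenate per-class NMS keep indices into one array.

    np.concatenate (like tf.concat / the ConcatV2 kernel) cannot
    determine a dtype from zero inputs, so an empty list must be
    handled explicitly instead of being passed through.
    """
    if len(nms_keep) == 0:
        # No class produced any detections: return an empty, typed result.
        return np.zeros((0,), dtype=dtype)
    return np.concatenate(nms_keep, axis=0)

# Detections kept for two classes:
kept = concat_keep_indices([np.array([0, 3]), np.array([5])])
# No detections at all -- the case that crashed the original code:
empty = concat_keep_indices([])
```

The same shape of guard (an `if`/`else` around the concat, or `tf.cond` in graph mode) is what prevents ConcatV2 from ever seeing `N=0` inputs.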

moulicm111 commented 4 years ago

Thanks, and please resolve this too: the total loss becomes NaN after some epochs. I am getting this warning 5 times in every epoch:

```
WARNING:tensorflow:Gradients do not exist for variables ['faster_rcnn/b_box_head/rcnn_class_conv1/kernel:0', 'faster_rcnn/b_box_head/rcnn_class_conv1/bias:0', 'faster_rcnn/b_box_head/rcnn_class_bn1/gamma:0', 'faster_rcnn/b_box_head/rcnn_class_bn1/beta:0', 'faster_rcnn/b_box_head/rcnn_class_conv2/kernel:0', 'faster_rcnn/b_box_head/rcnn_class_conv2/bias:0', 'faster_rcnn/b_box_head/rcnn_class_bn2/gamma:0', 'faster_rcnn/b_box_head/rcnn_class_bn2/beta:0', 'faster_rcnn/b_box_head/rcnn_class_logits/kernel:0', 'faster_rcnn/b_box_head/rcnn_class_logits/bias:0', 'faster_rcnn/b_box_head/rcnn_bbox_fc/kernel:0', 'faster_rcnn/b_box_head/rcnn_bbox_fc/bias:0'] when minimizing the loss.
```

Viredery commented 4 years ago

Are you using the newest version, and are you loading weights? Either the pre-trained detector weights or ResNet101 weights trained on ImageNet are acceptable.

moulicm111 commented 4 years ago

I didn't load the pretrained weights; I want to train from scratch because it is a different problem (manipulation detection). When I tried gradient clipping, the loss no longer goes to NaN, but it doesn't seem to converge, even when I try to overfit on a small dataset. I want to know the impact of those warning messages.
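For reference, the gradient clipping mentioned above is usually done by global norm; a NumPy sketch mirroring the behaviour of `tf.clip_by_global_norm` (the gradient values here are purely illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients jointly so their global L2 norm <= clip_norm.

    Mirrors tf.clip_by_global_norm: if the global norm already
    satisfies the bound, the gradients are returned unchanged;
    otherwise every gradient is scaled by clip_norm / global_norm,
    preserving their relative directions.
    """
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if global_norm <= clip_norm:
        return grads, global_norm
    scale = clip_norm / global_norm
    return [g * scale for g in grads], global_norm

# Two parameter tensors whose global norm is sqrt(9 + 16 + 144) = 13:
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=6.5)
# clipped is grads scaled by 0.5: [1.5, 2.0] and [6.0]
```

Note that clipping only bounds the update size; it hides, rather than fixes, whatever makes the loss diverge (initialization, learning rate, or the missing-gradient warning above).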

Viredery commented 4 years ago

There are many things you need to consider.

For example,
(1) BN weights are frozen during training in this codebase, so you need to modify that setting. Furthermore, if you want to train the BN layers, you need multi-GPU training instead of a single GPU, because the per-GPU batch size is too small for reliable BN statistics. So you can't simply run this code to train COCO from scratch.
(2) You have to choose an appropriate layer initialization method. A wrong initialization can cause exploding gradients, and the loss may eventually become NaN.
(3) etc.
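As a sketch of point (2), the two standard choices are Glorot/Xavier initialization (tanh/linear layers) and He initialization (ReLU layers, as in a detector's conv/fc heads); the layer sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform init: samples from U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out)), keeping activation variance
    roughly constant across layers for tanh/linear units."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """He normal init: std = sqrt(2 / fan_in), compensating for the
    variance halving caused by ReLU activations."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

w = he_normal(1024, 1024)   # empirical std close to sqrt(2/1024)
g = glorot_uniform(256, 256)  # every entry within +/- sqrt(6/512)
```

Too large a scale at initialization compounds layer by layer, which is exactly the exploding-then-NaN pattern described above.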

Also, I suggest choosing a smaller backbone that is easier to train, e.g. VGG16 or ResNet18.

moulicm111 commented 4 years ago

Thanks for your time and valuable suggestions.