[potential Bug] error in training FRRN-B

arasharchor commented 6 years ago

Bug reports

Information

Please specify the following information when submitting an issue:

What are your command line arguments?: python main.py --num_epochs 100 --mode train --dataset CUSTOM_512 --batch_size 1 --num_val_images 11 --model FRRN-B --crop_height 512 --crop_width 512
Have you written any custom code?: No
What have you done to try and solve this issue?:
TensorFlow version?:

Describe the problem

Training with FRRN-B throws the following error, whereas FRRN-A trained without a problem.

Source code / logs

The main part of the error log is

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,384,32,32] vs. shape[1] = [1,32,30,30]

and the complete log is:

2018-08-02 18:01:52.158947: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-02 18:01:53.294597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:08:00.0
totalMemory: 11.90GiB freeMemory: 11.75GiB
2018-08-02 18:01:53.294653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-02 18:01:53.832936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-02 18:01:53.832988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-08-02 18:01:53.832996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-08-02 18:01:53.834042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11374 MB memory) -> physical GPU (device: 0, name: TITAN X (Pascal), pci bus id: 0000:08:00.0, compute capability: 6.1)
Preparing the model ...
WARNING:tensorflow:From main_orgsettings.py:250: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

This model has 24748690 trainable parameters
Loading the data ...

***** Begin training *****
Dataset --> CUSTOM_512
Model --> FRRN-B
Crop Height --> 512
Crop Width --> 512
Num Epochs --> 100
Batch Size --> 1
Num Classes --> 2
Num Training Images --> 6720
Data Augmentation:
        Vertical Flip --> False
        Horizontal Flip --> False
        Brightness Alteration --> None
        Rotation --> None

Traceback (most recent call last):
  File "main_orgsettings.py", line 369, in <module>
    _,current=sess.run([opt,loss],feed_dict={net_input:input_image_batch,net_output:output_image_batch, learning_rate: lr})
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [1,384,32,32] vs. shape[1] = [1,32,30,30]
         [[Node: concat_13 = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](concat_13-0-TransposeNHWCToNCHW-LayoutOptimizer, max_pool_18, concat_13-2-LayoutOptimizer)]]
         [[Node: gradients/concat_15_grad/ConcatOffset/_345 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1485_gradients/concat_15_grad/ConcatOffset", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'concat_13', defined at:
  File "main_orgsettings.py", line 179, in <module>
    network = build_frrn(net_input, preset_model = args.model, num_classes=num_classes)
  File "models/FRRN.py", line 191, in build_frrn
    pool_stream, res_stream = FullResolutionResidualUnit(pool_stream=pool_stream, res_stream=res_stream, n_filters_3=192, n_filters_1=32, pool_scale=17)
  File "models/FRRN.py", line 47, in FullResolutionResidualUnit
    G = tf.concat([pool_stream, slim.pool(res_stream, [pool_scale, pool_scale], stride=[pool_scale, pool_scale], pooling_type='MAX')], axis=-1)
  File "/home/maj/.virtualenvs/objdetTF2.7/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1189, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 953, in concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/maj/.virtualenvs/objdetTF3/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,384,32,32] vs. shape[1] = [1,32,30,30]
         [[Node: concat_13 = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](concat_13-0-TransposeNHWCToNCHW-LayoutOptimizer, max_pool_18, concat_13-2-LayoutOptimizer)]]
         [[Node: gradients/concat_15_grad/ConcatOffset/_345 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1485_gradients/concat_15_grad/ConcatOffset", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I tried to track the origin of the error, but I could not find where it is failing.

GeorgeSeif commented 6 years ago

Ah well, first off, it looks like I have a typo on line 191 of models/FRRN.py: pool_scale should probably be 16, not 17. That should fix the 32x32 vs. 30x30 size mismatch.

Could you try changing that?
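
For reference, the 30x30 in the error seems to follow directly from the pooling arithmetic. Here is a minimal sketch, assuming the residual stream is at the full 512x512 crop resolution and that slim.pool runs with its default VALID padding (the helper name valid_pool_output_size is made up for illustration and is not part of the repo):

```python
def valid_pool_output_size(input_size, kernel_size, stride):
    """Spatial output size of a pooling op with VALID (no) padding."""
    return (input_size - kernel_size) // stride + 1


# Assumption: res_stream is at the crop resolution given by
# --crop_height / --crop_width (512 here).
crop_size = 512

for pool_scale in (17, 16):
    pooled = valid_pool_output_size(crop_size, pool_scale, pool_scale)
    print("pool_scale={} -> {}x{}".format(pool_scale, pooled, pooled))
```

With pool_scale=17 the pooled residual stream comes out at 30x30, which cannot be concatenated with the 32x32 pooling stream; with pool_scale=16 it comes out at 32x32 and the spatial dimensions line up.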

arasharchor commented 6 years ago

Oh right, I missed that 17. I corrected it and now it is working. Made a fork. Thanks :)

GeorgeSeif / Semantic-Segmentation-Suite