HasnainRaz / Skin-Segmentation-TensorFlow

A modified SegNet Convolutional Neural Net for segmenting human skin from images
MIT License

Problem in training #9

Status: Closed (closed by quanghuy0497 5 years ago)

quanghuy0497 commented 5 years ago

When I started training your model, I ran into the error below, and I have no idea how to fix it. Could you please help me? Thank you very much.

Train loss:  0.0 Train iou:  0.0
Val. loss:  0.6931054 Val. iou:  0.4069072
Starting epoch:  0
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[4,256,121,161] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node up2/conv2d_transpose}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[iou_metric/confusion_matrix/stack_1/_81]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[4,256,121,161] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node up2/conv2d_transpose}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "model.py", line 232, in <module>
    train(image_paths, mask_paths, val_image_paths, val_mask_paths)
  File "model.py", line 156, in train
    [train, cost, iou_update, seg_image], feed_dict=train_feed_dict)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[4,256,121,161] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node up2/conv2d_transpose (defined at /tmp/tmpwtsgo0a_.py:68) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[iou_metric/confusion_matrix/stack_1/_81]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[4,256,121,161] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node up2/conv2d_transpose (defined at /tmp/tmpwtsgo0a_.py:68) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node up2/conv2d_transpose:
 up2/stack (defined at /tmp/tmpwtsgo0a_.py:67)  
 batch_normalization_9/cond/Merge (defined at /tmp/tmp5xi7m83o.py:14)   
 up2/kernel/read (defined at model.py:33)

Input Source operations connected to node up2/conv2d_transpose:
 up2/stack (defined at /tmp/tmpwtsgo0a_.py:67)  
 batch_normalization_9/cond/Merge (defined at /tmp/tmp5xi7m83o.py:14)   
 up2/kernel/read (defined at model.py:33)

Original stack trace for 'up2/conv2d_transpose':
  File "model.py", line 232, in <module>
    train(image_paths, mask_paths, val_image_paths, val_mask_paths)
  File "model.py", line 114, in train
    logits = inference(image_placeholder, training_flag)
  File "model.py", line 79, in inference
    up2 = trans_conv_with_bn(unconv3, 256, [3, 3], is_training, name='up2')
  File "model.py", line 33, in trans_conv_with_bn
    use_bias=use_bias, kernel_initializer=tf.contrib.layers.xavier_initializer(), name=name)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/layers/convolutional.py", line 1279, in conv2d_transpose
    return layer.apply(inputs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/layers/base.py", line 537, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper
    ), args, kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 450, in converted_call
    result = converted_f(*effective_args, **kwargs)
  File "/tmp/tmpwtsgo0a_.py", line 68, in tf__call
    outputs = ag__.converted_call('conv2d_transpose', backend, ag__.ConversionOptions(recursive=True, force_conversion=False, optional_features=(), internal_convert_user_code=True), (inputs, self.kernel, output_shape_tensor), {'strides': self.strides, 'padding': self.padding, 'data_format': self.data_format, 'dilation_rate': self.dilation_rate})
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 356, in converted_call
    return _call_unconverted(f, args, kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 253, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 4582, in conv2d_transpose
    data_format=tf_data_format)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 2147, in conv2d_transpose
    name=name)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 2218, in conv2d_transpose_v2
    name=name)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1407, in conv2d_backprop_input
    name=name)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/user/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

HasnainRaz commented 5 years ago

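It seems you don't have enough GPU memory to train the model. This is an OOM (out of memory) error: the log shows the GPU allocator failing on a float tensor of shape [4,256,121,161] at node up2/conv2d_transpose. Try using smaller images or reducing the batch size.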
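
To make the suggestion concrete, here is a rough sketch of how much memory the single tensor named in the OOM message needs, and how that changes with a smaller batch or smaller images. It assumes the first dimension of the reported shape is the batch size and that "type float" means 4-byte float32. The helper names and the reduced shapes are hypothetical and are not taken from model.py. The figure covers one tensor only; training keeps many activations and gradients alive at once, so the real footprint is much larger.

```python
BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log is 32-bit


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in bytes of one dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


# Shape copied from the OOM message in the log.
failing_shape = (4, 256, 121, 161)

# Hypothetical alternatives: half the batch, or roughly half the
# height and width (assuming the last two dimensions are spatial).
half_batch_shape = (2, 256, 121, 161)
half_size_shape = (4, 256, 60, 80)

for label, shape in [
    ("as reported", failing_shape),
    ("half batch", half_batch_shape),
    ("half height/width", half_size_shape),
]:
    size = tensor_bytes(shape)
    print(f"{label:>18}: {shape} -> {size} bytes ({to_mib(size):.1f} MiB)")
```

Halving the batch size halves this tensor, while halving both height and width cuts it to roughly a quarter, which is why smaller input images tend to help more than a smaller batch.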

quanghuy0497 commented 5 years ago

Thank you very much, I solved the issue.