Training issue - Githubissues

m0ckingbird23 commented 5 years ago

Hello, I tried to retrain the model on the English dataset (Synth90k). After converting the data to TensorFlow records and launching the training, I got this message:

I0701 10:42:57.537306 8080 train_shadownet.py:572] Use single gpu to train the model
2019-07-01 10:43:00.951150: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-07-01 10:43:00.984350: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
I0701 10:43:01.212757 8080 train_shadownet.py:271] Training from scratch
Traceback (most recent call last):
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
     [[Node: val_IteratorGetNext = IteratorGetNext[output_shapes=[[32,32,100,3], <unknown>, [32]], output_types=[DT_FLOAT, DT_VARIANT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_1)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train_shadownet.py", line 578, in <module>
    need_decode=args.decode_outputs
  File "tools/train_shadownet.py", line 324, in train_shadownet
    [optimizer, train_ctc_loss, merge_summary_op])
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
     [[Node: val_IteratorGetNext = IteratorGetNext[output_shapes=[[32,32,100,3], <unknown>, [32]], output_types=[DT_FLOAT, DT_VARIANT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_1)]]

Caused by op 'val_IteratorGetNext', defined at:
  File "tools/train_shadownet.py", line 578, in <module>
    need_decode=args.decode_outputs
  File "tools/train_shadownet.py", line 159, in train_shadownet
    batch_size=CFG.TRAIN.BATCH_SIZE
  File "/media/khalyl/b19f6211-f6a7-443d-8a50-5c247986129e/khalyl/Desktop/image_recog/CRNN_Tensorflow-master/data_provider/shadownet_data_feed_pipline.py", line 289, in inputs
    num_threads=CFG.TRAIN.CPU_MULTI_PROCESS_NUMS
  File "/media/khalyl/b19f6211-f6a7-443d-8a50-5c247986129e/khalyl/Desktop/image_recog/CRNN_Tensorflow-master/data_provider/tf_io_pipline_fast_tools.py", line 406, in inputs
    return iterator.get_next(name='{:s}_IteratorGetNext'.format(self._dataset_flag))
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 410, in get_next
    name=name)), self._output_types,
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/khalyl/anaconda3/envs/rcnnenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): End of sequence
     [[Node: val_IteratorGetNext = IteratorGetNext[output_shapes=[[32,32,100,3], <unknown>, [32]], output_types=[DT_FLOAT, DT_VARIANT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_1)]]

I was wondering what's wrong, since I didn't change anything. Is it a CPU issue or a TensorFlow one? Do you have any idea how I can fix it?

MaybeShewill-CV commented 5 years ago

@lylk23 Did you use the same version of TensorFlow as in requirement.txt? It seems you hit the error when trying to fetch the next batch of data :)
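One possible cause worth ruling out (an assumption, not something confirmed in this thread): the failing op is `val_IteratorGetNext` with a fixed batch dimension of 32, so the error could also appear if the validation TFRecord file is empty or holds fewer than 32 records. The sketch below counts records by walking the TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) without importing TensorFlow. The function names are hypothetical, and the CRCs are skipped rather than verified.

```python
import struct


def count_tfrecord_records(path):
    """Count records in a TFRecord file by walking its framing.

    Each record is laid out as: uint64 length, uint32 CRC of the length,
    `length` payload bytes, uint32 CRC of the payload. CRCs are not checked.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            # Read the length CRC, the payload, and the payload CRC together.
            rest = f.read(4 + length + 4)
            if len(rest) != 4 + length + 4:
                raise ValueError("truncated record body")
            count += 1
    return count


def can_fill_one_batch(path, batch_size=32):
    """True if the file holds at least one full batch (32 in the log above)."""
    return count_tfrecord_records(path) >= batch_size
```

If `can_fill_one_batch` returns `False` for the validation file, regenerating the records or lowering the batch size would be the next thing to try.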

m0ckingbird23 commented 5 years ago

No, I am using TF 1.10.0. Should I use TF 1.12.0?

MaybeShewill-CV commented 5 years ago

@lylk23 You may upgrade your TensorFlow and check whether the problem still exists :)
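To confirm whether the installed version matches the pinned one before and after upgrading, a small check like the following may help. It is a minimal sketch: the helper names are hypothetical, the requirements text in the usage note is made up for illustration, and only exact `==` pins are handled.

```python
import re


def pinned_version(requirements_text, package):
    """Return the version pinned with '==' for `package`, or None if absent."""
    pattern = re.compile(
        r"^\s*" + re.escape(package) + r"\s*==\s*([0-9][\w.]*)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return match.group(1) if match else None


def matches_pin(installed, requirements_text, package):
    """True only if a pin exists and equals the installed version string."""
    pin = pinned_version(requirements_text, package)
    return pin is not None and installed == pin
```

For example, with a hypothetical requirements text `"tensorflow-gpu==1.12.0\n"`, `matches_pin("1.10.0", text, "tensorflow-gpu")` is `False`. The installed version string can be read from `tf.__version__`.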

m0ckingbird23 commented 5 years ago

Well, I just did, and this is what I get now:

I0701 12:26:13.678080 6510 train_shadownet.py:572] Use single gpu to train the model
2019-07-01 12:26:16.445877: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-07-01 12:26:16.595548: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-01 12:26:16.596396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 780M major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.61GiB
2019-07-01 12:26:16.596419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "tools/train_shadownet.py", line 578, in <module>
    need_decode=args.decode_outputs
  File "tools/train_shadownet.py", line 260, in train_shadownet
    sess = tf.Session(config=sess_config)
  File "/home/khalyl/anaconda3/envs/myenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/khalyl/anaconda3/envs/myenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

MaybeShewill-CV commented 5 years ago

@lylk23 Reinstall your CUDA driver. You can google how to install it :)
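The message "CUDA driver version is insufficient for CUDA runtime version" means the installed NVIDIA driver is older than the minimum required by the CUDA runtime that TensorFlow loads. The sketch below compares a driver version (as reported by `nvidia-smi`) against the minimum Linux driver for a given CUDA toolkit. The table values are recalled from NVIDIA's CUDA release notes and should be verified there; which CUDA runtime a particular TensorFlow build uses is not stated in this thread, so it is passed in as a parameter. The names are hypothetical.

```python
# Minimum Linux driver per CUDA toolkit version (assumed values; please
# verify against NVIDIA's CUDA release notes before relying on them).
MIN_LINUX_DRIVER = {
    "8.0": "375.26",
    "9.0": "384.81",
    "9.1": "390.46",
    "9.2": "396.26",
    "10.0": "410.48",
}


def _as_tuple(version):
    """Turn '384.81' into (384, 81) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_is_sufficient(driver_version, cuda_runtime):
    """True if the driver meets the minimum for the given CUDA runtime."""
    required = MIN_LINUX_DRIVER[cuda_runtime]
    return _as_tuple(driver_version) >= _as_tuple(required)
```

If the check fails and the driver cannot be updated, the alternative is a TensorFlow build that targets an older CUDA runtime.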

m0ckingbird23 commented 5 years ago

The problem is that I can't upgrade my CUDA driver because it won't be compatible with my graphics card.
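If the driver really cannot be updated, one workaround that may be worth trying is hiding the GPU so that TensorFlow falls back to the CPU. This is an untested suggestion for this setup, and training would be much slower. The sketch relies on the `CUDA_VISIBLE_DEVICES` environment variable, which must be set before TensorFlow is imported; the helper name is hypothetical.

```python
import os


def force_cpu_only(environ=os.environ):
    """Hide all CUDA devices from libraries imported after this call."""
    # "-1" is not a valid device index, so no GPU is made visible.
    environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return environ
```

Calling `force_cpu_only()` at the very top of the training script, before `import tensorflow`, is the intended usage; the same effect can be had by exporting the variable in the shell before launching the script.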

MaybeShewill-CV commented 5 years ago

@lylk23 Maybe you need to upgrade your hardware or find another dataset input solution that works with TensorFlow 1.10 :)

MaybeShewill-CV / CRNN_Tensorflow

Training issue #296