MaybeShewill-CV / CRNN_Tensorflow

Convolutional Recurrent Neural Networks(CRNN) for Scene Text Recognition
MIT License
1.03k stars 388 forks

tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [[{{node train_IteratorGetNext}} = IteratorGetNext[output_shapes=[[32,32,100,3], <unknown>, [32]], output_types=[DT_FLOAT, DT_VARIANT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]] #304

Closed kspook closed 5 years ago

kspook commented 5 years ago

@MaybeShewill-CV, the problem is not nvidia-smi, as was suggested in #295.


(crnntf) kspook@MLNC6:/usr/local/cuda/samples/bin/x86_64/linux/release$ nvidia-smi
Thu Jul  4 11:00:00 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000D691:00:00.0 Off |                    0 |
| N/A   47C    P0    57W / 149W |      0MiB / 11439MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(crnntf) kspook@MLNC6:/usr/local/cuda/samples/bin/x86_64/linux/release$ 

cuda9.0 installed successfully.


(crnntf) kspook@MLNC6:/usr/local/cuda/samples/bin/x86_64/linux/release$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K80"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 11440 MBytes (11995578368 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   54929 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

cudnn 7 installed successfully


(crnntf) kspook@MLNC6:~/cudnn_samples_v7/mnistCUDNN$ ./mnistCUDNN
cudnnGetVersion() : 7601 , CUDNN_VERSION from cudnn.h : 7601 (7.6.1)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11439, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 2
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.132224 time requiring 100 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.133056 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.157568 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.244576 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.406240 time requiring 203008 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 2
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.139008 time requiring 100 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.141408 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.176672 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.252768 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.431808 time requiring 203008 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

The error still occurred.


(crnntf) kspook@MLNC6:~/CRNN_Tensorflow$ python tools/train_shadownet.py --dataset_dir ./data/ --char_dict_path ./data/char_dict/char_dict.json --ord_map_dict_path ./data/char_dict/ord_map.json 
I0704 11:05:45.903860 17823 train_shadownet.py:569] Use single gpu to train the model
2019-07-04 11:05:49.530389: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-04 11:05:54.737760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: d691:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-07-04 11:05:54.737815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-04 11:05:55.023881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-04 11:05:55.023945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-04 11:05:55.023963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-04 11:05:55.024223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10295 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: d691:00:00.0, compute capability: 3.7)
I0704 11:05:55.310481 17823 train_shadownet.py:268] Training from scratch
Traceback (most recent call last):
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
     [[{{node train_IteratorGetNext}} = IteratorGetNext[output_shapes=[[32,32,100,3], <unknown>, [32]], output_types=[DT_FLOAT, DT_VARIANT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train_shadownet.py", line 575, in <module>
    need_decode=args.decode_outputs
  File "tools/train_shadownet.py", line 321, in train_shadownet
    [optimizer, train_ctc_loss, merge_summary_op])
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
     [[node train_IteratorGetNext (defined at /data/home/kspook/CRNN_Tensorflow/data_provider/tf_io_pipline_fast_tools.py:406)  = IteratorGetNext[output_shapes=[[32,32,100,3], <unknown>, [32]], output_types=[DT_FLOAT, DT_VARIANT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

Caused by op 'train_IteratorGetNext', defined at:
  File "tools/train_shadownet.py", line 575, in <module>
    need_decode=args.decode_outputs
  File "tools/train_shadownet.py", line 153, in train_shadownet
    batch_size=CFG.TRAIN.BATCH_SIZE
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/shadownet_data_feed_pipline.py", line 289, in inputs
    num_threads=CFG.TRAIN.CPU_MULTI_PROCESS_NUMS
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/tf_io_pipline_fast_tools.py", line 406, in inputs
    return iterator.get_next(name='{:s}_IteratorGetNext'.format(self._dataset_flag))
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 421, in get_next
    name=name)), self._output_types,
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/kspook/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): End of sequence
     [[node train_IteratorGetNext (defined at /data/home/kspook/CRNN_Tensorflow/data_provider/tf_io_pipline_fast_tools.py:406)  = IteratorGetNext[output_shapes=[[32,32,100,3], <unknown>, [32]], output_types=[DT_FLOAT, DT_VARIANT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

MaybeShewill-CV commented 5 years ago

@kspook Everything works fine on my local machine with tensorflow 1.12, a GTX 1070 and CUDA 9.0. If you are using the same tensorflow version, perhaps you need to raise an issue with tensorflow :)

kspook commented 5 years ago

@MaybeShewill-CV, it's a write_tfrecords.py issue. I used 49 characters (digits, Latin letters and Korean), but char_dict.json has only 10 lines and ord_map.json has only 20 lines, and the generated tfrecords files are 0 bytes. You can check the files here: https://drive.google.com/open?id=1TpfrQpi6h7cn1cH-y8NOmTXjrOH2DbtV (Korean images: image-data/hangul-images/).
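A 0-byte tfrecords file gives the input pipeline nothing to read, which is consistent with the `OutOfRangeError: End of sequence` raised on the very first `IteratorGetNext`. A minimal sketch for spotting such files before training follows. The function name `find_empty_tfrecords` and the `.tfrecords` suffix are assumptions made for illustration, not part of the repository.

```python
import os


def find_empty_tfrecords(directory, suffix=".tfrecords"):
    """Return the sorted paths of all 0-byte files ending in `suffix`.

    `suffix` is an assumption; adjust it to whatever the writer
    script actually names its output files.
    """
    empty = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(suffix) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Calling `find_empty_tfrecords("./data")` before `tools/train_shadownet.py` would list any record files that were created but never filled.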

I0704 21:46:55.333094 6091 shadownet_data_feed_pipline.py:159] Start initialize train sample information list...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 162/162 [00:00<00:00, 166293.99it/s]
I0704 21:46:55.339854 6091 shadownet_data_feed_pipline.py:174] Start initialize validation sample information list...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 86323.45it/s]
I0704 21:46:55.340504 6091 shadownet_data_feed_pipline.py:188] Start initialize testing sample information list...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 78012.22it/s]
I0704 21:46:55.341225 6091 shadownet_data_feed_pipline.py:212] Char set length: 12
I0704 21:46:55.341894 6091 shadownet_data_feed_pipline.py:219] Write char dict map complete
I0704 21:46:55.341991 6091 shadownet_data_feed_pipline.py:83] Generating training sample tfrecords...
I0704 21:46:55.342212 6091 tf_io_pipline_fast_tools.py:449] Start filling train dataset sample information queue...
  0%|          | 0/162 [00:00<?, ?it/s]
E0704 21:46:55.342514 6091 tf_io_pipline_fast_tools.py:462] Lexicon doesn't contain lexicon index 54616
E0704 21:46:55.342586 6091 tf_io_pipline_fast_tools.py:462] Lexicon doesn't contain lexicon index 54616
E0704 21:46:55.342641 6091 tf_io_pipline_fast_tools.py:462] Lexicon doesn't contain lexicon index 45208

I think I did not build lexicon.txt correctly. Do you know how to fix this?
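One observation about the rejected indices: 54616 and 45208 are both Unicode code points inside the Hangul Syllables block (U+AC00 to U+D7A3). That suggests, though the log alone does not prove it, that the label file held character code points where the pipeline expected positions in lexicon.txt. A small sketch to check this, with `is_hangul_syllable` as a hypothetical helper:

```python
def is_hangul_syllable(code_point):
    """True if `code_point` lies in the Unicode Hangul Syllables block."""
    return 0xAC00 <= code_point <= 0xD7A3


# The two indices reported by "Lexicon doesn't contain lexicon index ...".
for index in (54616, 45208):
    print(index, hex(index), chr(index), is_hangul_syllable(index))
```

If the printed characters match the text in the training images, the numbers in the label file are `ord()` values of characters rather than lexicon line numbers.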

MaybeShewill-CV commented 5 years ago

@kspook You may check how the Synth90k dataset is organized. The two json files are generated automatically while the tensorflow records are being made :)

kspook commented 5 years ago

@MaybeShewill-CV, I did. The two files, char_dict.json and ord_map.json, were generated automatically.

My question is why tf_io_pipline_fast_tools.py can't handle my lexicon.txt even though I made it in the same style as the Synth90k dataset.

MaybeShewill-CV commented 5 years ago

@kspook If you hit an error when testing the tools on the Synth90k dataset, you may post the error information here. If training on Synth90k works but training on your own dataset fails, please check your dataset's files yourself. There must be something wrong with your label file :)

kspook commented 5 years ago

@MaybeShewill-CV, I think I have the same situation as #285, but I get the error above when the line has no index.

If I add the index, I naturally get an error:

I0708 03:45:20.357513 12066 shadownet_data_feed_pipline.py:159] Start initialize train sample information list...
  0%|                                                                                                                                           | 0/162 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "tools/write_tfrecords.py", line 74, in <module>
    save_dir=args.save_dir
  File "tools/write_tfrecords.py", line 56, in write_tfrecords
    writer_process_nums=8
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/shadownet_data_feed_pipline.py", line 64, in __init__
    self._init_dataset_sample_info()
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/shadownet_data_feed_pipline.py", line 164, in _init_dataset_sample_info
    image_name, label_index = line.rstrip('\r').rstrip('\n').split(' ')
ValueError: too many values to unpack (expected 2)
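The traceback shows that each annotation line is split on single spaces and unpacked into exactly two fields, an image name and a label index. A line with an extra field therefore fails before any record is written. The sketch below checks a label file up front. The function name is hypothetical, and treating the index as a zero-based position in lexicon.txt is an assumption based on the Synth90k layout.

```python
def check_annotation_lines(lines, lexicon_size):
    """Return (line_number, problem) pairs for lines the loader would reject.

    Mirrors the split(' ') / int() steps visible in the traceback; the
    range check assumes zero-based indices into the lexicon.
    """
    problems = []
    for line_no, raw in enumerate(lines, start=1):
        fields = raw.rstrip("\r\n").split(" ")
        if len(fields) != 2:
            problems.append(
                (line_no, "expected 2 fields, got {}".format(len(fields))))
            continue
        _image_name, label_index = fields
        try:
            index = int(label_index)
        except ValueError:
            problems.append(
                (line_no, "label index is not an integer: {!r}".format(label_index)))
            continue
        if not 0 <= index < lexicon_size:
            problems.append(
                (line_no, "label index {} outside lexicon of size {}".format(
                    index, lexicon_size)))
    return problems
```

Running it over the lines of the annotation file, with `lexicon_size` set to the number of lines in lexicon.txt, reports every offending line at once instead of stopping at the first exception.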

kspook commented 5 years ago

@MaybeShewill-CV, can you upload the Chinese data to Google Drive? I can't download it to check.

MaybeShewill-CV commented 5 years ago

@kspook Check your dataset's format against the Synth90k dataset :)

kspook commented 5 years ago

I did. The answer must be #302: lack of data. But in my case the tfrecords were incomplete because of the lexicon index errors.
Can you check my file? https://drive.google.com/open?id=1k0qsklB8Y1IbMUBOurnTKTEhUUxw_pwK I don't think it differs from Synth90k.

I am also interested in how to make the files for Chinese. Unlike English, Chinese was converted to numbers. How did you build Chinese words? How can you tell two characters apart?

According to https://github.com/MaybeShewill-CV/CRNN_Tensorflow/issues/285#issuecomment-505333966, a Chinese word appears to have a single index (number). Am I right?

MaybeShewill-CV commented 5 years ago

@kspook Maybe you could test whether the problem still exists after enlarging your dataset :)

kspook commented 5 years ago

@MaybeShewill-CV, I managed to use Synth90k. Can I just ignore the 'PREMATURE END OF IMAGE' message?

There was an earlier post about it in #112, but no exact answer.

kspook commented 5 years ago

I did it with Synth90k. I am interested in Chinese and Korean. In the Chinese case, you replaced every Chinese character with a number. How did you handle words of two or more characters? How can the script tell the two numbers apart in a two-character word?

For example: chinese1, chinese2 --> one word; ord(chinese1), ord(chinese2) --> 50000,5002. How did you turn them into one word?

MaybeShewill-CV commented 5 years ago

@kspook

> @MaybeShewill-CV, I managed to use Synth90k. Can I just ignore the 'PREMATURE END OF IMAGE' message?
>
> There was an earlier post about it in #112, but no exact answer.

Your image file is incomplete or invalid.

> I did it with Synth90k. I am interested in Chinese and Korean. In the Chinese case, you replaced every Chinese character with a number. How did you handle words of two or more characters? How can the script tell the two numbers apart in a two-character word?
>
> For example: chinese1, chinese2 --> one word; ord(chinese1), ord(chinese2) --> 50000,5002. How did you turn them into one word?
>
> I did. The answer must be #302: lack of data. But in my case the tfrecords were incomplete because of the lexicon index errors. Can you check my file? https://drive.google.com/open?id=1k0qsklB8Y1IbMUBOurnTKTEhUUxw_pwK I don't think it differs from Synth90k. I am also interested in how to make the files for Chinese. Unlike English, Chinese was converted to numbers. How did you build Chinese words? How can you tell two characters apart? According to #285 (comment), a Chinese word appears to have a single index (number). Am I right?

I did not transform them into one word. You have probably misunderstood the model :)

kspook commented 5 years ago

Then what is your understanding?

How did you make the Chinese lexicon and label files?

This model works on ASCII values, right? Did you use Chinese characters in the lexicon and labels?

MaybeShewill-CV commented 5 years ago

@kspook Yep, I've trained a Chinese model and posted it here :)

kspook commented 5 years ago

@MaybeShewill-CV, you misunderstood me.

If I put Korean characters in annotation_train.txt, I get the error below. So my question is how you dealt with Chinese characters. I thought you transformed Chinese characters into numbers.

Traceback (most recent call last):
  File "tools/write_tfrecords.py", line 74, in <module>
    save_dir=args.save_dir
  File "tools/write_tfrecords.py", line 56, in write_tfrecords
    writer_process_nums=8
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/shadownet_data_feed_pipline.py", line 64, in __init__
    self._init_dataset_sample_info()
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/shadownet_data_feed_pipline.py", line 166, in _init_dataset_sample_info
    label_index = int(label_index)
ValueError: invalid literal for int() with base 10: '\ud558'

MaybeShewill-CV commented 5 years ago

@kspook English characters and Chinese characters share the same way of generating tensorflow records :)
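The general idea behind treating all scripts the same way can be sketched as follows. This is a simplified illustration of per-character label encoding, not the repository's actual code: the function names and dictionary layout are hypothetical, and the real char_dict.json and ord_map.json formats may differ. A word is never collapsed into one number; it becomes a sequence with one class index per character.

```python
def build_char_map(words):
    """Assign one class index to every distinct character in `words`."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i for i, ch in enumerate(chars)}


def encode_word(word, char_to_index):
    """Encode a word as a list holding one class index per character."""
    return [char_to_index[ch] for ch in word]


def decode_indices(indices, char_to_index):
    """Invert encode_word by looking each class index up again."""
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    return "".join(index_to_char[i] for i in indices)
```

With this scheme a two-character Korean or Chinese word yields a two-element label sequence, exactly as a two-letter English word does, so no separator between the numbers is needed.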

kspook commented 5 years ago

@MaybeShewill-CV, thank you.

Recently I was unable to download Synth90k for a long time, so I was working from wrong information and old data. I finally managed to download the files again and found the problem in how I had made the tfrecords. Now I can train on Korean data.

MaybeShewill-CV commented 5 years ago

@kspook ok :)

yds5817 commented 5 years ago

@kspook I also encountered this problem.

Traceback (most recent call last):
  File "tools/write_tfrecords.py", line 74, in <module>
    save_dir=args.save_dir
  File "tools/write_tfrecords.py", line 56, in write_tfrecords
    writer_process_nums=8
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/shadownet_data_feed_pipline.py", line 64, in __init__
    self._init_dataset_sample_info()
  File "/data/home/kspook/CRNN_Tensorflow/data_provider/shadownet_data_feed_pipline.py", line 166, in _init_dataset_sample_info
    label_index = int(label_index)
ValueError: invalid literal for int() with base 10: '\ud558'

How did you solve it?
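The `int()` call in the traceback means the second field of every annotation line must be a number, so putting the Korean text itself there fails. One way to prepare the files, sketched under the assumption that the number is a zero-based line position in lexicon.txt as in Synth90k, is to keep the words in the lexicon and write only their positions into the annotation file. The function name and the `(image_path, word)` input format are hypothetical.

```python
def build_lexicon_and_annotations(samples):
    """Turn (image_path, word) pairs into lexicon entries and label lines.

    Each annotation line is "<image_path> <index>", where <index> is the
    word's position in the returned lexicon (zero-based: an assumption).
    Image paths must not contain spaces, because the loader splits on ' '.
    """
    lexicon = []
    word_to_index = {}
    annotation_lines = []
    for image_path, word in samples:
        if word not in word_to_index:
            word_to_index[word] = len(lexicon)
            lexicon.append(word)
        annotation_lines.append(
            "{} {}".format(image_path, word_to_index[word]))
    return lexicon, annotation_lines
```

When writing the returned lists to lexicon.txt and the annotation file, opening both with `encoding="utf-8"` keeps the non-ASCII words intact.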