Closed: YinJerry closed this issue 7 years ago.
Something you might try is: yes | kur -vv train --step Kurfile.yml. This will generate a lot of output, but when it crashes, you should be able to see the data in the last batch that was submitted (including the audio UUIDs).
Ok, I'm trying and I'll check whether it works. Thank you for your suggestion. I'll come back soon.
I tried "yes | kur -vv train --step speech-Jerry.yml" (not Kurfile.yml), but I did not find the expected UUID information in the DEBUG log. Can you help me check? I have not updated the Kur version since I first hit this issue. The following is the printed log:
Epoch 41/inf, loss=151.441:  58%|█████▊ | 86848/148688 [2:24:03<1:39:55, 10.31samples/s]
[DEBUG 2017-03-26 23:40:21,947 kur.model.executor:590] Training on batch...
[DEBUG 2017-03-26 23:40:21,949 kur.providers.batch_provider:156] Preparing next batch of data...
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Labels length is zero in batch 17
[ERROR 2017-03-26 23:40:23,284 kur.model.executor:236] Exception raised during training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Labels length is zero in batch 17
	 [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]
	 [[Node: gradients/concat_12_grad/Slice_1/_685 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1263_gradients/concat_12_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 233, in train
    **kwargs
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 598, in wrapped_train
    model=self.model, data=batch)
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 838, in try_func
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/keras_backend.py", line 786, in train
    return self.run_batch(model, data, 'train', True)
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/keras_backend.py", line 771, in run_batch
    outputs = compiled['func'](...)
  File "/usr/local/lib/python3.4/dist-packages/keras/backend/tensorflow_backend.py", line 1943, in __call__
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Labels length is zero in batch 17
	 [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]
	 [[Node: gradients/concat_12_grad/Slice_1/_685 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1263_gradients/concat_12_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]
Caused by op 'CTCLoss', defined at:
File "/usr/local/bin/kur", line 11, in <module>
InvalidArgumentError (see above for traceback): Labels length is zero in batch 17
	 [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]
	 [[Node: gradients/concat_12_grad/Slice_1/_685 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1263_gradients/concat_12_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]
[DEBUG 2017-03-26 23:40:23,292 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_asr
[DEBUG 2017-03-26 23:40:23,309 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_total
[DEBUG 2017-03-26 23:40:23,318 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_time
[DEBUG 2017-03-26 23:40:23,324 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_batch
[DEBUG 2017-03-26 23:40:23,331 kur.loggers.binary_logger:144] Writing logger summary.
[DEBUG 2017-03-26 23:40:23,873 kur.providers.batch_provider:204] Next batch of data has been prepared.
Exception ignored in: <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x7f5cc12dac50>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 581, in __del__
AttributeError: 'NoneType' object has no attribute 'TF_DeleteStatus'
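For what it's worth, the crash above means CTCLoss received a sample whose label sequence was empty after preprocessing. Below is a hedged pre-flight sketch, independent of Kur's internals, that flags such samples before training starts; the vocabulary is an assumption, so substitute the alphabet your Kurfile actually uses:

```python
# Pre-flight check for the "Labels length is zero" CTC error: a transcript
# can be a non-empty string yet still map to zero labels once characters
# outside the model's vocabulary are dropped (e.g. a digits-only line).
# VOCAB is an assumed alphabet; replace it with the one your pipeline uses.

VOCAB = set("abcdefghijklmnopqrstuvwxyz' ")

def zero_label_uuids(samples):
    """Yield the UUID of every (uuid, transcript) pair whose transcript
    produces an empty label sequence under the assumed vocabulary."""
    for uuid, transcript in samples:
        labels = [c for c in transcript.lower() if c in VOCAB]
        if not labels:
            yield uuid
```

Running this over the training manifest would name the offending utterances directly, instead of waiting hours for batch 17 to come around again.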
Do you know what version of Kur you are using? From PyPI? From git (if so, what commit are you at)? When I use the off-the-shelf speech recognition example (speech.yml):
$ yes | kur -vv train --step speech.yml
[INFO 2017-03-28 12:52:44,630 kur.kurfile:710] Parsing source: speech.yml, included by top-level.
[INFO 2017-03-28 12:52:44,652 kur.kurfile:85] Parsing Kurfile...
[DEBUG 2017-03-28 12:52:44,653 kur.kurfile:827] Parsing Kurfile section: settings
[DEBUG 2017-03-28 12:52:44,658 kur.kurfile:827] Parsing Kurfile section: train
...
[DEBUG 2017-03-28 12:53:47,185 kur.providers.batch_provider:156] Preparing next batch of data...
...
audio_source (16,): ['/home/ajsyp/kur/lsdc-train/audio/67290d16-4254-4db3-93cf-a9c26cc6e19b'
'/home/ajsyp/kur/lsdc-train/audio/d26b6d7c-c672-4504-9ffc-8d7035762260'
'/home/ajsyp/kur/lsdc-train/audio/e375c6a1-cd1f-4308-a9ad-c18177452706'
'/home/ajsyp/kur/lsdc-train/audio/f568130e-3e23-4801-bb00-7abd54908be9'
...
If you aren't getting anything like that, then knowing your Kur version (and installation method / git commit) and having your Kurfile will help with debugging.
Any more information you can offer / response to my last comment?
I am sorry, I am on a business trip and cannot check what you recommended. You can close this for now; I will give you feedback when I come back in two weeks. Thank you!
During training, I hit an ERROR. Have you seen this before, and what could the problem be? A problem with the training data? I've checked that there are no empty strings in the audio transcripts. Is there any other possible cause? And is there any debug log information that would show me which transcript (e.g. by UUID) triggered this error?
[ERROR 2017-03-16 22:28:55,510 kur.model.executor:236] Exception raised during training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Labels length is zero in batch 17
	 [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]