deepgram / kur

Descriptive Deep Learning
Apache License 2.0
814 stars 107 forks

Please add some debug info to help finding the error cause #40

Closed YinJerry closed 7 years ago

YinJerry commented 7 years ago

During training I hit the error below. Have you seen it before, and what could cause it? Could it be a problem with the training data? I've checked that there are no empty strings in the audio transcripts. Is there any other possible cause? Alternatively, is there debug logging that would show which transcript (for example, its UUID) triggered this error?

[ERROR 2017-03-16 22:28:55,510 kur.model.executor:236] Exception raised during training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Labels length is zero in batch 17
  [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]

ajsyp commented 7 years ago

One thing you might try:

$ yes | kur -vv train --step Kurfile.yml

This will generate a lot of output, but when it crashes you should be able to see the data in the last batch that was submitted (including audio UUIDs).
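Separately from the verbose run, a transcript can be non-empty on disk and still produce zero labels once characters outside the model's vocabulary are dropped. The sketch below scans a transcript manifest for such entries. It is only an illustration under loudly stated assumptions: the manifest is assumed to be JSON Lines with "uuid" and "text" keys, the vocabulary is assumed to be lowercase letters, space, and apostrophe, and the file name in the usage note is hypothetical. None of this is confirmed as Kur's actual preprocessing.

```python
import json

# ASSUMPTION: the model's vocabulary; adjust to match your Kurfile.
VOCAB = set("abcdefghijklmnopqrstuvwxyz '")


def label_length(text, vocab=VOCAB):
    """Count the characters that survive stripping and vocabulary filtering."""
    return sum(1 for ch in text.strip().lower() if ch in vocab)


def find_empty_transcripts(lines, vocab=VOCAB):
    """Return the uuids of manifest entries that would yield zero labels.

    ASSUMPTION: each non-blank line is a JSON object with "uuid" and "text".
    """
    empty = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the manifest
        entry = json.loads(line)
        if label_length(entry.get("text", ""), vocab) == 0:
            empty.append(entry.get("uuid"))
    return empty
```

A possible way to use it, with a hypothetical manifest path: `with open("train.jsonl") as fh: print(find_empty_transcripts(fh))`. Transcripts consisting only of digits, punctuation, or whitespace would show up here even though they are not empty strings.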

YinJerry commented 7 years ago

OK, I'm trying it now and will check whether it works. Thank you for the suggestion. I'll report back soon.

YinJerry commented 7 years ago

I tried "yes | kur -vv train --step speech-Jerry.yml" (my Kurfile is speech-Jerry.yml, not Kurfile.yml), but I did not find the expected UUID information in the DEBUG log. Could you help check? I have not updated Kur since I first hit this issue. Here is the log output:

Epoch 41/inf, loss=151.441: 58%|█████▊ | 86848/148688 [2:24:03<1:39:55, 10.31samples/s]
[DEBUG 2017-03-26 23:40:21,947 kur.model.executor:590] Training on batch...
[DEBUG 2017-03-26 23:40:21,949 kur.providers.batch_provider:156] Preparing next batch of data...
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Labels length is zero in batch 17

[ERROR 2017-03-26 23:40:23,284 kur.model.executor:236] Exception raised during training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Labels length is zero in batch 17
  [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]
  [[Node: gradients/concat_12_grad/Slice_1/_685 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1263_gradients/concat_12_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 233, in train
    **kwargs
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 598, in wrapped_train
    model=self.model, data=batch)
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 838, in try_func
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/keras_backend.py", line 786, in train
    return self.run_batch(model, data, 'train', True)
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/keras_backend.py", line 771, in run_batch
    outputs = compiled'func'
  File "/usr/local/lib/python3.4/dist-packages/keras/backend/tensorflow_backend.py", line 1943, in __call__
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Labels length is zero in batch 17
  [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]
  [[Node: gradients/concat_12_grad/Slice_1/_685 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1263_gradients/concat_12_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

Caused by op 'CTCLoss', defined at:
  File "/usr/local/bin/kur", line 11, in <module>
    load_entry_point('kur==0.4.0rc0', 'console_scripts', 'kur')()
  File "/usr/local/lib/python3.4/dist-packages/kur/__main__.py", line 382, in main
    sys.exit(args.func(args) or 0)
  File "/usr/local/lib/python3.4/dist-packages/kur/__main__.py", line 62, in train
    func(step=args.step)
  File "/usr/local/lib/python3.4/dist-packages/kur/kurfile.py", line 371, in func
    return trainer.train(defaults)
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 233, in train
    **kwargs
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 553, in wrapped_train
    self.compile('train', with_provider=provider)
  File "/usr/local/lib/python3.4/dist-packages/kur/model/executor.py", line 113, in compile
    **kwargs
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/keras_backend.py", line 641, in compile
    self.process_loss(model, loss)
  File "/usr/local/lib/python3.4/dist-packages/kur/backend/keras_backend.py", line 557, in process_loss
    self.find_compiled_layer_by_name(model, target)
  File "/usr/local/lib/python3.4/dist-packages/kur/loss/ctc.py", line 234, in get_loss
    transcript_length
  File "/usr/local/lib/python3.4/dist-packages/keras/backend/tensorflow_backend.py", line 3042, in ctc_batch_cost
    sequence_length=input_length), 1)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/ctc_ops.py", line 145, in ctc_loss
    ctc_merge_repeated=ctc_merge_repeated)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/gen_ctc_ops.py", line 164, in _ctc_loss
    name=name)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Labels length is zero in batch 17
  [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_659, ToInt64/_661, GatherNd, Squeeze_2/_663)]]
  [[Node: gradients/concat_12_grad/Slice_1/_685 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1263_gradients/concat_12_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

[DEBUG 2017-03-26 23:40:23,292 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_asr
[DEBUG 2017-03-26 23:40:23,309 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_total
[DEBUG 2017-03-26 23:40:23,318 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_time
[DEBUG 2017-03-26 23:40:23,324 kur.loggers.binary_logger:135] Adding data to binary column: batch_loss_batch
[DEBUG 2017-03-26 23:40:23,331 kur.loggers.binary_logger:144] Writing logger summary.
[DEBUG 2017-03-26 23:40:23,873 kur.providers.batch_provider:204] Next batch of data has been prepared.
Exception ignored in: <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x7f5cc12dac50>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 581, in __del__
AttributeError: 'NoneType' object has no attribute 'TF_DeleteStatus'
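A note on reading the message itself: as far as I understand TensorFlow's CTC loss op, the number in "Labels length is zero in batch 17" is the index of the offending sample inside the current mini-batch, not a count of batches processed. If that reading is right, the sample to inspect is element 17 of whatever batch was last submitted. The sketch below shows the lookup once the batch columns are available. The column names audio_source and transcript_length are borrowed from the logs in this thread, while the function name and sample values are hypothetical.

```python
def find_zero_length_labels(audio_sources, transcript_lengths):
    """Return (index, source) pairs for batch elements that have no labels.

    Both arguments are per-sample columns of one mini-batch, in the same order.
    """
    if len(audio_sources) != len(transcript_lengths):
        raise ValueError("batch columns must have the same length")
    return [
        (index, source)
        for index, (source, length) in enumerate(
            zip(audio_sources, transcript_lengths)
        )
        if int(length) == 0
    ]


# Hypothetical batch of three samples; the middle one has an empty transcript.
batch_sources = ["audio/aaa", "audio/bbb", "audio/ccc"]
batch_lengths = [12, 0, 7]
print(find_zero_length_labels(batch_sources, batch_lengths))
# prints [(1, 'audio/bbb')]
```

The index in each returned pair is the value that would be compared against the number in the error message.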

ajsyp commented 7 years ago

Do you know what version of Kur you are using? Is it from PyPI, or from git (and if so, what commit are you at)? When I run the off-the-shelf speech recognition example (speech.yml), I get:

$ yes | kur -vv train --step speech.yml
[INFO 2017-03-28 12:52:44,630 kur.kurfile:710] Parsing source: speech.yml, included by top-level.
[INFO 2017-03-28 12:52:44,652 kur.kurfile:85] Parsing Kurfile...
[DEBUG 2017-03-28 12:52:44,653 kur.kurfile:827] Parsing Kurfile section: settings
[DEBUG 2017-03-28 12:52:44,658 kur.kurfile:827] Parsing Kurfile section: train
...
[DEBUG 2017-03-28 12:53:47,185 kur.providers.batch_provider:156] Preparing next batch of data...
...
audio_source (16,): ['/home/ajsyp/kur/lsdc-train/audio/67290d16-4254-4db3-93cf-a9c26cc6e19b'
'/home/ajsyp/kur/lsdc-train/audio/d26b6d7c-c672-4504-9ffc-8d7035762260'
'/home/ajsyp/kur/lsdc-train/audio/e375c6a1-cd1f-4308-a9ad-c18177452706'
'/home/ajsyp/kur/lsdc-train/audio/f568130e-3e23-4801-bb00-7abd54908be9'
...

If you aren't getting anything like that, then knowing your Kur version (and installation method / git commit) and having your Kurfile will help debug.
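For reporting the version, the traceback earlier in this thread includes load_entry_point('kur==0.4.0rc0', ...), which suggests the installed release, though it is worth confirming directly. A minimal sketch for doing that from Python follows. It assumes Python 3.8 or newer for importlib.metadata (on older interpreters, pkg_resources offers a similar lookup), and the helper name installed_version is hypothetical. For a git checkout, running git rev-parse HEAD in the repository prints the current commit.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints None if Kur is not installed in the current environment.
print("kur version:", installed_version("kur"))
```

Pasting this value together with the Kurfile should be enough to tell whether the installed release predates the verbose batch dump shown above.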

ajsyp commented 7 years ago

Any more information you can offer / response to my last comment?

YinJerry commented 7 years ago

I am sorry, I am on a business trip and cannot check what you recommended right now. You can close this for now. I will follow up when I am back in two weeks. Thank you!