daniel-kukiela / nmt-chatbot

NMT Chatbot

Train.py Input/output error #116

Open Prickman opened 5 years ago

Prickman commented 5 years ago

I'm trying to train a chatbot on Google Colab. Training ran successfully for around 16k steps before hitting this error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: data/train.bpe.to; Input/output error
  [[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[?,?], [?,?], [?,?], [?], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
  [[{{node dynamic_seq2seq/decoder/LuongAttention/memory_layer/Tensordot/GatherV2_1/_291}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge2171...GatherV2_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 88, in nmt_train
    tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/nmt.py", line 599, in main
    run_main(FLAGS, default_hparams, train_fn, inference_fn)
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/nmt.py", line 592, in run_main
    train_fn(hparams, target_session=target_session, summary_callback=summary_callback)
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/train.py", line 358, in train
    step_result = loaded_train_model.train(train_sess)
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/model.py", line 266, in train
    self.learning_rate])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: data/train.bpe.to; Input/output error
  [[node IteratorGetNext (defined at /content/drive/testingv2/nmt-chatbot/nmt/nmt/utils/iterator_utils.py:196) = IteratorGetNext[output_shapes=[[?,?], [?,?], [?,?], [?], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
  [[{{node dynamic_seq2seq/decoder/LuongAttention/memory_layer/Tensordot/GatherV2_1/_291}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge2171...GatherV2_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'IteratorGetNext', defined at:
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 88, in nmt_train
    tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/nmt.py", line 599, in main
    run_main(FLAGS, default_hparams, train_fn, inference_fn)
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/nmt.py", line 592, in run_main
    train_fn(hparams, target_session=target_session, summary_callback=summary_callback)
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/train.py", line 302, in train
    train_model = model_helper.create_train_model(model_creator, hparams, scope)
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/model_helper.py", line 100, in create_train_model
    shard_index=jobid)
  File "/content/drive/testingv2/nmt-chatbot/nmt/nmt/utils/iterator_utils.py", line 196, in get_iterator
    tgt_seq_len) = (batched_iter.get_next())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 421, in get_next
    name=name)), self._output_types,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): data/train.bpe.to; Input/output error
  [[node IteratorGetNext (defined at /content/drive/testingv2/nmt-chatbot/nmt/nmt/utils/iterator_utils.py:196) = IteratorGetNext[output_shapes=[[?,?], [?,?], [?,?], [?], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
  [[{{node dynamic_seq2seq/decoder/LuongAttention/memory_layer/Tensordot/GatherV2_1/_291}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge2171...GatherV2_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Btw, I had stopped training at 16k steps to quickly test the bot with inference.py, and the error appeared after I resumed. All settings are the defaults except the number of epochs, which I set to 5.

Prickman commented 5 years ago

I restarted train.py after deleting epochs_passed, and it gave me the same error again after only 100 steps this time.

Prickman commented 5 years ago

I re-ran prepare_data.py and trained reliably up to 31k steps before train.py threw the same exception again. Now every time I re-run train.py, it throws the same exception. Any help please?
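
For anyone hitting the same error: the traceback shows the failure while reading data/train.bpe.to, and the project path is under /content/drive, so one thing worth checking is whether the file itself can be read end to end. Below is a minimal diagnostic sketch, not a confirmed fix. The helper names `first_unreadable_offset` and `copy_to_local` are hypothetical, and the idea that a local copy avoids the error is only an assumption.

```python
import os
import shutil


def first_unreadable_offset(path, chunk_size=1 << 20):
    """Read `path` sequentially in binary mode.

    Returns the byte offset at which an OSError (for example an
    input/output error) is first raised, or None if the whole file
    reads cleanly. Errors raised while opening the file propagate.
    """
    offset = 0
    with open(path, "rb") as handle:
        while True:
            try:
                chunk = handle.read(chunk_size)
            except OSError:
                return offset
            if not chunk:
                return None
            offset += len(chunk)


def copy_to_local(src, dst_dir):
    """Copy `src` into `dst_dir` (created if missing); return the new path.

    Assumption: reading training data from local disk rather than from
    the mounted drive sidesteps the I/O error. This is untested here.
    """
    os.makedirs(dst_dir, exist_ok=True)
    return shutil.copy2(src, dst_dir)
```

Usage would look like `first_unreadable_offset("data/train.bpe.to")`: a return value of `None` means the file read through without an I/O error at that moment, while an integer points at roughly where reading failed.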

kaiyu-tang commented 5 years ago

@Prickman have you fixed the problem now?