daniel-kukiela / nmt-chatbot

NMT Chatbot
GNU General Public License v3.0

Training rushes through all epochs after error while decoding to model/output_dev #156

Open M0rica opened 4 years ago

M0rica commented 4 years ago

First of all, my specs: GTX 1070 Ti (8 GB VRAM), 16 GB RAM, Ryzen 7 2700, training on an M.2 SSD.

My issue is that the model somehow fails to decode to the model/output_dev file while training (at different steps each time, most often after 5k or 10k steps), which causes it to rush through all remaining epochs instantly with the same error and then finish training. I've read about someone who had the same issue and solved it by decreasing the batch size, but I tried that as well and nothing helped:

decoding to output model/output_dev_5000 2020-05-08 18:17:53.781721: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 64.0KiB (rounded to 65536). Current allocation summary follows. 2020-05-08 18:17:53.786389: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 25, Chunks in use: 24. 6.3KiB allocated for chunks. 6.0KiB in use in bin. 118B client-requested in use in bin. 2020-05-08 18:17:53.791567: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): Total Chunks: 1, Chunks in use: 0. 768B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.796353: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): Total Chunks: 3, Chunks in use: 3. 3.8KiB allocated for chunks. 3.8KiB in use in bin. 3.0KiB client-requested in use in bin. 2020-05-08 18:17:53.802074: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.806886: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.811956: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): Total Chunks: 20, Chunks in use: 20. 160.0KiB allocated for chunks. 160.0KiB in use in bin. 160.0KiB client-requested in use in bin. 2020-05-08 18:17:53.816760: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.821948: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.827046: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): Total Chunks: 7, Chunks in use: 7. 539.5KiB allocated for chunks. 539.5KiB in use in bin. 494.0KiB client-requested in use in bin. 2020-05-08 18:17:53.831938: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): Total Chunks: 556, Chunks in use: 556. 86.89MiB allocated for chunks. 86.89MiB in use in bin. 59.73MiB client-requested in use in bin. 2020-05-08 18:17:53.837532: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.842321: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.847328: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.851795: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): Total Chunks: 9, Chunks in use: 9. 22.00MiB allocated for chunks. 22.00MiB in use in bin. 22.00MiB client-requested in use in bin. 2020-05-08 18:17:53.857609: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): Total Chunks: 1, Chunks in use: 1. 5.97MiB allocated for chunks. 5.97MiB in use in bin. 3.00MiB client-requested in use in bin. 
2020-05-08 18:17:53.862976: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): Total Chunks: 576, Chunks in use: 576. 5.16GiB allocated for chunks. 5.16GiB in use in bin. 5.15GiB client-requested in use in bin. 2020-05-08 18:17:53.868138: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): Total Chunks: 1, Chunks in use: 1. 16.14MiB allocated for chunks. 16.14MiB in use in bin. 9.16MiB client-requested in use in bin. 2020-05-08 18:17:53.873656: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): Total Chunks: 1, Chunks in use: 1. 55.00MiB allocated for chunks. 55.00MiB in use in bin. 55.00MiB client-requested in use in bin. 2020-05-08 18:17:53.878707: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.883832: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): Total Chunks: 6, Chunks in use: 6. 885.64MiB allocated for chunks. 885.64MiB in use in bin. 842.45MiB client-requested in use in bin. 2020-05-08 18:17:53.889042: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin. 2020-05-08 18:17:53.894022: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 64.0KiB was 64.0KiB, Chunk State: 2020-05-08 18:17:53.897269: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 6667798272 2020-05-08 18:17:57.863482: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 00000008926C5600 next 18446744073709551615 of size 16920832 2020-05-08 18:17:57.866990: I tensorflow/core/common_runtime/bfc_allocator.cc:914] Summary of in-use Chunks by size: 2020-05-08 18:17:57.869504: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 24 Chunks of size 256 totalling 6.0KiB 2020-05-08 18:17:57.872427: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3 Chunks of size 1280 totalling 3.8KiB 2020-05-08 18:17:57.874727: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 20 Chunks of size 8192 totalling 160.0KiB 2020-05-08 18:17:57.877055: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 65536 totalling 320.0KiB 2020-05-08 18:17:57.879407: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 112128 totalling 109.5KiB 2020-05-08 18:17:57.882387: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 112640 totalling 110.0KiB 2020-05-08 18:17:57.884772: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 277 Chunks of size 131072 totalling 34.63MiB 2020-05-08 18:17:57.887408: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 149504 totalling 146.0KiB 2020-05-08 18:17:57.890274: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 278 Chunks of size 196608 totalling 52.13MiB 2020-05-08 18:17:57.892768: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 2097152 totalling 10.00MiB 2020-05-08 18:17:57.895113: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 3145728 totalling 12.00MiB 2020-05-08 18:17:57.898162: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 6257152 totalling 5.97MiB 2020-05-08 18:17:57.900500: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 14 Chunks of size 8388608 totalling 112.00MiB 2020-05-08 18:17:57.903097: I 
tensorflow/core/common_runtime/bfc_allocator.cc:917] 553 Chunks of size 9600512 totalling 4.94GiB 2020-05-08 18:17:57.905463: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 12258048 totalling 11.69MiB 2020-05-08 18:17:57.908372: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 12467456 totalling 11.89MiB 2020-05-08 18:17:57.910728: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 12582912 totalling 72.00MiB 2020-05-08 18:17:57.913084: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 16633088 totalling 15.86MiB 2020-05-08 18:17:57.915528: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 16920832 totalling 16.14MiB 2020-05-08 18:17:57.918616: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 57671680 totalling 55.00MiB 2020-05-08 18:17:57.920980: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 153606144 totalling 732.45MiB 2020-05-08 18:17:57.923380: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 160628736 totalling 153.19MiB 2020-05-08 18:17:57.926221: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 6.21GiB 2020-05-08 18:17:57.928590: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocatedbytes: 6667798272 memorylimit: 6667798446 available bytes: 174 curr_region_allocationbytes: 13335597056 2020-05-08 18:17:57.932841: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: Limit: 6667798446 InUse: 6667797248 MaxInUse: 6667798016 NumAllocs: 666955 MaxAllocSize: 489619712

2020-05-08 18:17:57.938305: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **** Exception in thread Thread-5: Traceback (most recent call last): File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call return fn(*args) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn target_list, run_metadata) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. (0) Internal: Dst tensor is not initialized. [[{{node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup}}]] (1) Internal: Dst tensor is not initialized. [[{{node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup}}]] [[dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/All/_221]] 0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner self.run() File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "train.py", line 88, in nmt_train tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 299, in run _run_main(main, args) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 250, in _run_main sys.exit(main(argv)) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 701, in main run_main(FLAGS, default_hparams, train_fn, inference_fn) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 694, in run_main train_fn(hparams, target_session=target_session, summary_callback=summary_callback) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 518, in train sample_tgt_data, avg_ckpts, summary_callback=summary_callback) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 351, in run_full_eval summary_writer, avg_ckpts, summary_callback=summary_callback) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 288, in run_internal_and_external_eval summary_callback=summary_callback) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 177, in run_external_eval avg_ckpts=avg_ckpts) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 740, in _external_eval infer_mode=hparams.infer_mode) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\utils\nmt_utils.py", line 60, in decode_and_evaluate nmtoutputs, = model.decode(sess) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 692, in decode output_tuple = self.infer(sess) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 680, in infer return sess.run(output_tuple) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run run_metadata_ptr) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run run_metadata) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. (0) Internal: Dst tensor is not initialized. [[node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup (defined at C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]] (1) Internal: Dst tensor is not initialized. 
[[node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup (defined at C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]] [[dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/All/_221]] 0 successful operations. 0 derived errors ignored.

Original stack trace for 'dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup': File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 890, in _bootstrap self._bootstrap_inner() File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner self.run() File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run self._target(*self._args, self._kwargs) File "train.py", line 88, in nmt_train tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 299, in run _run_main(main, args) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 250, in _run_main sys.exit(main(argv)) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 701, in main run_main(FLAGS, default_hparams, train_fn, inference_fn) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 694, in run_main train_fn(hparams, target_session=target_session, summary_callback=summary_callback) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 477, in train infer_model = model_helper.create_infer_model(model_creator, hparams, scope) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model_helper.py", line 228, in create_infer_model extra_args=extra_args) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\attention_model.py", line 64, in init extra_args=extra_args) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 95, in init res = self.build_graph(hparams, scope=scope) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 393, in build_graph self._build_decoder(self.encoder_outputs, encoder_state, hparams)) File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 587, in _build_decoder scope=decoder_scope) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\decoder.py", line 469, in dynamic_decode swap_memory=swap_memory) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2753, in while_loop return_same_structure) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2245, in BuildLoop pred, body, original_loop_vars, loop_vars, shape_invariants) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2170, in _BuildLoop body_result = body(packed_vars_for_body) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2705, in body = lambda i, lv: (i + 1, orig_body(lv)) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\decoder.py", line 412, in body decoder_finished) = decoder.step(time, inputs, state) File "C:\Users\Der 
Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\basic_decoder.py", line 145, in step sample_ids=sample_ids) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 627, in next_inputs lambda: self._embedding_fn(sample_ids)) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func return func(*args, *kwargs) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 1235, in cond orig_res_f, res_f = context_f.BuildCondBranch(false_fn) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 1061, in BuildCondBranch original_result = fn() File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 627, in lambda: self._embedding_fn(sample_ids)) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 579, in lambda ids: embedding_ops.embedding_lookup(embedding, ids)) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\embedding_ops.py", line 317, in embedding_lookup transform_fn=None) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\embedding_ops.py", line 135, in _embedding_lookup_and_transform array_ops.gather(params[0], ids, name=name), ids, max_norm) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\dispatch.py", line 180, in wrapper return target(args, kwargs) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\array_ops.py", line 3956, in gather params, indices, axis, name=name) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\gen_array_ops.py", line 4082, in gather_v2 batch_dims=batch_dims, name=name) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func return func(*args, **kwargs) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op attrs, op_def, compute_device) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal op_def=op_def) File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in init self._traceback = tf_stack.extract_stack()

I have a vocab-size of 75k, and I am trying to train a model with ~10.7 million pairs. I trained a smaller model with around 800k pairs before with no issues. The person from the other issue report says it's a memory issue caused by too large a batch size, and he mentions that a batch size of 16 worked for him, but even a batch size of 4 causes this error (at step 40k) in my case, which is interesting because he has the same graphics card. I also tried decreasing the vocab-size to 15k, but again the same error. Can someone help me? Thanks.
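For scale, a rough sanity check on how the vocab size shows up in the allocator dump above (the interpretation of the chunks is an assumption; the arithmetic itself is straightforward):

# Size of one float32 matrix of shape (vocab_size, num_units) at these settings.
vocab_size, num_units, bytes_per_float32 = 75_000, 512, 4
size_bytes = vocab_size * num_units * bytes_per_float32
print(size_bytes, size_bytes / 2**20)  # 153600000 bytes ~= 146.5 MiB
# This is very close to the 153,606,144-byte chunks in the log above, so the
# vocabulary-sized tensors plausibly account for those ~146 MiB blocks -- but
# most of the 6.2 GiB in use sits elsewhere, so shrinking the vocab alone may
# not be enough.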

Nathan-Chell commented 4 years ago

When you say "I also tried to decrease the vocab-size to 15k", did you re-run prepare_data.py? If not, then you need to. The error you are experiencing is indeed due to your GPU running out of memory. Reducing the batch size will fix this. What is the size of your model?
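For reference, the batch size is one of the hparams in settings.py; a minimal sketch of the relevant entries (only the key names quoted in this thread are confirmed, the surrounding layout and values are assumptions):

# settings.py -- hedged sketch, adjust values to taste
hparams = {
    "batch_size": 16,                 # first thing to lower on an out-of-memory error
    "override_loaded_hparams": True,  # so a changed value takes effect on a resumed run
}
# After changing the vocab size or the source data, re-run prepare_data.py
# before train.py so the generated vocabulary files match the new settings.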

M0rica commented 4 years ago

I always ran prepare_data.py after changing any settings, and as I said, the error occurred even with a batch size of 4, just a few external evaluations later than with higher ones. Edit: I also run a program called HWInfo64 while training, which monitors the usage of all hardware, and it said that the GPU memory never went over 92%, so in theory there should be enough memory, but I don't know how accurate the program is...

Nathan-Chell commented 4 years ago

HWInfo is very good, and quite accurate. What is the size of your model, and what GPU are you training it on?

M0rica commented 4 years ago

I train on a GTX 1070 Ti. What exactly do you mean by size of the model? Everything is default except vocab-size=75,000, and I have 10.7 million training pairs (3.8 GB of text files in total).

Nathan-Chell commented 4 years ago

I mean the number of neurons and layers you have in your network. You should easily be able to fit the model you have described, with a batch size of 4, into 8 GB of memory. Are you updating the settings in settings.py, and do you have override existing settings set to True?

M0rica commented 4 years ago

I got the standard model size of num_layers=2 with num_units=512 and override_loaded_hparams=True; I set the settings in settings.py. These are all the hparams in settings.py:

"attention": "scaled_luong",
"num_train_steps": 10000000,
"num_layers": 2,
"num_encoder_layers": 2,
#"num_decoder_layers": 2,
"num_units": 512,
"batch_size": 4,
"override_loaded_hparams": True,
#"decay_scheme": "luong234"
"residual": True,
"optimizer": "adam",
"encoder_type": "bi",
"learning_rate": 0.001,
"beam_width": 20,
"length_penalty_weight": 1.0,
"num_translations_per_input": 20,
#"num_keep_ckpts": 5,

M0rica commented 4 years ago

Small update: I tried lots of different settings over the last few days, and literally NOTHING seems to work. The smallest model I tried training had 2 layers, 256 units, a vocab-size of 15,000 and a batch size of 1, but even this very low configuration caused the same error. Interestingly, HWInfo says that no matter what batch-size, vocab-size etc. I use, the GPU memory usage stays the same at around 7.2 GB every single time. Btw, I'm using Python 3.7.6 and tensorflow-gpu 1.15.2 (I also tried 1.14 but nothing was different, so I don't think TensorFlow is the problem) with CUDA 10 and cuDNN 7.6.4.
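One note on the flat 7.2 GB reading: TensorFlow 1.x pre-allocates almost all free GPU memory at startup by default, so an external monitor like HWInfo cannot tell how much the model actually needs. A minimal sketch of changing that behaviour, assuming you edit wherever nmt/train.py constructs its tf.Session -- this makes the readings meaningful but does not by itself prevent an out-of-memory failure:

import tensorflow as tf  # tensorflow-gpu 1.15.x

# Allocate VRAM on demand instead of reserving nearly all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.8  # or cap the share instead
sess = tf.Session(config=config)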

Nathan-Chell commented 4 years ago

How much system RAM do you have?

M0rica commented 4 years ago

I have 16 GB of DDR4-3000 RAM; the training uses about 4 GB, and total RAM usage is at about 70% during training.

M0rica commented 4 years ago

So it seems like I fixed the issue, but the solution is not perfect: I just added 'steps_per_external_eval': 10000000000 to the hparams in settings.py, which prevents external evals from running. This way it works with a batch size of 16 and a model with 2 layers and 512 units, but I think the BLEU score won't be updated, which is not a big problem. Once I stopped the training, it made an external eval on the next startup, which again caused a crash. To prevent this, I commented out "run_full_eval" under "#First evaluation" (l. 514) in the train.py inside the nmt folder, and it works!
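For anyone copying this workaround, a hedged sketch of the settings.py side (only the steps_per_external_eval key and the values quoted above are confirmed; the surrounding dict layout is assumed):

# settings.py -- workaround sketch: push the external eval past num_train_steps
hparams = {
    "steps_per_external_eval": 10000000000,  # effectively disables dev-set decoding/BLEU
    "batch_size": 16,
    "num_layers": 2,
    "num_units": 512,
}
# The second half of the workaround is manual: comment out the run_full_eval(...)
# call under "# First evaluation" (around line 514) in nmt/train.py, as described above.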