THUNLP-MT / THUMT

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
BSD 3-Clause "New" or "Revised" License

The error of training an RNNsearch model #30

Closed Julisa-test closed 6 years ago

Julisa-test commented 6 years ago

Here is my command. What causes the error, and how can I solve it?

(venv-2.7.14) ubuntu@ubuntu:~/python2.7/tensorflow$ python THUMT/thumt/bin/trainer.py --input corpus.tc.32k.de.shuf corpus.tc.32k.en.shuf --vocabulary vocab.32k.de.txt vocab.32k.en.txt --model rnnsearch --validation newstest2014.tc.32k.de --references newstest2014.tc.32k.en --parameters=batch_size=128,device_list=[0],train_steps=200000
INFO:tensorflow:Restoring hyper parameters from /home/ubuntu/python2.7/tensorflow/train/params.json
Traceback (most recent call last):
  File "THUMT/thumt/bin/trainer.py", line 472, in <module>
    main(parse_args())
  File "THUMT/thumt/bin/trainer.py", line 317, in main
    params = import_params(args.output, args.model, params)
  File "THUMT/thumt/bin/trainer.py", line 122, in import_params
    params.parse_json(json_str)
  File "/home/ubuntu/.pyenv/versions/venv-2.7.14/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/hparam.py", line 587, in parse_json
    return self.override_from_dict(values_map)
  File "/home/ubuntu/.pyenv/versions/venv-2.7.14/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/hparam.py", line 539, in override_from_dict
    self.set_hparam(name, value)
  File "/home/ubuntu/.pyenv/versions/venv-2.7.14/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/hparam.py", line 490, in set_hparam
    param_type, is_list = self._hparam_types[name]
KeyError: u'num_hidden_layers'

Thanks in advance.

Playinf commented 6 years ago

num_hidden_layers is a hyper-parameter specific to the seq2seq architecture. This error should not happen unless you have renamed seq2seq.json to rnnsearch.json. I think you should delete the train directory and try the command again.
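To illustrate the failure mode: the traceback shows a saved train/params.json being replayed into the current model's parameters, and a key the model does not define raises the KeyError. The following is only a sketch under that assumption; `stale_keys` is a hypothetical helper, not part of THUMT or TensorFlow, and the parameter values are made up for illustration.

```python
import json


def stale_keys(saved_json, known_names):
    """Return keys in a saved params JSON string that the current model does not define."""
    saved = json.loads(saved_json)
    return sorted(set(saved) - set(known_names))


# Illustrative values only: a params.json left behind by a run with another model.
saved_json = json.dumps(
    {"batch_size": 128, "train_steps": 200000, "num_hidden_layers": 2}
)
# Names the current model is assumed to register (hypothetical list).
known_names = ["batch_size", "train_steps", "device_list"]

print(stale_keys(saved_json, known_names))
```

Any key this reports would trigger the same kind of KeyError when the file is restored, which is why starting from an empty train directory avoids the problem.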

Julisa-test commented 6 years ago

I just downloaded all of the THUMT packages from GitHub and then tried to reproduce your experiments by following the user manual. The training files were generated automatically after I ran the code in section 3.2.1 of the user manual. I did not rename any files. The error occurred just after running the code in 3.2.2 and then running the code in 3.2.2. I use Python 2.7.0 and TensorFlow 1.6.0. Is one of these versions wrong?

Playinf commented 6 years ago

That's weird. I have tested the latest commit of THUMT and did not find this problem. Have you tried deleting the train directory and re-running the command?

Julisa-test commented 6 years ago

Hi, thanks for your reply!

I tried deleting the train directory and re-running the command, but now I get this error:

2018-05-01 15:30:16.636478: W tensorflow/core/common_runtime/bfc_allocator.cc:279] **************************************************************************************************xx
2018-05-01 15:30:16.647710: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2018-05-01 15:30:16.647734: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
2018-05-01 15:30:16.647724: E tensorflow/core/common_runtime/bfc_allocator.cc:381] tried to deallocate nullptr
2018-05-01 15:30:16.647755: E tensorflow/core/common_runtime/bfc_allocator.cc:381] tried to deallocate nullptr
Aborted (core dumped)

nvidia driver version: 390.30
CUDA Version 9.0.176
cudnn7_7.1.3.16
gcc version 5.4.0

Should I ignore this?

Playinf commented 6 years ago

It seems that you have run out of GPU memory. Try to reduce the batch_size hyper-parameter.
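As a rough sketch of the adjustment, the snippet below halves batch_size inside an override string of the form passed to --parameters in the original command. `halve_batch_size` is a hypothetical helper written for this example, not a THUMT function, and halving is an arbitrary choice; the thread does not say which value finally worked.

```python
def halve_batch_size(overrides):
    """Halve batch_size in a 'k=v,k=v' override string, leaving other items untouched."""
    # Split on commas, but not on commas inside brackets (e.g. device_list=[0,1]).
    items, depth, current = [], 0, ""
    for ch in overrides:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            items.append(current)
            current = ""
        else:
            current += ch
    items.append(current)

    out = []
    for item in items:
        name, sep, value = item.partition("=")
        if sep and name == "batch_size":
            # Never go below 1.
            item = name + "=" + str(max(1, int(value) // 2))
        out.append(item)
    return ",".join(out)


print(halve_batch_size("batch_size=128,device_list=[0],train_steps=200000"))
```

The resulting string can be passed back through --parameters= when re-running the trainer; if memory still runs out, the same step can be repeated.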

Julisa-test commented 6 years ago

Thanks very much, it finally works!