Did you move that around the filesystem (did the path change)?
What do you mean by "after training my model"? How many records are in your corpus (training set), and how many training steps have you run?
Sorry about that; "after executing train.py" is what I should have said. I am using your sample records for training and all default parameters in settings.py. I wish I could elaborate more, but I'm very new to TensorFlow and Python.
If you run train.py, you should see something like:
global step 263100 lr 0.001 step-time 2.31s wps 1.90K ppl 74.88 gN 11.67 bleu 0.00
Please tell me that global step value.
Perhaps this is the root of my problem. This is the output from train.py; it seems to be failing.
`# Job id 0
saving hparams to /home/science/tf-demo/models/nmt-chatbot/model/hparams
saving hparams to /home/science/tf-demo/models/nmt-chatbot/model/best_bleu/hparams
attention=scaled_luong
attention_architecture=standard
batch_size=128
beam_width=10
best_bleu=0
best_bleu_dir=/home/science/tf-demo/models/nmt-chatbot/model/best_bleu
check_special_token=True
colocate_gradients_with_ops=True
decay_factor=1.0
decay_steps=10000
dev_prefix=/home/science/tf-demo/models/nmt-chatbot/data/tst2012
dropout=0.2
encoder_type=bi
eos=</s>
epoch_step=0
forget_bias=1.0
infer_batch_size=32
init_op=uniform
init_weight=0.1
learning_rate=0.001
learning_rate_decay_scheme=
length_penalty_weight=1.0
log_device_placement=False
max_gradient_norm=5.0
max_train=0
metrics=['bleu']
num_buckets=5
num_embeddings_partitions=0
num_gpus=1
num_layers=2
num_residual_layers=0
num_train_steps=500000
num_translations_per_input=10
num_units=512
optimizer=adam
out_dir=/home/science/tf-demo/models/nmt-chatbot/model
output_attention=True
override_loaded_hparams=True
pass_hidden_state=True
random_seed=None
residual=False
share_vocab=False
sos=<s>
source_reverse=False
src=from
src_max_len=50
src_max_len_infer=None
src_vocab_file=/home/science/tf-demo/models/nmt-chatbot/data/vocab.from
src_vocab_size=15003
start_decay_step=0
steps_per_external_eval=None
steps_per_stats=100
subword_option=
test_prefix=/home/science/tf-demo/models/nmt-chatbot/data/tst2013
tgt=to
tgt_max_len=50
tgt_max_len_infer=None
tgt_vocab_file=/home/science/tf-demo/models/nmt-chatbot/data/vocab.to
tgt_vocab_size=15003
time_major=True
train_prefix=/home/science/tf-demo/models/nmt-chatbot/data/train
unit_type=lstm
vocab_prefix=/home/science/tf-demo/models/nmt-chatbot/data/vocab
warmup_scheme=t2t
warmup_steps=0
num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
learning_rate=0.001, warmup_steps=0, warmup_scheme=t2t
decay_scheme=, start_decay_step=0, decay_steps 10000, decay_factor 1
embeddings/encoder/embedding_encoder:0, (15003, 512),
embeddings/decoder/embedding_decoder:0, (15003, 512),
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (1024, 512),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (1536, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (1536, 512), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (512, 15003), /device:GPU:0
num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
embeddings/encoder/embedding_encoder:0, (15003, 512),
embeddings/decoder/embedding_decoder:0, (15003, 512),
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (1024, 512),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (1536, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (1536, 512), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (512, 15003), /device:GPU:0
num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
embeddings/encoder/embedding_encoder:0, (15003, 512),
embeddings/decoder/embedding_decoder:0, (15003, 512),
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (1024, 512),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (1536, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (1536, 512), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (512, 15003),
2018-02-26 18:34:36.090546: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Killed`
It looks like the process was killed for some reason, but there's no further info here; I'm not sure why. Maybe you were running out of RAM? Take a look at syslog.
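If it was the kernel's OOM killer, there will usually be a trace of it in the system log. A minimal sketch for checking that, assuming a Debian/Ubuntu-style /var/log/syslog (the path and exact wording vary by distro, and reading the log may require root):

```python
# Scan the system log for OOM-killer messages (Linux only).
# Assumes /var/log/syslog exists; on other distros try /var/log/messages,
# or use `journalctl -k` instead.
import re

def find_oom_entries(log_path="/var/log/syslog"):
    pattern = re.compile(r"out of memory|killed process", re.IGNORECASE)
    with open(log_path, errors="replace") as log:
        return [line.rstrip() for line in log if pattern.search(line)]

if __name__ == "__main__":
    for entry in find_oom_entries():
        print(entry)
```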
Thanks. I'll do some more digging and post back here with the results.
OK, so I upgraded the box with more RAM and rebooted. That gets me much further, but now it aborts:
`2018-02-26 19:00:30.235705: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
created train model with fresh parameters, time 16.37s
created infer model with fresh parameters, time 5.69s
src: yeah , when they use texture it just makes it grosser
ref: Creamy is one of those words that belongs as far from sex as possible .
nmt: bounce EQ EQ parole parole parole TALK Klopp TALK TALK TALK spices spices spices streets streets streets streets streets streets streets streets
created eval model with fresh parameters, time 4.98s
eval dev: perplexity 16152.77, time 21s, Mon Feb 26 19:01:23 2018.
eval test: perplexity 16152.77, time 16s, Mon Feb 26 19:01:40 2018.
2018-02-26 19:01:43.591896: W tensorflow/core/kernels/lookup_util.cc:362] Table trying to initialize from file /home/science/tf-demo/models/nmt-chatbot/data/vocab.to is already initialized.
2018-02-26 19:01:43.593025: W tensorflow/core/kernels/lookup_util.cc:362] Table trying to initialize from file /home/science/tf-demo/models/nmt-chatbot/data/vocab.to is already initialized.
2018-02-26 19:01:43.593982: W tensorflow/core/kernels/lookup_util.cc:362] Table trying to initialize from file /home/science/tf-demo/models/nmt-chatbot/data/vocab.from is already initialized.
created infer model with fresh parameters, time 3.01s
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)`
How much RAM is in that machine now? bad_alloc might mean there's still not enough RAM, or some other memory issue.
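For reference, a quick way to see how much memory the box actually has is to parse /proc/meminfo (a small sketch, Linux only):

```python
# Report total and available RAM by parsing /proc/meminfo (values are in kB).
def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # value in kB
    return info

if __name__ == "__main__":
    mem = meminfo()
    print("MemTotal:     %.1f GB" % (mem["MemTotal"] / 1024.0 / 1024.0))
    print("MemAvailable: %.1f GB" % (mem["MemAvailable"] / 1024.0 / 1024.0))
```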
So I upgraded again to a box with 16 GB of RAM and 4 CPUs, which fixed the problems above. My model is now training, and this is the latest output:
global step 100 lr 0.001 step-time 21.34s wps 0.28K ppl 1197.55 gN 30.43 bleu 0.00
However, when I try to enter input while running inference.py, I'm presented with the following error:
Starting interactive mode (first response will take a while)
test
2018-02-26 23:19:43.794421: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Traceback (most recent call last):
File "inference.py", line 277, in <module>
answers = process_questions(question)[0]
File "inference.py", line 238, in process_questions
answers_list = inference_helper(prepared_questions)
File "inference.py", line 174, in start_inference
return inference_helper(question)
File "inference.py", line 167, in <lambda>
inference_helper = lambda question: do_inference(question, *inference_object)
File "inference.py", line 91, in do_inference
loaded_infer_model = nmt.inference.model_helper.load_model(infer_model.model, flags.ckpt, sess, "infer")
File "/home/science/nmt-chatbot/nmt/nmt/model_helper.py", line 465, in load_model
model.saver.restore(session, ckpt)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1682, in restore
raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.
science@machine-learning-and-s-6vcpu-16gb-nyc1-01:~/nmt-chatbot$
Is this because the model hasn't trained enough?
Yes, you have to wait for the first checkpoint to be saved. You'll know when that happens because you'll see information on your console, and some evaluation will run as well. To train a model you should wait for at least two full epochs (that info will be printed to the console too).
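The traceback above comes from flags.ckpt being None, which normally just means there is no checkpoint in the model directory yet. A quick sketch to check, assuming the out_dir from the hparams dump above (adjust the path to your setup):

```python
# Check whether training has written a checkpoint yet; inference.py can only
# restore the model once tf.train.latest_checkpoint() finds one.
import tensorflow as tf

out_dir = "/home/science/tf-demo/models/nmt-chatbot/model"  # out_dir from hparams
ckpt = tf.train.latest_checkpoint(out_dir)
if ckpt is None:
    print("No checkpoint saved yet - keep training until the first one appears.")
else:
    print("Latest checkpoint:", ckpt)
```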
Thank you. I went through some of the tutorials again, and I think the checkpoint and output_dev file are generated after 5,000 steps?
This is my current output:
global step 100 lr 0.001 step-time 21.34s wps 0.28K ppl 1197.55 gN 30.43 bleu 0.00
global step 200 lr 0.001 step-time 20.92s wps 0.29K ppl 530.19 gN 13.81 bleu 0.00
global step 300 lr 0.001 step-time 20.00s wps 0.30K ppl 318.84 gN 11.21 bleu 0.00
global step 400 lr 0.001 step-time 20.37s wps 0.30K ppl 256.20 gN 10.01 bleu 0.00
Yes, every 5k steps, but also at the end of each epoch. You can calculate how many steps make up an epoch by dividing the number of entries in your corpus (training set) by the batch size (128 by default, configurable in setup/settings.py); see the sketch below.
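For example, a tiny sketch of that calculation (the train.from path here is an assumption based on the train_prefix and src shown in the hparams dump earlier in this thread):

```python
# Estimate how many global steps make up one epoch:
# number of training examples divided by the batch size
# (128 by default, configurable in setup/settings.py).
import math

def steps_per_epoch(train_file, batch_size=128):
    with open(train_file, encoding="utf-8", errors="replace") as f:
        num_examples = sum(1 for _ in f)
    return math.ceil(num_examples / batch_size)

print(steps_per_epoch("/home/science/tf-demo/models/nmt-chatbot/data/train.from"))
```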
Thank you for all your help (do you guys have a donate link?)
In case anyone runs into problems similar to mine: the root of my issues turned out to be package conflicts between Python versions. I also had to reinstall TensorFlow after installing the packages from requirements.txt; I'm not sure why.
I'm glad you solved your issues.
And the link you were asking for: https://pythonprogramming.net/support/
Instead of "global step" I am getting "step"; is that a problem?
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
RESTART: C:\Users\kumar\Desktop\finalyear_project\nmt-chatbot-master\train.py
Training model...
Epoch: 1, steps per epoch: 7947, epoch ends at 7947 steps, learning rate: 0.001 - training
using source vocab for target
saving hparams to model/hparams
saving hparams to model/best_bleu\hparams
attention=scaled_luong
attention_architecture=standard
avg_ckpts=False
batch_size=128
beam_width=20
best_bleu=0
best_bleu_dir=model/best_bleu
check_special_token=True
colocate_gradients_with_ops=True
decay_scheme=
dev_prefix=data/tst2012.bpe
dropout=0.2
embed_prefix=None
encoder_type=bi
eos=</s>
epoch_step=0
forget_bias=1.0
infer_batch_size=32
infer_mode=greedy
init_op=uniform
init_weight=0.1
language_model=False
learning_rate=0.001
length_penalty_weight=1.0
log_device_placement=False
max_gradient_norm=5.0
max_train=0
metrics=['bleu']
num_buckets=5
num_dec_emb_partitions=0
num_decoder_layers=2
num_decoder_residual_layers=0
num_embeddings_partitions=0
num_enc_emb_partitions=0
num_encoder_layers=2
num_encoder_residual_layers=0
num_gpus=1
num_inter_threads=0
num_intra_threads=0
num_keep_ckpts=5
num_sampled_softmax=0
num_train_steps=7947
num_translations_per_input=20
num_units=512
optimizer=adam
out_dir=model/
output_attention=True
override_loaded_hparams=True
pass_hidden_state=True
random_seed=None
residual=False
sampling_temperature=0.0
share_vocab=True
sos=<s>
src=from
src_embed_file=
src_max_len=50
src_max_len_infer=None
src_vocab_file=data/vocab.bpe.from
src_vocab_size=15003
steps_per_external_eval=None
steps_per_stats=100
subword_option=spm
test_prefix=data/tst2013.bpe
tgt=to
tgt_embed_file=
tgt_max_len=50
tgt_max_len_infer=None
tgt_vocab_file=data/vocab.bpe.from
tgt_vocab_size=15003
time_major=True
train_prefix=data/train.bpe
unit_type=lstm
use_char_encode=False
vocab_prefix=data/vocab.bpe
warmup_scheme=t2t
warmup_steps=0
num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
learning_rate=0.001, warmup_steps=0, warmup_scheme=t2t
decay_scheme=, start_decay_step=7947, decay_steps 0, decay_factor 1
Format:
num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
Format:
num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
decoder: infer_mode=greedybeam_width=20, length_penalty=1.000000
Format:
created train model with fresh parameters, time 2.17s
created infer model with fresh parameters, time 0.54s
src: ▁< met a ▁name = " br ows er - err ors - ur l " ▁content = " https : / / ap i . github . com / _ p riv ate / br ows er / err ors " >
ref: ▁< met a ▁name = " br ows er - err ors - ur l " ▁content = " https : / / ap i . github . com / _ p riv ate / br ows er / err ors " >
nmt: til cruel cruelGHGHGHGHཻཻཻholholholhol tons tons tons tons tons tons tonsaughaughaughaughaughaughaughaughaughaugh Key Key Key Key Key Key Key Key Key tier tier tier tier tier tier tierpadpadpadpad Af Af bri bri bri Afoman Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af Af
created eval model with fresh parameters, time 0.60s
eval dev: perplexity 18138.46, time 24s, Tue Oct 2 09:20:45 2018.
eval test: perplexity 18254.26, time 24s, Tue Oct 2 09:21:10 2018.
created infer model with fresh parameters, time 0.55s
step 100 lr 0.001 step-time 25.89s wps 0.23K ppl 1151.96 gN 27.21 bleu 0.00, Tue Oct 2 10:04:21 2018
Perhaps I somehow have my training files in the wrong location? After training my model and running inference.py, I get the following error:
`Starting interactive mode (first response will take a while):