Post the full error trace; also try without block_ngram_repeat.
Here's the full error trace:
(base) X-MacBook-Pro-3:OpenNMT-py user$ python translate.py -batch_size 2 -model models/_step_49000.pt -src /Users/user/Documents/abs_summary/abstractor/dataset/debug_sents.txt -output output/abs_test -beam_size 3 -block_ngram_repeat 2 -replace_unk -shard_size 50000 -gpu -1
[2019-09-13 09:08:17,302 INFO] Translating shard 0.
//anaconda3/lib/python3.7/site-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
var = torch.tensor(arr, dtype=self.dtype, device=device)
Traceback (most recent call last):
File "translate.py", line 49, in <module>
main(opt)
File "translate.py", line 33, in main
attn_debug=opt.attn_debug
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translator.py", line 353, in translate
translations = xlation_builder.from_batch(batch_data)
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 95, in from_batch
for n in range(self.n_best)]
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 95, in <listcomp>
for n in range(self.n_best)]
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 43, in _build_target_tokens
tokens.append(src_vocab.itos[tok - len(vocab)])
IndexError: list index out of range
Without block_ngram_repeat it runs fine.
Thanks.
Hi @mrpega, could you also provide us with the preprocessing command?
Hey @pltrdy thanks for the reply. Here's the preprocessing command:
python preprocess.py -train_src dataset/train_source_sent.txt \
-train_tgt dataset/train_target_sent.txt \
-valid_src dataset/valid_source_sent.txt \
-valid_tgt dataset/valid_target_sent.txt \
-save_data dataset/processed/ \
-src_seq_length 300 \
-dynamic_dict \
-share_vocab \
-shard_size 100000
While we're at it, here's the training command:
python -u train.py -data dataset/processed/ \
-save_model models/ \
-layers 4 \
-rnn_size 512 \
-word_vec_size 512 \
-max_grad_norm 2 \
-optim adam \
-encoder_type transformer \
-decoder_type transformer \
-position_encoding \
-dropout 0.2 \
-warmup_steps 3000 \
-learning_rate 2 \
-decay_method noam \
-label_smoothing 0.1 \
-adam_beta2 0.998 \
-batch_size 4097 \
-batch_type tokens \
-normalization tokens \
-max_generator_batches 2 \
-train_steps 50000 \
-accum_count 4 \
-share_embeddings \
-copy_attn \
-param_init_glorot \
-reuse_copy_attn \
-seed 888 \
-report_every 1 \
-valid_steps 4 \
-log_file logs/abs_expX.log \
-save_checkpoint_steps 1000 \
-exp expX
Thanks again
Well, that's a bit weird. Let's inspect what is going on. You can edit onmt/translate/translation.py:43
(which is crashing) to include some debugging, e.g.:
src_tok = tok - len(vocab)
print("len src vocab: %d" % len(src_vocab))
print("len tgt vocab: %d" % len(tgt_vocab))
print("token: %d, (w/o copy: %d)" % (tok, src_tok))
tokens.append(src_vocab.itos[tok - len(vocab)])
The output will be super verbose (which may slow down the translation process); just wait for it to crash and report those values.
@pltrdy is it working well for you?
I have placed the debugging code and it throws this out:
(base) X-MacBook-Pro-3:OpenNMT-py user$ python translate.py -batch_size 2 -model models/_step_49000.pt -src /Users/user/Documents/abs_summary/abstractor/dataset/debug_sents.txt -output output/abs_pv_ai_test -beam_size 3 -block_ngram_repeat 2 -replace_unk -seed 888 -shard_size 50000 -gpu -1
[2019-09-13 17:39:28,762 INFO] Translating shard 0.
//anaconda3/lib/python3.7/site-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
var = torch.tensor(arr, dtype=self.dtype, device=device)
len src vocab: 8
len tgt vocab: 50004
token: 50008, (w/o copy: 4)
len src vocab: 8
len tgt vocab: 50004
token: 50008, (w/o copy: 4)
len src vocab: 8
len tgt vocab: 50004
token: 50014, (w/o copy: 10)
Traceback (most recent call last):
File "translate.py", line 49, in <module>
main(opt)
File "translate.py", line 33, in main
attn_debug=opt.attn_debug
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translator.py", line 353, in translate
translations = xlation_builder.from_batch(batch_data)
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 100, in from_batch
for n in range(self.n_best)]
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 100, in <listcomp>
for n in range(self.n_best)]
File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 48, in _build_target_tokens
tokens.append(src_vocab.itos[tok - len(vocab)])
IndexError: list index out of range
@mrpega I've been experimenting in a similar setup, i.e. summarization with copy_attn and -block_ngram_repeat, without encountering this issue at any point.
From my understanding, this error occurs when the prediction is an out-of-vocabulary token that got copied by the copy mechanism (which explains why it's not in tgt_vocab).
However, I checked on my own models, and it turns out that (i) I'm facing similar cases, i.e. copying OOVs: the code goes through this line AND produces a token that is not in the vocabulary, and (ii) that token makes perfect sense, i.e. the copy mechanism isn't broken (even for OOVs). That's good news for me, but it still doesn't help me understand your problem.
You could possibly print more values to check.
My only "idea" would be that maybe, somehow, the token id refers to the position in the sentence, while the vocabulary only counts UNIQUE words. If there's no word repetition in your sentence, that breaks my theory.
@pltrdy Thank you so much for spending the time to look at it. I really appreciate it.
It's kind of weird to me as well. This happens very rarely though, and at the moment I can get around it by setting the batch size to 1.
Please don't close this off yet though; I'm still investigating it.
Thanks!
@pltrdy Today I had a couple of hours to debug at a granular level, and I think it confirms my suspicion that it's due to batch_size > 1.
Case in point, 2 sentences:
eg1: ["the", "assyrians", "first", "rose", "around", "2,500"] (6 tokens)
eg2: ["i", "love", "animals", ",", "'", "ashley", "told", "mailonline", "."] (9 tokens)
When the "src_map" is constructed (onmt/inputters/dataset_base.py:51) based on src's extended vocab, it looks fine(on it's own, batch_size=1).
eg1's src_map shape: 6x1x8 (8 because 6 original tokens + 2 for the unk and pad tokens)
tensor([[[0., 0., 0., 0., 0., 0., 0., 1.]],
        [[0., 0., 0., 0., 1., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 1., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 1., 0.]],
        [[0., 0., 0., 1., 0., 0., 0., 0.]],
        [[0., 0., 1., 0., 0., 0., 0., 0.]]])
eg2's src_map shape: 9x1x11 (11 because 9 original tokens + 2 for the unk and pad tokens)
tensor([[[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]],
        [[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]],
        [[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]]])
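Just to make those shapes concrete, here is a toy reconstruction of such a per-example src_map. This is not the actual onmt/inputters code; build_src_map and the vocab ordering are assumptions for illustration.

import torch

def build_src_map(sentence, ex_vocab):
    # One one-hot row per source token, indexing into that example's extended vocab.
    src_map = torch.zeros(len(sentence), 1, len(ex_vocab))
    for t, word in enumerate(sentence):
        src_map[t, 0, ex_vocab.index(word)] = 1.0
    return src_map

eg1 = ["the", "assyrians", "first", "rose", "around", "2,500"]
eg1_vocab = ["<unk>", "<blank>"] + sorted(set(eg1))   # 8 entries, same ordering as the dump above
print(build_src_map(eg1, eg1_vocab).shape)            # torch.Size([6, 1, 8])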
The problem comes when you use batch_size=2 (anything more than 1), because obviously the shorter example gets padded up to the larger tensor:
batch.src_map shape: 9x2x11
tensor([[[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],
        [[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.], [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
Note that the 6 below appears because I used a beam size of 3 together with a batch size of 2.
Eventually this src_map is used to compute the "copy_prob" (6x11), and the copy_prob is concatenated with the tgt vocab distribution (onmt/modules/copy_generator.py:132): 6x50004 concatenated with 6x11 along dim 1 = 6x50015.
At this point I was wondering whether it's possible for eg1 (only 6 tokens) to select one of the padded positions, under some very slim chance that a padded position gets a higher probability than the normal tokens.
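To illustrate that suspicion with concrete shapes, here is a standalone sketch; the probabilities are just random stand-ins, not the model's real output.

import torch

torch.manual_seed(0)
beam_size, batch_size = 3, 2
tgt_vocab_size, padded_src_len = 50004, 11            # shapes from the dumps above

vocab_probs = torch.rand(beam_size * batch_size, tgt_vocab_size)  # stand-in for the generator's vocab scores
copy_probs = torch.rand(beam_size * batch_size, padded_src_len)   # stand-in for the attention-derived copy scores
scores = torch.cat([vocab_probs, copy_probs], dim=1)              # 6 x 50015, as described above

# For eg1 (extended vocab of 8) only columns 50004..50011 are real copy positions;
# columns 50012..50014 exist only because of padding up to eg2's length. If the beam
# ever picks one of those columns, tok - len(tgt_vocab) >= 8 and
# src_vocab.itos[tok - len(tgt_vocab)] raises IndexError.
tok = scores[0].argmax().item()
if tok >= tgt_vocab_size:
    print("picked a copy column with offset", tok - tgt_vocab_size)
else:
    print("picked an ordinary target-vocab id", tok)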
Fast forward to the _build_target_tokens function that we have been working on, at onmt/translate/translation.py:43. I added a few more debugging lines, such as:
def _build_target_tokens(self, src: List, src_vocab: torchtext.vocab.Vocab,
                         src_raw: List, pred: List, attn: List) -> List[str]:
    tgt_field = self.fields.tgt.base_field
    vocab = tgt_field.vocab
    tokens = []
    print('@_build_target_tokens now::')
    print('src vocab:', src_vocab.itos)
    print('pred:', len(pred), pred)
    for tok in pred:
        print('at toke:', tok, 'now')
        if tok < len(vocab):
            print('token exists in target vocab..')
            tokens.append(vocab.itos[tok])
        else:
            print('token DOESNt exists in target vocab..')
            print('src vocab:', src_vocab.itos)
            print('append:', src_vocab.itos[tok - len(vocab)], 'from src vocab on id', tok - len(vocab))
            tokens.append(src_vocab.itos[tok - len(vocab)])
        if tokens[-1] == tgt_field.eos_token:
            tokens = tokens[:-1]
            break
    if attn is not None and src is not None:
        for i in range(len(tokens)):
            if tokens[i] == tgt_field.unk_token:
                _, max_index = attn[i][:len(src_raw)].max(0)
                tokens[i] = src_raw[max_index.item()]
    print('\n\n')
    return tokens
Then it crashed because it queries index 50014, which is outside eg1's src vocab; the maximum for eg1 is only 50004 + 8 = 50012. That sort of confirms my suspicion that there's a slim chance a padded position can be selected under a certain distribution of probabilities, and when that padded position is used to query the src vocab, it crashes. Does that make sense?
Thanks!
@_build_target_tokens now::
src vocab: ['<unk>', '<blank>', '2,500', 'around', 'assyrians', 'first', 'rose', 'the']
pred: 100 tensor([ 4, 50008, 75, 1732, 141, 5613, 2893, 141, 4, 50008,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0,
50014, 0, 50014, 0, 50014, 0, 50014, 0, 50014, 0])
at toke: tensor(4) now
token exists in target vocab..
at toke: tensor(50008) now
token DOESNt exists in target vocab..
src vocab: ['<unk>', '<blank>', '2,500', 'around', 'assyrians', 'first', 'rose', 'the']
append: assyrians from src vocab on id tensor(4)
at toke: tensor(75) now
token exists in target vocab..
at toke: tensor(1732) now
token exists in target vocab..
at toke: tensor(141) now
token exists in target vocab..
at toke: tensor(5613) now
token exists in target vocab..
at toke: tensor(2893) now
token exists in target vocab..
at toke: tensor(141) now
token exists in target vocab..
at toke: tensor(4) now
token exists in target vocab..
at toke: tensor(50008) now
token DOESNt exists in target vocab..
src vocab: ['<unk>', '<blank>', '2,500', 'around', 'assyrians', 'first', 'rose', 'the']
append: assyrians from src vocab on id tensor(4)
at toke: tensor(50014) now
At this point I was wondering whether it's possible for eg1 (only 6 tokens) to select one of the padded positions, under some very slim chance that a padded position gets a higher probability than the normal tokens.
This is prevented by copy_generator.py:119.
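For anyone hitting this before a proper fix lands upstream, one possible local workaround (my own assumption, not an official fix) is to guard the copy lookup in _build_target_tokens and fall back to the unk token when the copy offset lands outside the example's source vocabulary, e.g. replacing the crashing line with:

offset = tok - len(vocab)
if 0 <= offset < len(src_vocab.itos):
    tokens.append(src_vocab.itos[offset])
else:
    # Out-of-range copy index (e.g. a padded position): degrade to <unk> instead of
    # crashing, so -replace_unk can still substitute a source word via attention.
    tokens.append(tgt_field.unk_token)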
Could you please help me resolve this?
@chillaxkrish looks like you're on the wrong repo: https://github.com/fawazsammani/knowing-when-to-look-adaptive-attention
Hi all you helpful folks, I ran into this error message:
IndexError: list index out of range
while doing inference: python translate.py -batch_size 2 -model models/_step_X.pt -src dataset/debug_sents.txt -output output/test -beam_size 3 -block_ngram_repeat 2 -replace_unk -seed 888 -shard_size 50000 -gpu -1
The loaded model is just a trained model. The contents of debug_sents.txt are just 2 lines:
I suspect this is related to #584, because it works when the batch size is 1.
I have confirmed this error after checking out the latest code.
Thanks for the help!