OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Please help: IndexError: list index out of range #1559

Closed · mrpega closed this 4 years ago

mrpega commented 5 years ago

Hi all helpful folks, I'm hitting IndexError: list index out of range while doing inference with:

python translate.py -batch_size 2 -model models/_step_X.pt -src dataset/debug_sents.txt -output output/test -beam_size 3 -block_ngram_repeat 2 -replace_unk -seed 888 -shard_size 50000 -gpu -1

The loaded model is just a regular trained model. The contents of debug_sents.txt are just these 2 lines:

the assyrians first rose around 2,500
i love animals , ' ashley told mailonline .

I suspect this is related to #584 because it works when the batch size is 1.

I have confirmed that the error persists after checking out the latest code.

Thanks for the help!

vince62s commented 5 years ago

Post the full error trace, and also try without block_ngram_repeat.

mrpega commented 5 years ago

Here's the full error trace:

(base) X-MacBook-Pro-3:OpenNMT-py user$ python translate.py -batch_size 2 -model models/_step_49000.pt -src /Users/user/Documents/abs_summary/abstractor/dataset/debug_sents.txt -output output/abs_test -beam_size 3 -block_ngram_repeat 2 -replace_unk -shard_size 50000 -gpu -1
[2019-09-13 09:08:17,302 INFO] Translating shard 0.
//anaconda3/lib/python3.7/site-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  var = torch.tensor(arr, dtype=self.dtype, device=device)
Traceback (most recent call last):
  File "translate.py", line 49, in <module>
    main(opt)
  File "translate.py", line 33, in main
    attn_debug=opt.attn_debug
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translator.py", line 353, in translate
    translations = xlation_builder.from_batch(batch_data)
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 95, in from_batch
    for n in range(self.n_best)]
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 95, in <listcomp>
    for n in range(self.n_best)]
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 43, in _build_target_tokens
    tokens.append(src_vocab.itos[tok - len(vocab)])
IndexError: list index out of range

Without block_ngram_repeat it runs fine.

Thanks.

pltrdy commented 5 years ago

Hi @mrpega, could you also provide the preprocessing command?

mrpega commented 5 years ago

Hey @pltrdy thanks for the reply. Here's the preprocessing command:

python preprocess.py -train_src dataset/train_source_sent.txt \
                     -train_tgt dataset/train_target_sent.txt \
                     -valid_src dataset/valid_source_sent.txt \
                     -valid_tgt dataset/valid_target_sent.txt \
                     -save_data dataset/processed/ \
                     -src_seq_length 300 \
                     -dynamic_dict \
                     -share_vocab \
                     -shard_size 100000

While we're at it, here's the training command:


python -u train.py -data dataset/processed/ \
                   -save_model models/ \
                   -layers 4 \
                   -rnn_size 512 \
                   -word_vec_size 512 \
                   -max_grad_norm 2 \
                   -optim adam \
                   -encoder_type transformer \
                   -decoder_type transformer \
                   -position_encoding \
                   -dropout 0.2 \
                   -warmup_steps 3000 \
                   -learning_rate 2 \
                   -decay_method noam \
                   -label_smoothing 0.1 \
                   -adam_beta2 0.998 \
                   -batch_size 4097 \
                   -batch_type tokens \
                   -normalization tokens \
                   -max_generator_batches 2 \
                   -train_steps 50000 \
                   -accum_count 4 \
                   -share_embeddings \
                   -copy_attn \
                   -param_init_glorot \
                   -reuse_copy_attn \
                   -seed 888 \
                   -report_every 1 \
                   -valid_steps 4 \
                   -log_file logs/abs_expX.log \
                   -save_checkpoint_steps 1000 \
                   -exp expX

Thanks again

pltrdy commented 5 years ago

Well, that's a bit weird. Let's inspect what is going on. You can edit onmt/translate/translation.py:43 (the line that is crashing) to include some debugging, e.g.:

                # Debug: dump vocab sizes plus the raw and copy-offset token ids.
                src_tok = tok - len(vocab)
                print("len src vocab: %d" % len(src_vocab))
                print("len tgt vocab: %d" % len(vocab))
                print("token: %d, (w/o copy: %d)" % (tok, src_tok))
                tokens.append(src_vocab.itos[tok - len(vocab)])

The output will be super verbose (which may slow down translation); just wait for it to crash and report those values.

mrpega commented 5 years ago

@pltrdy is it working fine for you?

I have added the debugging code and here is the output:

(base) X-MacBook-Pro-3:OpenNMT-py user$ python translate.py -batch_size 2 -model models/_step_49000.pt -src /Users/user/Documents/abs_summary/abstractor/dataset/debug_sents.txt -output output/abs_pv_ai_test -beam_size 3 -block_ngram_repeat 2 -replace_unk -seed 888 -shard_size 50000 -gpu -1
[2019-09-13 17:39:28,762 INFO] Translating shard 0.
//anaconda3/lib/python3.7/site-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  var = torch.tensor(arr, dtype=self.dtype, device=device)
len src vocab: 8
len tgt vocab: 50004
token: 50008, (w/o copy: 4)
len src vocab: 8
len tgt vocab: 50004
token: 50008, (w/o copy: 4)
len src vocab: 8
len tgt vocab: 50004
token: 50014, (w/o copy: 10)
Traceback (most recent call last):
  File "translate.py", line 49, in <module>
    main(opt)
  File "translate.py", line 33, in main
    attn_debug=opt.attn_debug
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translator.py", line 353, in translate
    translations = xlation_builder.from_batch(batch_data)
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 100, in from_batch
    for n in range(self.n_best)]
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 100, in <listcomp>
    for n in range(self.n_best)]
  File "/Users/user/Documents/abs_summary/reproduce2/OpenNMT-py/onmt/translate/translation.py", line 48, in _build_target_tokens
    tokens.append(src_vocab.itos[tok - len(vocab)])
IndexError: list index out of range

pltrdy commented 5 years ago

@mrpega I've been experimenting in a similar setup, i.e. summarization with copy_attn and -block_ngram_repeat, without ever encountering this issue.

pltrdy commented 5 years ago

From my understanding, this error occurs when the prediction is an out-of-vocabulary token that gets copied via the copy mechanism (which explains why it is not in tgt_vocab).

However, I checked on my own models, and it turns out that (i) I'm hitting similar cases, i.e. the code reaches this line and copies an OOV token that is not in the target vocabulary, and (ii) that token makes perfect sense, i.e. the copy mechanism isn't broken (even for OOV). That's good news for me, but it still doesn't help me understand your problem.

You could possibly print a few more values as well.

My only "idea" would be that maybe, somehow, the token is refering to the id in the sentence, but the vocabulary is counting UNIQUE words. If there's no word repetition in your sentence, it breaks my theory.

mrpega commented 5 years ago

@pltrdy Thank you so much for spending the time to look at it. I really appreciate it.

It's kind of weird to me as well. This happens very rarely, though, and for now I can work around it by setting the batch size to 1.

Please don't close this yet though; I'm still investigating.

Thanks!

mrpega commented 5 years ago

@pltrdy Today I had a couple of hours to debug at a granular level, and I think it confirms my suspicion that it's due to batch_size > 1.

Case in point, 2 sentences:

eg1: ["the", "assyrians", "first", "rose", "around", "2,500"] (6 tokens)
eg2: ["i", "love", "animals", ",", "'", "ashley", "told", "mailonline", "."] (9 tokens)

When the "src_map" is constructed (onmt/inputters/dataset_base.py:51) based on src's extended vocab, it looks fine(on it's own, batch_size=1).

eg1's src_map shape: 6x1x8 (8 because 6 original tokens + 2 for the unk and pad tokens)

tensor([[[0., 0., 0., 0., 0., 0., 0., 1.]],
        [[0., 0., 0., 0., 1., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 1., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 1., 0.]],
        [[0., 0., 0., 1., 0., 0., 0., 0.]],
        [[0., 0., 1., 0., 0., 0., 0., 0.]]])

eg2's src_map shape: 9x1x11 (11 because 9 original tokens + 2 for the unk and pad tokens)

tensor([[[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]],
        [[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]],
        [[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]]])

The problem comes when you try to use batch_size=2 (more than 1), because obviously, it's going to be padded to the larger tensor:

batch.src_map shape: 9x2x11

tensor([[[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
         [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],
        [[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
         [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
        [[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])

Note that the 6 below comes from using a beam size of 3 together with a batch size of 2 (2 x 3 = 6).

Eventually this src_map is used to compute the "copy_prob" (6x11), and the copy_prob is concatenated with the distribution over the tgt vocab (onmt/modules/copy_generator.py:132): 6x50004 concatenated with 6x11 along dim 1 = 6x50015.

At this point I was wondering whether there is a very slim chance for eg1 (only 6 tokens) to select one of the padded tokens, i.e. a padded token ends up with higher probability than the normal tokens.
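
To make the shape arithmetic concrete, here is a tiny sketch with random tensors standing in for the real distributions (illustrative only, not the copy generator's actual code):

    import torch

    tgt_vocab_size = 50004
    vocab_dist = torch.rand(6, tgt_vocab_size)  # 6 rows = batch 2 x beam 3
    copy_dist = torch.rand(6, 11)               # copy scores over the PADDED extended axis

    # Every row gets 50004 + 11 = 50015 columns, even though eg1's own
    # extended vocab only has 8 entries.
    full_dist = torch.cat([vocab_dist, copy_dist], dim=1)
    print(full_dist.shape)                      # torch.Size([6, 50015])

    pred = full_dist.argmax(dim=1)              # can be as large as 50014
    # Any prediction >= 50004 + 8 = 50012 cannot be mapped back to eg1's
    # source vocab, which is exactly what blows up in _build_target_tokens.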

Fast forward to the _build_target_tokens function we have been working on, at onmt/translate/translation.py:43. I added a few more debugging lines such as:

    def _build_target_tokens(self, src: List, src_vocab: torchtext.vocab.Vocab,
                             src_raw: List, pred: List, attn: List) -> List[str]:
        tgt_field = self.fields.tgt.base_field
        vocab = tgt_field.vocab
        tokens = []
        print('@_build_target_tokens now::')
        print('src vocab:',src_vocab.itos)
        print('pred:',len(pred), pred)
        for tok in pred:
            print('at toke:',tok,'now')
            if tok < len(vocab):
                print('token exists in target vocab..')
                tokens.append(vocab.itos[tok])
            else:
                print('token DOESNt exists in target vocab..')
                print('src vocab:', src_vocab.itos)
                print('append:',src_vocab.itos[tok - len(vocab)], 'from src vocab on id',tok - len(vocab))
                tokens.append(src_vocab.itos[tok - len(vocab)])

            if tokens[-1] == tgt_field.eos_token:
                tokens = tokens[:-1]
                break

        if attn is not None and src is not None:
            for i in range(len(tokens)):
                if tokens[i] == tgt_field.unk_token:
                    _, max_index = attn[i][:len(src_raw)].max(0)
                    tokens[i] = src_raw[max_index.item()]
        print('\n\n')
        return tokens

Then it crashed because it queried index 50014, which is out of range for eg1's src vocab (eg1's extended vocab only goes up to 50004 + 8 = 50012). That more or less confirms my suspicion that there is a slim chance a padded token gets selected under certain probability distributions, and when that padded token is used to look up the src vocab, it crashes. Does that make sense?
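
If that theory holds, a defensive workaround (just a local sketch, not an official fix) would be to fall back to the unk token whenever the copy offset falls outside the example's source vocab, instead of indexing blindly:

            else:
                offset = tok - len(vocab)
                if 0 <= offset < len(src_vocab.itos):
                    tokens.append(src_vocab.itos[offset])
                else:
                    # The copy index points past this example's extended vocab
                    # (e.g. a padded position was selected); emit <unk> instead
                    # of raising IndexError. Sketch only.
                    tokens.append(tgt_field.unk_token)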

Thanks!

@_build_target_tokens now::
src vocab: ['<unk>', '<blank>', '2,500', 'around', 'assyrians', 'first', 'rose', 'the']
pred: 100 tensor([    4, 50008,    75,  1732,   141,  5613,  2893,   141,     4, 50008,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0,
        50014,     0, 50014,     0, 50014,     0, 50014,     0, 50014,     0])
at toke: tensor(4) now
token exists in target vocab..
at toke: tensor(50008) now
token DOESNt exists in target vocab..
src vocab: ['<unk>', '<blank>', '2,500', 'around', 'assyrians', 'first', 'rose', 'the']
append: assyrians from src vocab on id tensor(4)
at toke: tensor(75) now
token exists in target vocab..
at toke: tensor(1732) now
token exists in target vocab..
at toke: tensor(141) now
token exists in target vocab..
at toke: tensor(5613) now
token exists in target vocab..
at toke: tensor(2893) now
token exists in target vocab..
at toke: tensor(141) now
token exists in target vocab..
at toke: tensor(4) now
token exists in target vocab..
at toke: tensor(50008) now
token DOESNt exists in target vocab..
src vocab: ['<unk>', '<blank>', '2,500', 'around', 'assyrians', 'first', 'rose', 'the']
append: assyrians from src vocab on id tensor(4)
at toke: tensor(50014) now

pltrdy commented 5 years ago

At this point I was wondering whether there is a very slim chance for eg1 (only 6 tokens) to select one of the padded tokens, i.e. a padded token ends up with higher probability than the normal tokens.

This is prevented by copy_generator.py:119.
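
For context, the kind of guard that line would have to provide is masking copy scores on padded source positions, so that no probability mass can land on columns beyond each example's real length. A minimal sketch of that idea (not the repository's actual code; mask_padded_copy_scores is a made-up name):

    import torch

    def mask_padded_copy_scores(copy_attn, src_lengths):
        # copy_attn: (batch, src_len) copy attention scores.
        # Zero out every position at or beyond each example's true length,
        # so padded columns cannot receive probability mass. Sketch only.
        batch, src_len = copy_attn.shape
        mask = torch.arange(src_len).unsqueeze(0) >= src_lengths.unsqueeze(1)
        return copy_attn.masked_fill(mask, 0.0)

    # e.g. for the two debug sentences (6 and 9 tokens):
    scores = torch.rand(2, 9)
    masked = mask_padded_copy_scores(scores, torch.tensor([6, 9]))
    print(masked[0, 6:])   # tensor([0., 0., 0.]) -> no mass on eg1's padding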

chillaxkrish commented 3 years ago

Please help me figure out how to resolve this.

(image attached)

francoishernandez commented 3 years ago

@chillaxkrish looks like you're on the wrong repo: https://github.com/fawazsammani/knowing-when-to-look-adaptive-attention