OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

'Field' object has no attribute 'vocab' #1767

Closed ziyanyang closed 4 years ago

ziyanyang commented 4 years ago

Hi,

I'm trying to do domain adaptation as described here (https://github.com/OpenNMT/OpenNMT-py/issues/768). I want to fine-tune a pre-trained model (from https://opennmt.net/Models-py/) on multi30k, and I followed the instructions here (https://opennmt.net/OpenNMT-py/extended.html) to preprocess the data. However, when I retrain the model using:

CUDA_VISIBLE_DEVICES=1,2 python train.py -world_size 2 -gpu_ranks 0 1 -batch_size 64 -encoder_type brnn -rnn_size 500 -save_model available_models/multi30k_finetune -data data/multi30k.atok.low -reset_optim keep_states -train_from available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt -learning_rate 0.1

I get the following error:

[2020-03-31 23:34:16,775 INFO] Loading dataset from data/multi30k.atok.low.train.0.pt
[2020-03-31 23:34:17,091 INFO] number of examples: 29000
[2020-03-31 23:34:17,474 INFO] Loading checkpoint from available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt
[2020-03-31 23:34:17,771 INFO] Loading vocab from checkpoint at available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt.
[2020-03-31 23:34:17,771 INFO] src vocab size = 35444
[2020-03-31 23:34:17,771 INFO] tgt vocab size = 24725
[2020-03-31 23:34:17,771 INFO] Building model...
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/bin/train.py", line 115, in batch_producer
    b = next_batch(0)
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/bin/train.py", line 111, in next_batch
    new_batch = next(generator_to_serve)
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/inputters/inputter.py", line 822, in __iter__
    for batch in self._iter_dataset(path):
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/inputters/inputter.py", line 804, in _iter_dataset
    for batch in cur_iter:
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/inputters/inputter.py", line 695, in __iter__
    self.device)
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/field.py", line 338, in numericalize
    arr = [self.vocab.stoi[x] for x in arr]
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/field.py", line 338, in <listcomp>
    arr = [self.vocab.stoi[x] for x in arr]
AttributeError: 'Field' object has no attribute 'vocab'

Has anyone run into a similar problem? The vocab in iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt is an 'old style vocab', and I'm not sure whether that causes the error.

francoishernandez commented 4 years ago

Hmm, not sure if this is linked, but you need to preprocess with the same vocab as the model's if you want to -train_from it. You can retrieve it from the checkpoint:

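import torch

# Replace "checkpoint.pt" with the path to your checkpoint file.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
# Both the src and tgt vocabs are stored under the 'vocab' key.
torch.save(checkpoint["vocab"], "vocab.pt")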

And then preprocess using -src_vocab "vocab.pt". (Only -src_vocab is necessary here, as both src and tgt vocabs are stored in the .pt file.)

ziyanyang commented 4 years ago

Thank you for the suggestion. I tried it, but the error persists. I found that the pre-trained model's vocab has only two entries: [('src', <torchtext.vocab.Vocab object at 0x7f2968601750>), ('tgt', <torchtext.vocab.Vocab object at 0x7f28f3d45d10>)]. However, at onmt/inputters/inputter.py line 692, yield torchtext.data.Batch(minibatch, self.dataset, self.device), self.dataset has four fields: dict_keys(['src', 'tgt', 'indices', 'corpus_id']). The last field, 'corpus_id', has no vocab in the pre-trained model, which explains the error AttributeError: 'Field' object has no attribute 'vocab'.
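
The mismatch described above can be reproduced with plain Python objects. This is only a minimal sketch: the Field class below is a hypothetical stand-in for torchtext's Field, the vocab contents are toy values, and treating 'indices' as a field that needs no vocab is an assumption.

```python
class Field:
    """Hypothetical stand-in for torchtext.data.Field (not the real class)."""

    def __init__(self, use_vocab=True):
        self.use_vocab = use_vocab
        # No self.vocab is set here; it only exists once a vocab is attached.

    def numericalize(self, tokens):
        if not self.use_vocab:
            return list(tokens)
        # Same access pattern as the failing line in the traceback.
        return [self.vocab[token] for token in tokens]


def fields_missing_vocab(fields):
    """Names of fields that need a vocab but do not have one yet."""
    return sorted(
        name for name, field in fields.items()
        if field.use_vocab and not hasattr(field, "vocab")
    )


# The four dataset fields reported above; 'indices' is assumed to be numeric.
fields = {
    "src": Field(),
    "tgt": Field(),
    "indices": Field(use_vocab=False),
    "corpus_id": Field(),
}

# Old-style checkpoint vocab: only 'src' and 'tgt' entries (toy contents).
checkpoint_vocab = [("src", {"hello": 0}), ("tgt", {"hallo": 0})]
for name, vocab in checkpoint_vocab:
    fields[name].vocab = vocab

print(fields_missing_vocab(fields))

try:
    fields["corpus_id"].numericalize(["train"])
except AttributeError as err:
    print(err)
```

Running a check like this before training would point at 'corpus_id' as the only field left without a vocab.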

francoishernandez commented 4 years ago

Oh yes, this field was added in #1732. We probably need to add a patch for such a case.

francoishernandez commented 4 years ago

Hey @ziyanyang

#1769 should fix this.

Would you mind checking it's all good on your end before I merge?

ziyanyang commented 4 years ago

Hi, in train.py the function patch_fields(opt, fields) raises AttributeError: 'list' object has no attribute 'get'. This happens because patch_fields calls dvocab = torch.load(opt.data + '.vocab.pt') to load the vocab of the new data, but if I preprocess the new data with the old vocab, the new data's vocab is the same as the old one. So this step loads the same vocab, which is a list instead of a dictionary and still does not have 'corpus_id'.
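
The list-versus-dict problem can be illustrated without torch. The sketch below assumes the old-style vocab is a list of (field name, vocab) pairs, as shown earlier in the thread, and that the newer format is a dict keyed by field name; as_vocab_dict is a hypothetical helper, not an OpenNMT-py function.

```python
def as_vocab_dict(vocab):
    """Return a dict keyed by field name for either vocab layout."""
    if isinstance(vocab, dict):
        return dict(vocab)
    # Old style: a list of (field_name, vocab) pairs.
    return {name: field_vocab for name, field_vocab in vocab}


# Toy old-style vocab; the real entries would be torchtext Vocab objects.
old_style = [
    ("src", ["<unk>", "<pad>", "hello"]),
    ("tgt", ["<unk>", "<pad>", "hallo"]),
]

try:
    old_style.get("corpus_id")
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'get'

vocab = as_vocab_dict(old_style)
print(vocab.get("corpus_id"))  # None: the field is still missing
```

Converting the layout removes the AttributeError, but the 'corpus_id' entry still has to be added separately.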

Is 'corpus_id' the same for most text data? For multi30k (using its own generated vocab), it contains {'<unk>': 0, '<pad>': 1, 'train': 2}. Is it only used to indicate the type of data?

francoishernandez commented 4 years ago

It's for when we use multiple datasets (-data_ids / -train_ids). The corpus_id field was added in #1732 to track which corpus each example originates from, so that noise can be applied only to examples from some of those datasets. It will probably also be useful in the future for treating different datasets differently.
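
As a rough illustration of that idea, the sketch below tags each example with a corpus_id and applies a noise function only to the selected corpora. The function names, the corpus name "synthetic", and the toy noise are all hypothetical; this is not how OpenNMT-py implements it.

```python
def apply_noise(examples, noisy_corpora, noise_fn):
    """Apply noise_fn to the source tokens of examples from selected corpora."""
    result = []
    for example in examples:
        tokens = example["src"]
        if example["corpus_id"] in noisy_corpora:
            tokens = noise_fn(tokens)
        # Copy the example so the input list is left unchanged.
        result.append({**example, "src": tokens})
    return result


def drop_every_third(tokens):
    """Toy deterministic noise: remove every third token."""
    return [tok for i, tok in enumerate(tokens) if (i + 1) % 3 != 0]


examples = [
    {"src": ["a", "b", "c", "d"], "corpus_id": "synthetic"},
    {"src": ["e", "f", "g", "h"], "corpus_id": "train"},
]

noised = apply_noise(examples, {"synthetic"}, drop_every_third)
for example in noised:
    print(example["corpus_id"], example["src"])
```

Only the example from the "synthetic" corpus loses a token; the "train" example passes through untouched.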

francoishernandez commented 4 years ago

Ok, I just updated the PR. In preprocess, we will now add the corpus_id field to the existing vocab and also convert that vocab to the 'new' dict format. Let me know if that works for you!

ziyanyang commented 4 years ago

It works fine now. Thank you so much!