Closed ziyanyang closed 4 years ago
Hmm, not sure if this is linked, but you need to preprocess with the same vocab as the model's if you want to -train_from it.
You can retrieve it from the checkpoint:
import torch
# Replace "checkpoint.pt" with the path to the pre-trained checkpoint.
checkpoint = torch.load("checkpoint.pt")
# Save the vocab so that preprocess can reuse it.
torch.save(checkpoint['vocab'], "vocab.pt")
And then preprocess using -src_vocab "vocab.pt".
(Only -src_vocab is necessary here, as both src and tgt vocabs are stored in the .pt file.)
Thank you for the suggestion. I tried it, but the error still exists. I found that the pre-trained model's vocab contains only two entries: [('src', <torchtext.vocab.Vocab object at 0x7f2968601750>), ('tgt', <torchtext.vocab.Vocab object at 0x7f28f3d45d10>)]. However, at onmt/inputters/inputter.py line 692 (yield torchtext.data.Batch(minibatch, self.dataset, self.device)), self.dataset has four fields: dict_keys(['src', 'tgt', 'indices', 'corpus_id']). The last field, 'corpus_id', has no vocab in the pre-trained model's vocab, which is what the error AttributeError: 'Field' object has no attribute 'vocab' points to.
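To make the mismatch concrete, here is a minimal sketch in plain Python. It assumes the old-style vocab is a list of (name, vocab) pairs, as printed above, and that 'indices' is a numeric field that needs no vocab. The helper missing_vocab_fields and the needs_vocab set are hypothetical names for illustration, not part of OpenNMT-py, and placeholder objects stand in for the torchtext Vocab instances.

```python
def missing_vocab_fields(dataset_fields, checkpoint_vocab, needs_vocab):
    """Return dataset fields that need a vocab but have none in the checkpoint.

    checkpoint_vocab is assumed to be an old-style list of (name, vocab) pairs.
    """
    known = {name for name, _ in checkpoint_vocab}
    return [
        name for name in dataset_fields
        if name in needs_vocab and name not in known
    ]


# Placeholder objects stand in for torchtext.vocab.Vocab instances.
old_style_vocab = [("src", object()), ("tgt", object())]
dataset_fields = ["src", "tgt", "indices", "corpus_id"]

# Assumption: 'indices' holds integers and is not numericalized via a vocab.
needs_vocab = {"src", "tgt", "corpus_id"}

missing = missing_vocab_fields(dataset_fields, old_style_vocab, needs_vocab)
print(missing)
```

Under these assumptions only 'corpus_id' is reported, which matches the field that fails during numericalization.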
Oh yes, this field was added in #1732. We probably need to add a patch for such a case.
Hey @ziyanyang,
#1769 should fix this.
Would you mind checking it's all good on your end before I merge?
Hi, in train.py the function patch_fields(opt, fields) now fails with AttributeError: 'list' object has no attribute 'get'. This is because patch_fields calls dvocab = torch.load(opt.data + '.vocab.pt') to load the vocab of the new data, but since I preprocessed the new data with the old vocab, the new data's vocab is the same as the old data's vocab. So this step loads the same vocab, which is a list instead of a dictionary and still does not have 'corpus_id'.
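The failure can be reproduced without OpenNMT-py at all. The sketch below assumes the old-style vocab is a list of (name, vocab) pairs and the new-style vocab is a dict keyed by field name; vocab_format is a hypothetical helper, and the strings stand in for real Vocab objects.

```python
def vocab_format(vocab):
    """Classify a loaded vocab object by its container type."""
    if isinstance(vocab, dict):
        return "new-style dict"
    if isinstance(vocab, list):
        return "old-style list"
    return "unknown"


# Placeholder strings stand in for torchtext Vocab objects.
old_style = [("src", "SRC_VOCAB"), ("tgt", "TGT_VOCAB")]
new_style = {"src": "SRC_VOCAB", "tgt": "TGT_VOCAB", "corpus_id": "IDS"}

# Calling .get on the list form raises the AttributeError reported above.
try:
    old_style.get("corpus_id")
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(vocab_format(old_style), "|", error_message)
print(vocab_format(new_style), "|", new_style.get("corpus_id"))
```

A check like this before calling .get would at least turn the crash into a clear message about which format was loaded.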
Is 'corpus_id' the same for most text data? For multi30k (using its own generated vocab) it contains {'<unk>': 0, '<pad>': 1, 'train': 2}. Is it only used to indicate the type of data?
It's for when we use multiple datasets (-data_ids / -train_ids). The corpus_id field was added in #1732 to track which corpus each example originates from, and to apply noise only to examples from some of those datasets. It will probably also be useful in the future for applying different treatment to different datasets.
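As a rough illustration of that idea (not the actual implementation from #1732), the sketch below tags each example with a corpus_id and applies a toy noise function only to examples from selected corpora. The function names, the example layout, and the "synthetic" corpus name are all hypothetical.

```python
def apply_noise_by_corpus(examples, noisy_corpora, noise_fn):
    """Apply noise_fn to the source tokens of examples from selected corpora."""
    result = []
    for example in examples:
        src = example["src"]
        if example["corpus_id"] in noisy_corpora:
            src = noise_fn(src)
        # Copy the example so the input list is left unchanged.
        result.append({**example, "src": src})
    return result


def drop_last_token(tokens):
    """Deterministic toy noise: remove the final token."""
    return tokens[:-1]


examples = [
    {"src": ["a", "b", "c"], "corpus_id": "train"},
    {"src": ["d", "e", "f"], "corpus_id": "synthetic"},
]

noised = apply_noise_by_corpus(examples, {"synthetic"}, drop_last_token)
for example in noised:
    print(example["corpus_id"], example["src"])
```

Examples from "train" pass through untouched, while those from "synthetic" are noised, which is the kind of per-corpus treatment the field makes possible.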
Ok, I just updated the PR. In preprocess, we will now add the corpus_id field to the existing vocab. And it will also update it to the 'new' dict format. Let me know if that works for you!
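Conceptually, the preprocess-time fix amounts to something like the following sketch. It is not the code from the PR: upgrade_vocab is a hypothetical helper, the corpus_id entry is modelled as a plain name-to-index dict, and a real torchtext Vocab may also contain special tokens.

```python
def upgrade_vocab(vocab, corpus_names):
    """Return a dict-format vocab that also has a 'corpus_id' entry.

    vocab may be an old-style list of (name, vocab) pairs or a dict.
    """
    # dict() accepts both a list of pairs and an existing dict (copied).
    fields = dict(vocab)
    if "corpus_id" not in fields:
        unique_names = sorted(set(corpus_names))
        fields["corpus_id"] = {name: i for i, name in enumerate(unique_names)}
    return fields


# Placeholder strings stand in for the src/tgt Vocab objects.
old_style = [("src", "SRC_VOCAB"), ("tgt", "TGT_VOCAB")]
upgraded = upgrade_vocab(old_style, ["train"])

print(sorted(upgraded))
print(upgraded["corpus_id"])
```

Because the result is a dict, a later lookup such as upgraded.get("corpus_id") works, and the existing src and tgt entries are carried over unchanged.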
It works fine now. Thank you so much!
Hi,
I'm trying to do domain adaptation as described here (https://github.com/OpenNMT/OpenNMT-py/issues/768). I want to fine-tune a pre-trained model (from https://opennmt.net/Models-py/) on multi30k, and I followed the instructions here (https://opennmt.net/OpenNMT-py/extended.html) to preprocess the data. However, when I retrain the model using:
CUDA_VISIBLE_DEVICES=1,2 python train.py -world_size 2 -gpu_ranks 0 1 -batch_size 64 -encoder_type brnn -rnn_size 500 -save_model available_models/multi30k_finetune -data data/multi30k.atok.low -reset_optim keep_states -train_from available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt -learning_rate 0.1
I get the following error:
[2020-03-31 23:34:16,775 INFO] Loading dataset from data/multi30k.atok.low.train.0.pt
[2020-03-31 23:34:17,091 INFO] number of examples: 29000
[2020-03-31 23:34:17,474 INFO] Loading checkpoint from available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt
[2020-03-31 23:34:17,771 INFO] Loading vocab from checkpoint at available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt.
[2020-03-31 23:34:17,771 INFO] src vocab size = 35444
[2020-03-31 23:34:17,771 INFO] tgt vocab size = 24725
[2020-03-31 23:34:17,771 INFO] Building model...
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/bin/train.py", line 115, in batch_producer
    b = next_batch(0)
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/bin/train.py", line 111, in next_batch
    new_batch = next(generator_to_serve)
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/inputters/inputter.py", line 822, in __iter__
    for batch in self._iter_dataset(path):
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/inputters/inputter.py", line 804, in _iter_dataset
    for batch in cur_iter:
  File "/net/zf18/zy3cx/OpenNMT-py/onmt/inputters/inputter.py", line 695, in __iter__
    self.device)
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/field.py", line 338, in numericalize
    arr = [self.vocab.stoi[x] for x in arr]
  File "/zf18/zy3cx/ENTER/envs/rnn_language_model/lib/python3.7/site-packages/torchtext/data/field.py", line 338, in <listcomp>
    arr = [self.vocab.stoi[x] for x in arr]
AttributeError: 'Field' object has no attribute 'vocab'
Has anyone run into a similar problem? The vocab in iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt is an 'old style vocab', and I'm not sure whether that is what causes the error.