allanj closed this issue 5 years ago.
I think that looks right? Have you tried it/did it not work or something?
I ran a coreference config with allennlp train coref.jsonnet -s test and got this error; it looks like roberta-base might not be in the model name list?
2019-08-16 15:53:17,304 - ERROR - pytorch_pretrained_bert.modeling - Model name 'roberta-base' was not found in model name list
(bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese).
We assumed 'roberta-base' was a path or url but couldn't find any file associated to this path or url.
The full log is attached as well (full.log).
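For reference, the same error can be reproduced outside the config. This is just a sketch, assuming the old pytorch-pretrained-bert package is the one resolving the model name:

```python
# Sketch reproducing the error above: pytorch-pretrained-bert only knows its
# hard-coded BERT checkpoint names, so 'roberta-base' is treated as a path/URL,
# the lookup fails, and from_pretrained logs the error and returns None.
from pytorch_pretrained_bert.modeling import BertModel

model = BertModel.from_pretrained("roberta-base")
print(model)  # None, after the "not found in model name list" error is logged
```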
Could you check your version of pytorch-pretrained-bert and see if it actually contains roberta-base? (RoBERTa was just added yesterday, so I'd expect upgrading to fix the issue: https://github.com/huggingface/pytorch-transformers/releases )
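A quick way to check is to load the RoBERTa classes directly. This is a sketch, assuming pytorch-transformers >= 1.1.0 is installed:

```python
# Sanity check that the installed pytorch-transformers release ships RoBERTa.
from pytorch_transformers import RobertaModel, RobertaTokenizer

model = RobertaModel.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(model.config.hidden_size)   # 768 for roberta-base
print(len(tokenizer.encoder))     # size of the byte-level BPE vocabulary
```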
@brendan-ai2 @nelson-liu Thanks. I upgraded pytorch_transformers from 1.0.0 to 1.1.0.
I would still get the error. But one quick trick to make things work is to edit bert_token_embedder.py: change from pytorch_pretrained_bert.modeling import BertModel to from pytorch_transformers import RobertaModel, and also replace every later occurrence of BertModel with RobertaModel (a sketch of the change is below).
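Something like this (a hacky sketch of the edit described above, not a real fix; the rest of bert_token_embedder.py is left unchanged and may still need tweaks):

```python
# Hacky sketch of the workaround in allennlp's bert_token_embedder.py.

# Before:
# from pytorch_pretrained_bert.modeling import BertModel
# After:
from pytorch_transformers import RobertaModel

# ...and later uses of BertModel become RobertaModel, e.g.:
# model = BertModel.from_pretrained(pretrained_model)
model = RobertaModel.from_pretrained("roberta-base")
```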
Let me know if I'm doing something wrong above.
Then I got the vocab error:
...
File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/iterators/bucket_iterator.py", line 124, in _create_batches
self._padding_noise)
File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/iterators/bucket_iterator.py", line 30, in sort_by_padding
instance.index_fields(vocab)
File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/instance.py", line 72, in index_fields
field.index(vocab)
File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/fields/text_field.py", line 70, in index
token_indices = indexer.tokens_to_indices(self.tokens, vocab, indexer_name)
File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 72, in tokens_to_indices
self._add_encoding_to_vocabulary(vocabulary)
File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 62, in _add_encoding_to_vocabulary
for word, idx in self.tokenizer.vocab.items():
AttributeError: 'NoneType' object has no attribute 'vocab'
I guess the reason is that I'm not using the PretrainedTransformerTokenizer? But my input format is already tokenized. Seems like I also need a RobertaTokenizer: https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/tokenization_roberta.py#L66
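For what it's worth, the RoBERTa tokenizer can be loaded directly from the upgraded package (a sketch; it assumes a pytorch-transformers version with RoBERTa support):

```python
from pytorch_transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# RoBERTa uses a byte-level BPE vocabulary, so even pre-tokenized words can be
# split into further subword pieces by the tokenizer.
print(tokenizer.tokenize("coreference"))
```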
@allanj did you get it to work? To me, it looks like it also required a migration from pytorch-pretrained-bert to pytorch_transformers (#3067), or am I mistaken?
@namednil Yeah, you are right. I was informed that pytorch_transformers will be integrated soon. Please stay tuned.
It seems RoBERTa was added to the AutoTokenizer that is used in pretrained_transformer_indexer.py after Pytorch-Transformers version 1.1.0 was already released:
https://github.com/huggingface/pytorch-transformers/commit/ab05280666c9e1cfbbb23122825f3a41b7ff82c3#diff-8c1301d19bdbcd77667f3d3b76b7b4d8.
Also, it seems pretrained_transformer_indexer.py assumes the existence of a vocab member (self.tokenizer.vocab on line 67), which RobertaTokenizer does not provide (its vocabulary seems to be called encoder instead).
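To illustrate that mismatch (a sketch; the attribute names are as exposed by pytorch-transformers at the time and could change):

```python
from pytorch_transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

print(hasattr(bert_tok, "vocab"))     # True: the indexer's vocab loop works
print(hasattr(roberta_tok, "vocab"))  # False: the BPE vocabulary is in `encoder`
print(len(roberta_tok.encoder))
```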
Yes @alontalmor, roberta needs a bit of additional code. @OyvindTafjord has a fix for this, and one of the two of us will merge it soon.
@allanj, @namednil, @alontalmor, if you're really anxious to have this right now, #3203 and #3204 together should fix this for cases where you have matched tokenization and embedding (e.g., you use GPT2 and its matching tokenizer, or RoBERTa and its tokenizer). You also currently have to install pytorch-transformers from master, because there is no released version that includes RoBERTa in AutoTokenizer and AutoModel. Configuration is as listed above, except you use the pretrained_transformer token embedder instead of bert-pretrained. You will also likely have to change your tokenizer to match; see the sketch below.
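Roughly, the relevant config pieces look like this (a sketch written as a Python dict rather than jsonnet; the type and key names are my reading of the pretrained_transformer indexer/embedder on master and may not match your AllenNLP version exactly):

```python
# Sketch of the relevant config pieces (Python dict rather than jsonnet; key
# names are assumptions and may differ in your allennlp version).
roberta_config_fragment = {
    "dataset_reader": {
        "token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",   # instead of "bert-pretrained"
                "model_name": "roberta-base",
                "do_lowercase": False,
            }
        }
    },
    "model": {
        "text_field_embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "pretrained_transformer",
                    "model_name": "roberta-base",
                }
            }
        }
    },
}
```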
Closing this issue, as those two PRs have been merged. Let us know (preferably in a new issue) if you still have issues.
Question: I was using the configuration for BERT like below. When I change to pretrained_transformer, is the following configuration the correct way to use it? (I ask because I didn't find a pretrained_transformer token embedder.)