allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Configuration of using RoBERTa #3161

Closed allanj closed 5 years ago

allanj commented 5 years ago

Question: I was using the following configuration for BERT:

local bert_model = "bert-base-uncased";
....
"token_indexers": {
      "bert": {
        "type": "bert-pretrained",
        "pretrained_model": bert_model,
        "do_lowercase": true,
        "use_starting_offsets": true,
        "truncate_long_sequences": false
      }
    }
....
"text_field_embedder": {
      "allow_unmatched_keys": true,
      "embedder_to_indexer_map": {
          "bert": ["bert", "bert-offsets", "bert-type-ids"]
      },
      "token_embedders": {
        "bert":{
            "type": "bert-pretrained",
            "pretrained_model": bert_model,
            "top_layer_only": true,
             "requires_grad": true
        }
      }
 }

When I switch to pretrained_transformer, is the following configuration the correct way to use it? (I ask because I didn't find a pretrained_transformer token embedder.)

local bert_model = "roberta-base";
....
"token_indexers": {
      "bert": {
        "type": "pretrained_transformer",
        "pretrained_model": bert_model,
        "do_lowercase": true
      }
    }
....
"text_field_embedder": {
      "allow_unmatched_keys": true,
      "embedder_to_indexer_map": {
          "bert": ["bert", "bert-offsets", "bert-type-ids"]
      },
      "token_embedders": {
        "bert":{
            "type": "bert-pretrained",
            "pretrained_model": bert_model,
            "top_layer_only": true,
             "requires_grad": true
        }
      }
 }
DeNeutoy commented 5 years ago

I think that looks right? Have you tried it/did it not work or something?

allanj commented 5 years ago

I ran a coreference config with allennlp train coref.jsonnet -s test and got this error; maybe roberta-base is not in the model name list?

2019-08-16 15:53:17,304 - ERROR - pytorch_pretrained_bert.modeling - Model name 'roberta-base' was not found in model name list 
(bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). 
We assumed 'roberta-base' was a path or url but couldn't find any file associated to this path or url.

The full log is attached as well (full.log).

brendan-ai2 commented 5 years ago

Could you check your version of pytorch-pretrained-bert and see if it actually contains roberta-base?

nelson-liu commented 5 years ago

(roberta was just added yesterday, so i'd expect upgrading would fix the issue: https://github.com/huggingface/pytorch-transformers/releases )

allanj commented 5 years ago

@brendan-ai2 @nelson-liu Thanks. I upgraded pytorch_transformers from 1.0.0 to 1.1.0.

I still get the error, but one quick trick to make things work is:

  1. Change this line in bert_token_embedder.py
    from pytorch_pretrained_bert.modeling import BertModel

    to

    from pytorch_transformers import RobertaModel

    (Also replace the subsequent occurrences of BertModel in that file with RobertaModel.)

Let me know if I'm doing something wrong above.

Then I got the vocab error:

...
  File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/iterators/bucket_iterator.py", line 124, in _create_batches
    self._padding_noise)
  File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/iterators/bucket_iterator.py", line 30, in sort_by_padding
    instance.index_fields(vocab)
  File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/instance.py", line 72, in index_fields
    field.index(vocab)
  File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/fields/text_field.py", line 70, in index
    token_indices = indexer.tokens_to_indices(self.tokens, vocab, indexer_name)
  File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 72, in tokens_to_indices
    self._add_encoding_to_vocabulary(vocabulary)
  File "/Users/allanj/latest_allennlp/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 62, in _add_encoding_to_vocabulary
    for word, idx in self.tokenizer.vocab.items():
AttributeError: 'NoneType' object has no attribute 'vocab'

I guess the reason is that I'm not using the PretrainedTransformerTokenizer? But my input is already tokenized.

allanj commented 5 years ago

Seems like I also need a RobertaTokenizer: https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/tokenization_roberta.py#L66
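
For reference, a minimal sketch of what a matching tokenizer entry could look like on the AllenNLP config side, assuming the dataset reader accepts a tokenizer key, that PretrainedTransformerTokenizer is registered as pretrained_transformer, and that its parameters are model_name and do_lowercase (the exact names may differ by version):

"tokenizer": {
    "type": "pretrained_transformer",
    "model_name": "roberta-base",
    "do_lowercase": false
}

For input that is already tokenized, as in my case above, this wouldn't apply directly.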

namednil commented 5 years ago

@allanj did you get it to work? To me, it looks like it also required a migration from pytorch-pretrained-bert to pytorch_transformers (#3067), or am I mistaken?

allanj commented 5 years ago

@namednil Yeah, you are right. I was informed that pytorch_transformers will be integrated soon. Please stay tuned.

alontalmor commented 5 years ago

It seems RoBERTa was added to the AutoTokenizer used by pretrained_transformer_indexer.py only after pytorch-transformers 1.1.0 was released: https://github.com/huggingface/pytorch-transformers/commit/ab05280666c9e1cfbbb23122825f3a41b7ff82c3#diff-8c1301d19bdbcd77667f3d3b76b7b4d8.
It also seems that pretrained_transformer_indexer.py assumes the tokenizer has a vocab member (self.tokenizer.vocab on line 67), which RobertaTokenizer does not provide (the attribute appears to be called encoder instead).

matt-gardner commented 5 years ago

Yes @alontalmor, roberta needs a bit of additional code. @OyvindTafjord has a fix for this, and one of the two of us will merge it soon.

matt-gardner commented 5 years ago

@allanj, @namednil, @alontalmor, if you're really anxious to have this right now, #3203 and #3204 together should fix this for cases where you have matched tokenization and embedding (e.g., you use GPT2 and its matching tokenizer, or roberta and its tokenizer). You also currently have to install pytorch-transformers from master, because there is no released version that includes roberta in AutoTokenizer and AutoModel. The configuration is as listed above, except that you use the pretrained_transformer token embedder instead of bert-pretrained. You will also likely have to change your tokenizer to match.
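
For concreteness, a minimal sketch of such a config for roberta-base might look like the following. This is a sketch only, assuming AllenNLP master with #3203/#3204 merged and pytorch-transformers installed from master; the parameter names (model_name, do_lowercase) are taken from the pretrained_transformer classes and may differ between versions, so check the class signatures:

local transformer_model = "roberta-base";
....
// Tokenizer (where the dataset reader accepts one), so tokenization matches the model.
"tokenizer": {
    "type": "pretrained_transformer",
    "model_name": transformer_model,
    "do_lowercase": false
},
"token_indexers": {
    "roberta": {
        "type": "pretrained_transformer",
        "model_name": transformer_model,
        "do_lowercase": false
    }
},
....
"text_field_embedder": {
    // allow_unmatched_keys / embedder_to_indexer_map are likely unnecessary here,
    // since the pretrained_transformer indexer produces a single key for the embedder.
    "token_embedders": {
        "roberta": {
            "type": "pretrained_transformer",
            "model_name": transformer_model
        }
    }
}

The key name ("roberta" here) is arbitrary, as long as the token_indexers entry and the token_embedders entry use the same one.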

matt-gardner commented 5 years ago

Closing this issue, as those two PRs have been merged. Let us know (preferably in a new issue) if you still have issues.