Open Maybewuss opened 4 years ago
Commenting for visibility. I'm facing the same bug:
Error(s) in loading state_dict for BertLongForQuestionAnswering:
size mismatch for bert.embeddings.position_ids: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
The model I'm saving has the correct dimensions (I think):
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(4096, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
@ibeltagy any help would be appreciated!
Here's the code that I managed to get working.
Key changes: the way I update position_embeddings and position_ids.
import logging

import torch
from transformers import AutoTokenizer, BertForQuestionAnswering
# import path as of transformers 3.x; the module moved in later releases
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)


def create_long_model(save_model_to, attention_window, max_pos, model_path_args):
    model = BertForQuestionAnswering.from_pretrained(model_path_args)
    tokenizer = AutoTokenizer.from_pretrained(model_path_args)
    config = model.config
    print(max_pos)

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.bert.embeddings.position_embeddings.weight.shape
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos

    # allocate a larger position embedding matrix
    new_pos_embed = model.bert.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    print(new_pos_embed.shape)
    print(model.bert.embeddings.position_embeddings)

    # copy position embeddings over and over to initialize the new position embeddings
    # (this assumes max_pos is a multiple of current_max_pos, e.g. 4096 = 8 * 512)
    k = 0
    step = current_max_pos
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.bert.embeddings.position_embeddings.weight
        k += step
    print(new_pos_embed.shape)

    # key change: resize position_ids to shape [1, max_pos] so it matches the new table
    # (torch.arange replaces the original TensorFlow round trip)
    model.bert.embeddings.position_ids = torch.arange(new_pos_embed.shape[0]).unsqueeze(0)
    # note: Embedding.from_pretrained freezes the weights unless freeze=False is passed
    model.bert.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(new_pos_embed)

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.bert.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        # the global projections share the same modules as the local ones
        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn
    print(model.bert.embeddings.position_ids.shape)

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer, new_pos_embed
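The copy loop in create_long_model assumes that max_pos is an exact multiple of the original table size (512 to 4096 works). As a rough sketch that is not part of the code above, the hypothetical helper `tiling_slices` lists which destination rows each copy would fill, and shortens the last chunk when the sizes do not divide evenly:

```python
def tiling_slices(current_max_pos, max_pos):
    """Return (dst_start, dst_end, src_len) triples for tiling a position table.

    Hypothetical helper for illustration: rows [dst_start, dst_end) of the new
    table would be filled from the first src_len rows of the old table.
    """
    if max_pos <= current_max_pos:
        raise ValueError("max_pos must be larger than current_max_pos")
    slices = []
    k = 0
    while k < max_pos:
        end = min(k + current_max_pos, max_pos)
        slices.append((k, end, end - k))
        k = end
    return slices


print(tiling_slices(512, 4096)[:2])  # first two full-size chunks
print(tiling_slices(512, 1200))      # last chunk is truncated to 176 rows
```

With these triples, each copy could be written as new_pos_embed[dst_start:dst_end] = old_weight[:src_len], which would avoid a shape error on the final slice.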
Glad you figured it out @krrishdholakia. Yes, the Roberta example is for demonstration and as you did, it needs to be adapted to different models to take their details into account.
@krrishdholakia Could you please share with us the full code?
I tried to convert roberta-base and hit the same problem. Could anyone provide a full script to convert the model?
I hit the same problem and found that it is caused by the transformers version. The notebook 'convert_model_to_long.ipynb' uses transformers==3.0.2, but the bug appears when I use transformers==3.3.1. Here is my solution:
In the function create_long_model() of 'convert_model_to_long.ipynb',
after this line of code (here I use 'bert'; for other models, replace 'bert' with the corresponding attribute name): new_pos_embed = model.bert.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
add one line of code: model.bert.embeddings.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
This resolves the 'size mismatch for bert.embeddings.position_ids' error.
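The error message shows that position_ids is stored in the checkpoint with its own shape, so it has to be resized together with the embedding table. Here is a minimal sketch of that idea using a toy module; `ToyEmbeddings` is a hypothetical stand-in, not the real BertEmbeddings class, and the sketch assumes only that PyTorch is installed:

```python
import torch
from torch import nn


class ToyEmbeddings(nn.Module):
    """Hypothetical stand-in with a position table and a position_ids buffer."""

    def __init__(self, max_pos, dim=8):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_pos, dim)
        self.register_buffer("position_ids", torch.arange(max_pos).expand((1, -1)))


emb = ToyEmbeddings(max_pos=512)
new_max_pos = 4096

# Enlarge the embedding table by tiling the old weights.
old_weight = emb.position_embeddings.weight.data
emb.position_embeddings = nn.Embedding.from_pretrained(
    old_weight.repeat(new_max_pos // old_weight.shape[0], 1), freeze=False
)
# Without this line, position_ids would still be saved with shape [1, 512].
emb.register_buffer("position_ids", torch.arange(new_max_pos).expand((1, -1)))

state = emb.state_dict()
print({name: tuple(t.shape) for name, t in state.items()})

# A module built with the larger size can now load the saved state.
reloaded = ToyEmbeddings(max_pos=new_max_pos)
reloaded.load_state_dict(state)
```

Re-registering under the same name replaces the existing buffer, so the saved state_dict carries the [1, 4096] shape that the long model expects.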
@LIUYANZHI88 Yes, you are right. But when I try to run inference, I hit a problem.
model_name: xlm-roberta-large
model = AutoModelForMaskedLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
error message:
/anaconda2/envs/longformer/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
1707 for token, index in added_tok_encoder_sorted:
1708 assert index == len(tokenizer), (
-> 1709 f"Non-consecutive added token '{token}' found. "
1710 f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
1711 )
AssertionError: Non-consecutive added token '<s>' found. Should have index 250002 but has index 0 in saved vocabulary.
It's very weird.
After I remove the file added_tokens.json, the error disappears, but there is a very long warning about network initialization.
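The assertion in the traceback expects each added token, taken in index order, to have an index equal to the current vocabulary size. A simplified sketch of that check follows; `first_non_consecutive` is a hypothetical helper, not the library's code, and the token indices are illustrative values chosen to mirror the error message rather than the real contents of added_tokens.json:

```python
def first_non_consecutive(added_tokens, base_vocab_size):
    """Return (token, expected_index, actual_index) for the first added token
    whose index breaks the consecutive sequence, or None if all are fine."""
    expected = base_vocab_size
    for token, index in sorted(added_tokens.items(), key=lambda item: item[1]):
        if index != expected:
            return token, expected, index
        expected += 1
    return None


# Illustrative values only (assumption): '<s>' already lives at index 0
# in the base vocabulary, so it cannot also be appended as an added token.
added = {"<s>": 0, "<mask>": 250001}
print(first_non_consecutive(added, base_vocab_size=250002))
```

In practice the mapping could be read with json.load from added_tokens.json before deciding whether the file is safe to delete.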
INFO:__main__:Loading the model from tmp/xlm-roberta-large-4096
Some weights of the model checkpoint at tmp/xlm-roberta-large-4096 were not used when initializing XLMRobertaForMaskedLM: ['roberta.encoder.layer.0.attention.self.query_global.weight', 'roberta.encoder.layer.0.attention.self.query_global.bias', 'roberta.encoder.layer.0.attention.self.key_global.weight', 'roberta.encoder.layer.0.attention.self.key_global.bias', 'roberta.encoder.layer.0.attention.self.value_global.weight', 'roberta.encoder.layer.0.attention.self.value_global.bias', 'roberta.encoder.layer.1.attention.self.query_global.weight', 'roberta.encoder.layer.1.attention.self.query_global.bias', 'roberta.encoder.layer.1.attention.self.key_global.weight', 'roberta.encoder.layer.1.attention.self.key_global.bias', 'roberta.encoder.layer.1.attention.self.value_global.weight', 'roberta.encoder.layer.1.attention.self.value_global.bias', 'roberta.encoder.layer.2.attention.self.query_global.weight', 'roberta.encoder.layer.2.attention.self.query_global.bias', 'roberta.encoder.layer.2.attention.self.key_global.weight', 'roberta.encoder.layer.2.attention.self.key_global.bias', 'roberta.encoder.layer.2.attention.self.value_global.weight', 'roberta.encoder.layer.2.attention.self.value_global.bias', 'roberta.encoder.layer.3.attention.self.query_global.weight', 'roberta.encoder.layer.3.attention.self.query_global.bias', 'roberta.encoder.layer.3.attention.self.key_global.weight', 'roberta.encoder.layer.3.attention.self.key_global.bias', 'roberta.encoder.layer.3.attention.self.value_global.weight', 'roberta.encoder.layer.3.attention.self.value_global.bias', 'roberta.encoder.layer.4.attention.self.query_global.weight', 'roberta.encoder.layer.4.attention.self.query_global.bias', 'roberta.encoder.layer.4.attention.self.key_global.weight', 'roberta.encoder.layer.4.attention.self.key_global.bias', 'roberta.encoder.layer.4.attention.self.value_global.weight', 'roberta.encoder.layer.4.attention.self.value_global.bias', 'roberta.encoder.layer.5.attention.self.query_global.weight', 
'roberta.encoder.layer.5.attention.self.query_global.bias', 'roberta.encoder.layer.5.attention.self.key_global.weight', 'roberta.encoder.layer.5.attention.self.key_global.bias', 'roberta.encoder.layer.5.attention.self.value_global.weight', 'roberta.encoder.layer.5.attention.self.value_global.bias', 'roberta.encoder.layer.6.attention.self.query_global.weight', 'roberta.encoder.layer.6.attention.self.query_global.bias', 'roberta.encoder.layer.6.attention.self.key_global.weight', 'roberta.encoder.layer.6.attention.self.key_global.bias', 'roberta.encoder.layer.6.attention.self.value_global.weight', 'roberta.encoder.layer.6.attention.self.value_global.bias', 'roberta.encoder.layer.7.attention.self.query_global.weight', 'roberta.encoder.layer.7.attention.self.query_global.bias', 'roberta.encoder.layer.7.attention.self.key_global.weight', 'roberta.encoder.layer.7.attention.self.key_global.bias', 'roberta.encoder.layer.7.attention.self.value_global.weight', 'roberta.encoder.layer.7.attention.self.value_global.bias', 'roberta.encoder.layer.8.attention.self.query_global.weight', 'roberta.encoder.layer.8.attention.self.query_global.bias', 'roberta.encoder.layer.8.attention.self.key_global.weight', 'roberta.encoder.layer.8.attention.self.key_global.bias', 'roberta.encoder.layer.8.attention.self.value_global.weight', 'roberta.encoder.layer.8.attention.self.value_global.bias', 'roberta.encoder.layer.9.attention.self.query_global.weight', 'roberta.encoder.layer.9.attention.self.query_global.bias', 'roberta.encoder.layer.9.attention.self.key_global.weight', 'roberta.encoder.layer.9.attention.self.key_global.bias', 'roberta.encoder.layer.9.attention.self.value_global.weight', 'roberta.encoder.layer.9.attention.self.value_global.bias', 'roberta.encoder.layer.10.attention.self.query_global.weight', 'roberta.encoder.layer.10.attention.self.query_global.bias', 'roberta.encoder.layer.10.attention.self.key_global.weight', 'roberta.encoder.layer.10.attention.self.key_global.bias', 
'roberta.encoder.layer.10.attention.self.value_global.weight', 'roberta.encoder.layer.10.attention.self.value_global.bias', 'roberta.encoder.layer.11.attention.self.query_global.weight', 'roberta.encoder.layer.11.attention.self.query_global.bias', 'roberta.encoder.layer.11.attention.self.key_global.weight', 'roberta.encoder.layer.11.attention.self.key_global.bias', 'roberta.encoder.layer.11.attention.self.value_global.weight', 'roberta.encoder.layer.11.attention.self.value_global.bias', 'roberta.encoder.layer.12.attention.self.query_global.weight', 'roberta.encoder.layer.12.attention.self.query_global.bias', 'roberta.encoder.layer.12.attention.self.key_global.weight', 'roberta.encoder.layer.12.attention.self.key_global.bias', 'roberta.encoder.layer.12.attention.self.value_global.weight', 'roberta.encoder.layer.12.attention.self.value_global.bias', 'roberta.encoder.layer.13.attention.self.query_global.weight', 'roberta.encoder.layer.13.attention.self.query_global.bias', 'roberta.encoder.layer.13.attention.self.key_global.weight', 'roberta.encoder.layer.13.attention.self.key_global.bias', 'roberta.encoder.layer.13.attention.self.value_global.weight', 'roberta.encoder.layer.13.attention.self.value_global.bias', 'roberta.encoder.layer.14.attention.self.query_global.weight', 'roberta.encoder.layer.14.attention.self.query_global.bias', 'roberta.encoder.layer.14.attention.self.key_global.weight', 'roberta.encoder.layer.14.attention.self.key_global.bias', 'roberta.encoder.layer.14.attention.self.value_global.weight', 'roberta.encoder.layer.14.attention.self.value_global.bias', 'roberta.encoder.layer.15.attention.self.query_global.weight', 'roberta.encoder.layer.15.attention.self.query_global.bias', 'roberta.encoder.layer.15.attention.self.key_global.weight', 'roberta.encoder.layer.15.attention.self.key_global.bias', 'roberta.encoder.layer.15.attention.self.value_global.weight', 'roberta.encoder.layer.15.attention.self.value_global.bias', 
'roberta.encoder.layer.16.attention.self.query_global.weight', 'roberta.encoder.layer.16.attention.self.query_global.bias', 'roberta.encoder.layer.16.attention.self.key_global.weight', 'roberta.encoder.layer.16.attention.self.key_global.bias', 'roberta.encoder.layer.16.attention.self.value_global.weight', 'roberta.encoder.layer.16.attention.self.value_global.bias', 'roberta.encoder.layer.17.attention.self.query_global.weight', 'roberta.encoder.layer.17.attention.self.query_global.bias', 'roberta.encoder.layer.17.attention.self.key_global.weight', 'roberta.encoder.layer.17.attention.self.key_global.bias', 'roberta.encoder.layer.17.attention.self.value_global.weight', 'roberta.encoder.layer.17.attention.self.value_global.bias', 'roberta.encoder.layer.18.attention.self.query_global.weight', 'roberta.encoder.layer.18.attention.self.query_global.bias', 'roberta.encoder.layer.18.attention.self.key_global.weight', 'roberta.encoder.layer.18.attention.self.key_global.bias', 'roberta.encoder.layer.18.attention.self.value_global.weight', 'roberta.encoder.layer.18.attention.self.value_global.bias', 'roberta.encoder.layer.19.attention.self.query_global.weight', 'roberta.encoder.layer.19.attention.self.query_global.bias', 'roberta.encoder.layer.19.attention.self.key_global.weight', 'roberta.encoder.layer.19.attention.self.key_global.bias', 'roberta.encoder.layer.19.attention.self.value_global.weight', 'roberta.encoder.layer.19.attention.self.value_global.bias', 'roberta.encoder.layer.20.attention.self.query_global.weight', 'roberta.encoder.layer.20.attention.self.query_global.bias', 'roberta.encoder.layer.20.attention.self.key_global.weight', 'roberta.encoder.layer.20.attention.self.key_global.bias', 'roberta.encoder.layer.20.attention.self.value_global.weight', 'roberta.encoder.layer.20.attention.self.value_global.bias', 'roberta.encoder.layer.21.attention.self.query_global.weight', 'roberta.encoder.layer.21.attention.self.query_global.bias', 
'roberta.encoder.layer.21.attention.self.key_global.weight', 'roberta.encoder.layer.21.attention.self.key_global.bias', 'roberta.encoder.layer.21.attention.self.value_global.weight', 'roberta.encoder.layer.21.attention.self.value_global.bias', 'roberta.encoder.layer.22.attention.self.query_global.weight', 'roberta.encoder.layer.22.attention.self.query_global.bias', 'roberta.encoder.layer.22.attention.self.key_global.weight', 'roberta.encoder.layer.22.attention.self.key_global.bias', 'roberta.encoder.layer.22.attention.self.value_global.weight', 'roberta.encoder.layer.22.attention.self.value_global.bias', 'roberta.encoder.layer.23.attention.self.query_global.weight', 'roberta.encoder.layer.23.attention.self.query_global.bias', 'roberta.encoder.layer.23.attention.self.key_global.weight', 'roberta.encoder.layer.23.attention.self.key_global.bias', 'roberta.encoder.layer.23.attention.self.value_global.weight', 'roberta.encoder.layer.23.attention.self.value_global.bias']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
@thesby , did you manage to solve that warning? I'm facing the same issue:
INFO:pytorch_pretrained_bert.modeling:Weights from pretrained model not used in BertModel: ['encoder.layer.0.attention.self.query_global.weight', 'encoder.layer.0.attention.self.query_global.bias', 'encoder.layer.0.attention.self.key_global.weight', 'encoder.layer.0.attention.self.key_global.bias', 'encoder.layer.0.attention.self.value_global.weight', 'encoder.layer.0.attention.self.value_global.bias', 'encoder.layer.1.attention.self.query_global.weight', 'encoder.layer.1.attention.self.query_global.bias', 'encoder.layer.1.attention.self.key_global.weight', 'encoder.layer.1.attention.self.key_global.bias', 'encoder.layer.1.attention.self.value_global.weight', 'encoder.layer.1.attention.self.value_global.bias', 'encoder.layer.2.attention.self.query_global.weight', 'encoder.layer.2.attention.self.query_global.bias', 'encoder.layer.2.attention.self.key_global.weight', 'encoder.layer.2.attention.self.key_global.bias', 'encoder.layer.2.attention.self.value_global.weight', 'encoder.layer.2.attention.self.value_global.bias', 'encoder.layer.3.attention.self.query_global.weight', 'encoder.layer.3.attention.self.query_global.bias', 'encoder.layer.3.attention.self.key_global.weight', 'encoder.layer.3.attention.self.key_global.bias', 'encoder.layer.3.attention.self.value_global.weight', 'encoder.layer.3.attention.self.value_global.bias', 'encoder.layer.4.attention.self.query_global.weight', 'encoder.layer.4.attention.self.query_global.bias', 'encoder.layer.4.attention.self.key_global.weight', 'encoder.layer.4.attention.self.key_global.bias', 'encoder.layer.4.attention.self.value_global.weight', 'encoder.layer.4.attention.self.value_global.bias', 'encoder.layer.5.attention.self.query_global.weight', 'encoder.layer.5.attention.self.query_global.bias', 'encoder.layer.5.attention.self.key_global.weight', 'encoder.layer.5.attention.self.key_global.bias', 'encoder.layer.5.attention.self.value_global.weight', 'encoder.layer.5.attention.self.value_global.bias', 
'encoder.layer.6.attention.self.query_global.weight', 'encoder.layer.6.attention.self.query_global.bias', 'encoder.layer.6.attention.self.key_global.weight', 'encoder.layer.6.attention.self.key_global.bias', 'encoder.layer.6.attention.self.value_global.weight', 'encoder.layer.6.attention.self.value_global.bias', 'encoder.layer.7.attention.self.query_global.weight', 'encoder.layer.7.attention.self.query_global.bias', 'encoder.layer.7.attention.self.key_global.weight', 'encoder.layer.7.attention.self.key_global.bias', 'encoder.layer.7.attention.self.value_global.weight', 'encoder.layer.7.attention.self.value_global.bias', 'encoder.layer.8.attention.self.query_global.weight', 'encoder.layer.8.attention.self.query_global.bias', 'encoder.layer.8.attention.self.key_global.weight', 'encoder.layer.8.attention.self.key_global.bias', 'encoder.layer.8.attention.self.value_global.weight', 'encoder.layer.8.attention.self.value_global.bias', 'encoder.layer.9.attention.self.query_global.weight', 'encoder.layer.9.attention.self.query_global.bias', 'encoder.layer.9.attention.self.key_global.weight', 'encoder.layer.9.attention.self.key_global.bias', 'encoder.layer.9.attention.self.value_global.weight', 'encoder.layer.9.attention.self.value_global.bias', 'encoder.layer.10.attention.self.query_global.weight', 'encoder.layer.10.attention.self.query_global.bias', 'encoder.layer.10.attention.self.key_global.weight', 'encoder.layer.10.attention.self.key_global.bias', 'encoder.layer.10.attention.self.value_global.weight', 'encoder.layer.10.attention.self.value_global.bias', 'encoder.layer.11.attention.self.query_global.weight', 'encoder.layer.11.attention.self.query_global.bias', 'encoder.layer.11.attention.self.key_global.weight', 'encoder.layer.11.attention.self.key_global.bias', 'encoder.layer.11.attention.self.value_global.weight', 'encoder.layer.11.attention.self.value_global.bias']
@thesby @dcaled I think you both have the same problem. In @thesby's example, you converted the pretrained xlm-roberta-large model to a long version and saved its weights; call the result "xlm-roberta-large-long". You then try to initialize XLMRobertaForMaskedLM (the built-in huggingface/transformers class) from the xlm-roberta-large-long checkpoint. The built-in XLMRobertaForMaskedLM class is not the long version of the model, so it does not use some of the weights in "xlm-roberta-large-long". As "convert_model_to_long.ipynb" shows at the beginning, you can write a class such as "XLMRobertaLongForMaskedLM" (the long version of the model) and use that class to load the weights of the long model you created. @dcaled The same applies to you: write a class such as "BertLongModel" and use it to load the weights. I think that will resolve the warning.
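Before writing the long model class, it can help to confirm that the warning only concerns the Longformer global projections. A small sketch, assuming the key naming seen in the warnings above; `split_unused_keys` is a hypothetical helper and the third key in the sample list is made up for contrast:

```python
import re

# Matches keys such as '...attention.self.query_global.weight'
GLOBAL_KEY = re.compile(r"\.attention\.self\.(query|key|value)_global\.(weight|bias)$")


def split_unused_keys(unused_keys):
    """Split unused checkpoint keys into global-attention keys and everything else."""
    global_keys = [k for k in unused_keys if GLOBAL_KEY.search(k)]
    other_keys = [k for k in unused_keys if not GLOBAL_KEY.search(k)]
    return global_keys, other_keys


unused = [
    "roberta.encoder.layer.0.attention.self.query_global.weight",
    "encoder.layer.11.attention.self.value_global.bias",
    "some.other.unexpected.weight",  # hypothetical key, for contrast only
]
global_keys, other_keys = split_unused_keys(unused)
print(len(global_keys), other_keys)
```

If other_keys is empty, the built-in class is dropping only the `*_global` weights, which is what loading through a long model class is meant to avoid.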
@LIUYANZHI88 , your suggestion fixed my first problem. Thank you very much for helping.
However, now I might have a new problem with the tokenizer. Just like the original "convert_model_to_long.ipynb" indicates, I'm using the same tokenizer class (BertTokenizerFast) in create_long_model function and when loading the pretrained model from the disk:
model = BertLong.from_pretrained(model_path)
tokenizer = BertTokenizerFast.from_pretrained(model_path)
Here is the INFO:
INFO:transformers.tokenization_utils_base:Model name 'tmp/bert-base-4096' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'tmp/bert-base-4096' is a path, a model identifier, or url to a directory containing tokenizer files.
INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/added_tokens.json. We won't load it.
INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/tokenizer.json. We won't load it.
INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/vocab.txt
INFO:transformers.tokenization_utils_base:loading file None
INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/special_tokens_map.json
INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/tokenizer_config.json
INFO:transformers.tokenization_utils_base:loading file None
Actually, added_tokens.json and tokenizer.json are not there, but the other files are. Would that be a problem? I noticed the same thing happens in the original "convert_model_to_long.ipynb". Any idea of how to fix it?
Here is my complete solution: convert_bert_to_long.ipynb
Hi @dcaled, I did exactly what you did, and it works for me with transformers 3.1 and bert-base-uncased.
I tested with both and still get the warning when loading the tokenizer. I'm using google colab.
I don't think it is a problem that added_tokens.json and tokenizer.json are missing. For me, vocab.txt is enough to initialize a tokenizer, so I think you can ignore these INFO messages. They might come from a new feature of huggingface/transformers; as far as I remember, the tokenizer in older versions did not need added_tokens.json or tokenizer.json.
After converting the BERT model to the long version, I tried to load it, but it failed with: Error(s) in loading state_dict for RobertaLongForMaskedLM: size mismatch for embeddings.position_ids: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
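One way to narrow this down before loading is to compare the shapes stored in the checkpoint with the shapes the model expects. A minimal sketch: `shape_mismatches` is a hypothetical helper, the position_ids shapes are taken from the error message, and the position_embeddings shape is an assumed value for illustration:

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for entries that differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


checkpoint = {
    "embeddings.position_ids": (1, 512),
    "embeddings.position_embeddings.weight": (4096, 768),  # assumed shape
}
model = {
    "embeddings.position_ids": (1, 4096),
    "embeddings.position_embeddings.weight": (4096, 768),  # assumed shape
}
print(shape_mismatches(checkpoint, model))
```

With real tensors, each mapping could be built as {k: tuple(v.shape) for k, v in state_dict.items()}. A mismatch limited to position_ids would suggest that the buffer was not resized before the model was saved.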