JohannesTK opened this issue 4 years ago:

Hey,

Congratulations on the impressive results and thank you for open-sourcing the work! 🤗

I have a question: do you also plan to implement Longformer for XLM-R? Cross-lingual NLP with long text would be extremely useful.

Thanks & stay healthy, Johannes
We don't have plans to implement it for XLM-R, but our procedure for pretraining Longformer starting from the RoBERTa checkpoint (beginning of Section 5) can easily be applied to most other models, including XLM-R. Here's a summary of the steps:

1. Load the pretrained checkpoint and extend its position embeddings beyond 512 by copying the learned position embeddings over and over.
2. Replace each layer's self-attention with LongformerSelfAttention, initializing both the local and the global attention projections from the existing query/key/value weights.
3. Continue pretraining with the masked language modeling objective on a corpus of long documents for a few thousand steps.
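To make step 1 concrete, here is a minimal sketch (illustrative only, not the notebook's exact code; the checkpoint name and max_pos value are just examples) of tiling the learned position embeddings of an XLM-R-style model into a larger matrix:

from transformers import XLMRobertaForMaskedLM

# Minimal sketch (illustrative, not the notebook's exact code): extend the
# learned position embeddings of an XLM-R-style model by tiling the existing
# 512 learned positions into a larger matrix.
model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-base')
max_pos = 4096 + 2                       # RoBERTa-style models reserve positions 0 and 1
old_embed = model.roberta.embeddings.position_embeddings.weight
current_max_pos, embed_size = old_embed.shape

new_pos_embed = old_embed.new_empty(max_pos, embed_size)
new_pos_embed[:2] = old_embed[:2]        # keep the two reserved positions
k, step = 2, current_max_pos - 2
while k < max_pos:
    chunk = min(step, max_pos - k)
    new_pos_embed[k:k + chunk] = old_embed[2:2 + chunk]
    k += chunk

model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
model.config.max_position_embeddings = max_pos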
Thanks for the thorough answer & advice!
@JohannesTK, in case you are still interested, I have just added a notebook that demonstrates how we pretrain Longformer starting from the RoBERTa checkpoint. It should be easy to reuse this notebook to pretrain your XLM-R-Long. The notebook is here: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb
@ibeltagy, thank you! Will give it a spin.
I tried to do that, but I'm getting an error:
C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py:26: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
INFO:transformers.trainer: Running Evaluation
INFO:transformers.trainer: Num examples = 2461
INFO:transformers.trainer: Batch size = 16
Evaluation: 0%| | 0/154 [00:01<?, ?it/s]
Traceback (most recent call last):
File "", line 149, in
The code:
import logging
import os
import math
from dataclasses import dataclass, field
from transformers import XLMRobertaForMaskedLM, LongformerTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention
from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import LineByLineTextDataset

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


class XLMRobertaLongForMaskedLM(XLMRobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = LongformerSelfAttention(config, layer_id=i)


def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-large')
    tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', model_max_length=max_pos, use_fast=True)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0, 1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos

    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)

    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer


def copy_proj_layers(model):
    for i, layer in enumerate(model.roberta.encoder.layer):
        layer.attention.self.query_global = layer.attention.self.query
        layer.attention.self.key_global = layer.attention.self.key
        layer.attention.self.value_global = layer.attention.self.value
    return model


def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):
    val_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path=args.val_datapath, block_size=tokenizer.max_len)
    if eval_only:
        train_dataset = val_dataset
    else:
        logger.info(f'Loading and tokenizing training data is usually slow: {args.train_datapath}')
        train_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path=args.train_datapath, block_size=tokenizer.max_len)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, eval_dataset=val_dataset, prediction_loss_only=True)

    eval_loss = trainer.evaluate()
    eval_loss = eval_loss['eval_loss']
    logger.info(f'Initial eval bpc: {eval_loss / math.log(2)}')

    if not eval_only:
        trainer.train(model_path=model_path)
        trainer.save_model()

        eval_loss = trainer.evaluate()
        eval_loss = eval_loss['eval_loss']
        logger.info(f'Eval bpc after pretraining: {eval_loss / math.log(2)}')


@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})


parser = HfArgumentParser((TrainingArguments, ModelArgs,))

training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '1',  # 2 - 32GB gpu with fp32
    # '--device', 'cuda0',  # one GPU
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

model_path = f'{training_args.output_dir}/xlm-roberta-large-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting xlm-roberta-large into xlm-roberta-large-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

logger.info(f'Loading the model from {model_path}')
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', use_fast=True)
model = XLMRobertaLongForMaskedLM.from_pretrained(model_path)

logger.info(f'Pretraining xlm-roberta-large-{model_args.max_pos} ... ')
training_args.max_steps = 3   ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<

pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)

logger.info(f'Copying local projection layers into global projection layers ... ')
model = copy_proj_layers(model)
logger.info(f'Saving model to {model_path}')
model.save_pretrained(model_path)
I don't know which version of HF you have so I can't be sure, but it looks like the forward function of BertSelfAttention (here) has a different input format compared to LongformerSelfAttention (here). You can implement a small class around LongformerSelfAttention that takes the input from BERT and converts it to the format expected by LongformerSelfAttention. We did the same thing when working on converting BART (check here).
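For anyone hitting the same mismatch, a rough sketch of such a wrapper could look like the following (illustrative only: the exact forward signatures of both attention classes differ between transformers releases, which is the root cause of the error above, so the argument list may need adjusting for your version):

from transformers.modeling_longformer import LongformerSelfAttention

class LongformerSelfAttentionForXLMRoberta(LongformerSelfAttention):
    # Illustrative adapter (not the repo's code): accept the arguments that the
    # (XLM-)RoBERTa BertSelfAttention-style layer passes in, and forward only
    # the ones LongformerSelfAttention understands.
    def forward(self, hidden_states, attention_mask=None, head_mask=None,
                encoder_hidden_states=None, encoder_attention_mask=None,
                output_attentions=False):
        # drop the cross-attention arguments that LongformerSelfAttention
        # does not accept
        return super().forward(
            hidden_states,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )

In create_long_model above, layer.attention.self would then be set to this wrapper class instead of LongformerSelfAttention directly.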
Thanks for the response! Unfortunately, I'm a newbie in this area. I would love to have the best multilingual model as a Longformer, so I'll subscribe for any news!
I tried to rerun the original notebook, just for my own sanity, and I can't make it run. I tried Python 3.8 and 3.7, installing the dependencies with pip install -r requirements.txt in a PyCharm conda environment. I also tried Google Colab, with no luck there either.
Take a look here: https://colab.research.google.com/drive/1skFNZ1pil1YG6mzO8jLGE4L-AASGTN5E?usp=sharing
My main goal is to make it work first and then switch it to XLM-RoBERTa. I think it will be a simple model change, because XLM-RoBERTa has the same architecture as RoBERTa.
Thank you for your help in advance!
Thanks, @davidhsv, for reporting this. Looks like the recent release of the HF code changed LongformerSelfAttention a bit, making it less compatible with BertSelfAttention. I will fix the notebook soon and let you know.
Thanks! I really appreciate that :)
Sorry for not being able to contribute more; I'm a recovering Java developer learning data science.
Does anyone have an approximation of how long it takes to pretrain the XLM-R model? Assuming pretraining from a checkpoint of XLM-R from HF https://huggingface.co/transformers/model_doc/xlmroberta.html
@davidhsv, @samru-rai, fixed. Can you please try the notebook again and let me know if you run into any issues?
how long it takes to pretrain XLM-R model?
@samru-rai, as mentioned in the notebook, you can still get a reasonable model even with zero pretraining. Additional pretraining definitely helps but for RoBERTa you get diminishing returns after processing around 800M tokens (around 2 days on a single GPU). With models other than RoBERTa, you will probably see the same general pattern but with different numbers.
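For a rough sense of scale, here is a back-of-the-envelope calculation (my own arithmetic, not from the paper), using the hyperparameters from the script posted earlier in this thread and assuming a single GPU:

# Back-of-the-envelope: tokens processed by the script above on a single GPU.
steps = 3000                 # --max_steps
per_gpu_batch_size = 1       # --per_gpu_train_batch_size
gradient_accumulation = 32   # --gradient_accumulation_steps
seq_len = 4096               # max_pos
tokens = steps * per_gpu_batch_size * gradient_accumulation * seq_len
print(f'{tokens / 1e6:.0f}M tokens')   # ~393M, roughly half of the ~800M-token budget mentioned above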
@ibeltagy regarding diminishing returns on additional pretraining, you mean in terms of improvements on the same pretraining corpus right, not e.g. domain-adaptive pretraining on a different corpus?
@CyndxAI, good point, yes, if you are training the long version + adapting to a new domain, more training will be needed.
Does anyone have an approximation of how long it takes to pretrain the XLM-R model? Assuming pretraining from a checkpoint of XLM-R from HF https://huggingface.co/transformers/model_doc/xlmroberta.html
For me, training on a single GPU, with the same hyperparameters, for 3000 iterations and with transformers 3.0.2, took 3 days and 11 hours. I also used fp16 and no gradient checkpointing.
@MarkusSagen Is there any chance you could share the pretrained multilingual longformer model and inference code?
@rplawate It depends. I'm doing a master's thesis at a company, investigating whether long context can be transferred to low-resource languages by extending the context of multilingual models and training in English only. If I get permission from the company, then yes, the aim is to release it to Hugging Face. I settled on training for 6000 iterations, but the training and eval loss could be decreased further.
@MarkusSagen Interesting. I'm hoping that you will open-source it. If it is of any help, I can provide resources for training it further than 6000 iterations.
@rplawate I've gotten the go-ahead. Send me an email and we can take it from there if it is still interesting.
For those still interested, I've made a model initialized with XLM-RoBERTa's weights without further pretraining. The output of the long version should be identical to the original model for input sequences with lengths < 0.5 * attention window size.
As @ibeltagy mentioned earlier, the intermediate model produced by just copying the position embeddings and linear projections is already good enough to be fine-tuned on a downstream task.
The model could also be used as a starting point for pretraining on other languages, like what @MarkusSagen did with the English WikiText-103 corpus.
Variants of the model are available on Hugging Face model hub:
Model | attention_window | hidden_size | num_hidden_layers | model_max_length |
---|---|---|---|---|
base | 256 | 768 | 12 | 16384 |
large | 512 | 1024 | 24 | 16384 |
And the notebook for replicating the models is available here: https://github.com/hyperonym/dirge/blob/master/models/xlm-roberta-longformer/convert.ipynb. Instead of swapping out RoBERTa's self-attention implementation, the notebook starts with a blank Longformer and copies the weights into it, which might make it easier to convert other BERTology variants to their long versions.
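The "blank Longformer plus weight copy" approach described above could look roughly like the following sketch (my reconstruction from the description, not the actual convert.ipynb; the name mapping and config fields follow the standard transformers classes and may need adjusting):

from transformers import LongformerConfig, LongformerForMaskedLM, XLMRobertaForMaskedLM

# Hedged sketch (not the actual notebook): build an untrained Longformer sized
# like XLM-R, then copy every weight whose name and shape line up.
src = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-base')
cfg = LongformerConfig(
    attention_window=256,
    max_position_embeddings=16384 + 2,   # +2 for the reserved RoBERTa positions
    vocab_size=src.config.vocab_size,
    hidden_size=src.config.hidden_size,
    num_hidden_layers=src.config.num_hidden_layers,
    num_attention_heads=src.config.num_attention_heads,
    intermediate_size=src.config.intermediate_size,
    type_vocab_size=src.config.type_vocab_size,
)
dst = LongformerForMaskedLM(cfg)

src_state = src.state_dict()
dst_state = dst.state_dict()
for name, tensor in dst_state.items():
    # map e.g. longformer.encoder...query_global -> roberta.encoder...query
    src_name = name.replace('longformer.', 'roberta.').replace('_global', '')
    if src_name in src_state and src_state[src_name].shape == tensor.shape:
        dst_state[name] = src_state[src_name].clone()
dst.load_state_dict(dst_state)
# The position embeddings (larger than XLM-R's 512) are skipped by the shape
# check above and would still need to be tiled, as in the conversion script
# earlier in this thread.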
@peakji Hey! thanks for sharing. When I click the notebook link I get a 404 error. Can you share it again?