allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

XLM-R support #21

Open JohannesTK opened 4 years ago

JohannesTK commented 4 years ago

Hey,

Congratulations on the impressive results and thank you for open-sourcing the work! 🤗

I have a question: do you also plan to implement Longformer for XLM-R? Cross-lingual NLP with long text would be extremely useful.

Thanks & stay healthy, Johannes

ibeltagy commented 4 years ago

We don't have plans to implement it for XLM-R, but our procedure for pretraining Longformer starting from the RoBERTa checkpoint (beginning of Section 5) can be easily applied to most other models (including XLM-R). Here's a summary of the steps:

- Load the pretrained model and tokenizer, and increase the tokenizer's `model_max_length` to the target sequence length (e.g. 4096).
- Extend the position embedding matrix to the new length, initializing it by repeatedly copying the pretrained 512-position embeddings.
- Replace each layer's self-attention with `LongformerSelfAttention`, initializing its local and global query/key/value projections from the pretrained attention weights.
- Continue pretraining with the MLM objective on a corpus of long documents.

JohannesTK commented 4 years ago

Thanks for the thorough answer & advice!

ibeltagy commented 4 years ago

@JohannesTK, in case you are still interested, I have just added a notebook that demonstrates how we pretrain Longformer starting from the RoBERTa checkpoint. It should be easy to reuse this notebook to pretrain your XLM-R-Long. The notebook is here: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb

JohannesTK commented 4 years ago

@ibeltagy, thank you! Will give it a spin.

davidhsv commented 4 years ago

I tried to do that, but I'm getting an error:

```
C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py:26: UserWarning: There is an imbalance between your GPUs. You may want to exclude GPU 1 which has less than 75% of the memory or cores of GPU 0. You can do so by setting the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
INFO:transformers.trainer: Running Evaluation
INFO:transformers.trainer:  Num examples = 2461
INFO:transformers.trainer:  Batch size = 16
Evaluation:   0%|          | 0/154 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "", line 149, in
  File "", line 87, in pretrain_and_evaluate
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\trainer.py", line 745, in evaluate
    output = self._prediction_loop(eval_dataloader, description="Evaluation")
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\trainer.py", line 823, in _prediction_loop
    outputs = model(**inputs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_roberta.py", line 231, in forward
    outputs = self.roberta(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 755, in forward
    encoder_outputs = self.encoder(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 433, in forward
    layer_outputs = layer_module(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 370, in forward
    self_attention_outputs = self.attention(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 314, in forward
    self_outputs = self.self(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() takes from 2 to 4 positional arguments but 7 were given
```

davidhsv commented 4 years ago

The code:

```python
# %%

import logging
import os
import math
from dataclasses import dataclass, field
from transformers import XLMRobertaForMaskedLM, LongformerTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention
from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import LineByLineTextDataset

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


class XLMRobertaLongForMaskedLM(XLMRobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = LongformerSelfAttention(config, layer_id=i)


def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-large')
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large", model_max_length=max_pos, use_fast=True)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer


def copy_proj_layers(model):
    for i, layer in enumerate(model.roberta.encoder.layer):
        layer.attention.self.query_global = layer.attention.self.query
        layer.attention.self.key_global = layer.attention.self.key
        layer.attention.self.value_global = layer.attention.self.value
    return model


def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):
    val_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path=args.val_datapath, block_size=tokenizer.max_len)
    if eval_only:
        train_dataset = val_dataset
    else:
        logger.info(f'Loading and tokenizing training data is usually slow: {args.train_datapath}')
        train_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path=args.train_datapath, block_size=tokenizer.max_len)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, eval_dataset=val_dataset, prediction_loss_only=True, )

    eval_loss = trainer.evaluate()
    eval_loss = eval_loss['eval_loss']
    logger.info(f'Initial eval bpc: {eval_loss / math.log(2)}')

    if not eval_only:
        trainer.train(model_path=model_path)
        trainer.save_model()

        eval_loss = trainer.evaluate()
        eval_loss = eval_loss['eval_loss']
        logger.info(f'Eval bpc after pretraining: {eval_loss / math.log(2)}')


@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})


parser = HfArgumentParser((TrainingArguments, ModelArgs,))

training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    'script.py',
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '1',  # 2 - 32GB gpu with fp32
    # '--device', 'cuda0',  # one GPU
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

model_path = f'{training_args.output_dir}/xlm-roberta-large-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting roberta-base into roberta-large-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

logger.info(f'Loading the model from {model_path}')
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large", use_fast=True)
model = XLMRobertaLongForMaskedLM.from_pretrained(model_path)

logger.info(f'Pretraining xlm-roberta-base-{model_args.max_pos} ... ')

training_args.max_steps = 3  ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<

pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)

logger.info(f'Copying local projection layers into global projection layers ... ')
model = copy_proj_layers(model)
logger.info(f'Saving model to {model_path}')
model.save_pretrained(model_path)
```

ibeltagy commented 4 years ago

I don't know which version of HF transformers you have, so I can't be sure, but it looks like the forward function of BertSelfAttention (here) has a different input format than LongformerSelfAttention (here). You can implement a small class around LongformerSelfAttention that takes the input in the format BERT passes and converts it to the format LongformerSelfAttention expects. We did the same thing when working on converting BART (check here).
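A minimal sketch of such a wrapper, for illustration only: the class name is made up, the import path matches the transformers versions used in this thread, and the exact argument lists of `BertSelfAttention.forward` and `LongformerSelfAttention.forward` vary between releases, so check the signatures in your installed version before relying on this.

```python
from torch import nn
from transformers.modeling_longformer import LongformerSelfAttention


class LongformerSelfAttentionForBert(nn.Module):
    """Accepts BertSelfAttention-style call arguments and forwards only the
    ones LongformerSelfAttention actually understands."""

    def __init__(self, config, layer_id):
        super().__init__()
        self.longformer_self_attn = LongformerSelfAttention(config, layer_id=layer_id)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,                # ignored: not used by LongformerSelfAttention
        encoder_hidden_states=None,    # ignored: no cross-attention here
        encoder_attention_mask=None,   # ignored
        output_attentions=False,       # ignored in this sketch
    ):
        # Drop the BERT-only arguments and call Longformer attention with the
        # subset it expects (verify against the signature in your version).
        return self.longformer_self_attn(hidden_states, attention_mask=attention_mask)
```

If you go this route, note that the q/k/v weight copying in `create_long_model` would then need to target the inner module, e.g. `layer.attention.self.longformer_self_attn.query = ...`.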

davidhsv commented 4 years ago

Thanks for the response! Unfortunately, I'm a newbie in this area. I would love to have the best multilingual model as a Longformer, so I'll subscribe for any news!

ibeltagy commented 4 years ago

This is a pretty easy issue to fix. Put a breakpoint here, then compare the parameters passed to self.self(...) with the arguments expected by LongformerSelfAttention here.
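One quick way to do that comparison (assuming a transformers release from around that time, where these classes are still importable from these module paths) is to print both signatures:

```python
import inspect

from transformers.modeling_bert import BertSelfAttention
from transformers.modeling_longformer import LongformerSelfAttention

# Compare what BertLayer passes to `self.attention.self(...)` with what
# LongformerSelfAttention.forward is able to accept.
print(inspect.signature(BertSelfAttention.forward))
print(inspect.signature(LongformerSelfAttention.forward))
```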

davidhsv commented 4 years ago

I tried to rerun the original notebook, just for my own sanity, and I can't make it run. I tried Python 3.8 and 3.7, installing the requirements with `pip install -r requirements.txt` in a PyCharm conda env. I also tried Google Colab, with no luck there either.

Take a look here: https://colab.research.google.com/drive/1skFNZ1pil1YG6mzO8jLGE4L-AASGTN5E?usp=sharing

My main goal is to make it work first and then switch it to XLM-RoBERTa. I think that will be a simple model change, because XLM-RoBERTa uses the same architecture as RoBERTa.

Thank you for your help in advance!

ibeltagy commented 4 years ago

Thanks, @davidhsv, for reporting this. It looks like the recent release of the HF code changed LongformerSelfAttention a bit, making it less compatible with BertSelfAttention. I will fix the notebook soon and let you know.

davidhsv commented 4 years ago

Thanks! I really appreciate that :)

Sorry for not being able to contribute more; I'm a recovering Java developer learning data science.

samru-rai commented 4 years ago

Does anyone have an approximation of how long it takes to pretrain the XLM-R model, assuming we pretrain from a checkpoint of XLM-R from HF (https://huggingface.co/transformers/model_doc/xlmroberta.html)?

ibeltagy commented 4 years ago

@davidhsv, @samru-rai, fixed. Can you please try the notebook again and let me know if you run into any issues?

ibeltagy commented 4 years ago

> how long it takes to pretrain XLM-R model?

@samru-rai, as mentioned in the notebook, you can still get a reasonable model even with zero pretraining. Additional pretraining definitely helps but for RoBERTa you get diminishing returns after processing around 800M tokens (around 2 days on a single GPU). With models other than RoBERTa, you will probably see the same general pattern but with different numbers.

CyndxAI commented 4 years ago

@ibeltagy Regarding diminishing returns on additional pretraining, you mean in terms of improvements on the same pretraining corpus, right? Not, e.g., domain-adaptive pretraining on a different corpus?

ibeltagy commented 4 years ago

@CyndxAI, good point. Yes, if you are training the long version while also adapting to a new domain, more training will be needed.

MarkusSagen commented 3 years ago

> Does anyone have an approximation of how long it takes to pretrain the XLM-R model, assuming we pretrain from a checkpoint of XLM-R from HF (https://huggingface.co/transformers/model_doc/xlmroberta.html)?

For me, training on a single GPU with the same hyperparameters for 3000 iterations, and with transformers 3.0.2, took 3 days and 11 hours.

I also used fp16 and no gradient checkpointing.

rplawate commented 3 years ago

> Does anyone have an approximation of how long it takes to pretrain the XLM-R model, assuming we pretrain from a checkpoint of XLM-R from HF (https://huggingface.co/transformers/model_doc/xlmroberta.html)?
>
> For me, training on a single GPU with the same hyperparameters for 3000 iterations, and with transformers 3.0.2, took 3 days and 11 hours.
>
> I also used fp16 and no gradient checkpointing.

@MarkusSagen Is there any chance you could share the pretrained multilingual longformer model and inference code?

MarkusSagen commented 3 years ago

@rplawate It depends. I'm doing a master's thesis at a company, investigating whether long context can be transferred to low-resource languages by extending the context of multilingual models and training in English only. If I get permission from the company, then yes, the aim is to release it to Hugging Face. I settled on training for 6000 iterations, but the training and eval loss could be decreased further.

rplawate commented 3 years ago

@markussagen Interesting. I hope you will open-source it. If it is of any help, I can provide resources for training it further than 6000 iterations.

MarkusSagen commented 3 years ago

> @MarkusSagen Interesting. I hope you will open-source it. If it is of any help, I can provide resources for training it further than 6000 iterations.

@rplawate I've gotten the go-ahead. Send me an email and we can take it from there if it is still of interest.

peakji commented 1 year ago

For those still interested, I've made a model initialized with XLM-RoBERTa's weights without further pretraining. The output of the long version should be identical to that of the original model for input sequences shorter than half the attention window size.

As @ibeltagy mentioned earlier, the intermediate model produced by just copying the position embeddings and linear projections is already good enough to be fine-tuned on a downstream task.

The model could also be used as a starting point for pretraining on other languages, like what @MarkusSagen did with the English WikiText-103 corpus.

Variants of the model are available on Hugging Face model hub:

| Model | attention_window | hidden_size | num_hidden_layers | model_max_length |
| --- | --- | --- | --- | --- |
| base | 256 | 768 | 12 | 16384 |
| large | 512 | 1024 | 24 | 16384 |

And the notebook for replicating the models is available here: https://github.com/hyperonym/dirge/blob/master/models/xlm-roberta-longformer/convert.ipynb. Instead of swapping out RoBERTa's self-attention implementation, the notebook starts with a blank Longformer and copies the weights into it, which might make it easier to convert other BERTology variants to their long versions.
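For anyone who lands here while the link is broken, here is a rough, editorial sketch of the blank-Longformer weight-copy idea; it is not peakji's actual notebook. It assumes a reasonably recent transformers release, and the state-dict key names, the chosen attention window, and the output directory are assumptions to verify against your installed version.

```python
from transformers import LongformerConfig, LongformerForMaskedLM, XLMRobertaForMaskedLM

# Start from pretrained XLM-R and a blank (randomly initialized) Longformer
# whose dimensions match, but with a longer position embedding matrix.
roberta = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
r_cfg = roberta.config

max_pos = 4096 + 2  # +2 for the two reserved RoBERTa-style position ids
l_cfg = LongformerConfig(
    vocab_size=r_cfg.vocab_size,
    hidden_size=r_cfg.hidden_size,
    num_hidden_layers=r_cfg.num_hidden_layers,
    num_attention_heads=r_cfg.num_attention_heads,
    intermediate_size=r_cfg.intermediate_size,
    max_position_embeddings=max_pos,
    attention_window=[256] * r_cfg.num_hidden_layers,  # illustrative choice
    type_vocab_size=r_cfg.type_vocab_size,
    pad_token_id=r_cfg.pad_token_id,
    bos_token_id=r_cfg.bos_token_id,
    eos_token_id=r_cfg.eos_token_id,
)
longformer = LongformerForMaskedLM(l_cfg)

src = roberta.state_dict()
dst = longformer.state_dict()

# 1) Copy every tensor whose name (modulo the "roberta."/"longformer." prefix)
#    and shape already match: embeddings, layer norms, FFNs, the LM head, and
#    the local q/k/v projections.
for name, tensor in src.items():
    target = name.replace("roberta.", "longformer.", 1)
    if target in dst and dst[target].shape == tensor.shape:
        dst[target] = tensor.clone()

# 2) Initialize the global attention projections from the local ones.
for i in range(l_cfg.num_hidden_layers):
    for proj in ("query", "key", "value"):
        for p in ("weight", "bias"):
            dst[f"longformer.encoder.layer.{i}.attention.self.{proj}_global.{p}"] = \
                src[f"roberta.encoder.layer.{i}.attention.self.{proj}.{p}"].clone()

# 3) Tile the 512 pretrained positions (minus the 2 reserved ones) to fill the
#    longer position embedding matrix.
old_pos = src["roberta.embeddings.position_embeddings.weight"]
new_pos = dst["longformer.embeddings.position_embeddings.weight"].clone()
new_pos[:2] = old_pos[:2]
k, step = 2, old_pos.shape[0] - 2
while k < new_pos.shape[0]:
    n = min(step, new_pos.shape[0] - k)
    new_pos[k:k + n] = old_pos[2:2 + n]
    k += n
dst["longformer.embeddings.position_embeddings.weight"] = new_pos

longformer.load_state_dict(dst)
longformer.save_pretrained("xlm-roberta-longformer-base-4096")  # hypothetical output path
```

A simple sanity check of the result is to compare the MLM logits of `roberta` and `longformer` on a short input; per the comment above, they should match (up to numerical noise) for sequences shorter than half the attention window.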

ricardorei commented 8 months ago

@peakji Hey! Thanks for sharing. When I click the notebook link, I get a 404 error. Can you share it again?