Open EmilyAlsentzer opened 3 years ago
Hey guys, I think I've got more or less the same question: I have been following your research on extended attention mechanisms for transformer models, and I am very excited about the results your models (Longformer & LED) achieve in abstractive summarization.
Still, it is a pity that such models exist mostly for English for now. Like many others, I had the idea of reusing the weights of pre-trained models such as mBART or BARThez and replacing their attention mechanism with the Longformer attention mechanism.
In fact, the Longformer project offers a way to convert a model's attention into Longformer attention, but since the transformers library has changed quite a lot since the 3.1.0 version required for your project, I get an error when I try to fine-tune the converted model on new data (unknown argument `hidden_state`).
That is why I thought about replacing the mBART self-attention with the `LEDEncoderAttention` and `LEDEncoderSelfAttention` classes directly from `modeling_led` in transformers. The altered conversion script is here: https://pastebin.com/raw/Mw5ZQe3j. Converting this way, plus some tricks with the config, makes the model load with `LEDForConditionalGeneration` and keep the weights of the pre-trained model. Still, the output is a disappointment: it just repeats itself. It looks like copying the positional embeddings is not enough. I wonder whether this approach is viable at all and what could be the cause of such output.
Also, I wondered whether you plan on publishing multilingual models with Longformer self-attention.
The `LongformerEncoderDecoderForConditionalGeneration` checkpoint is not perfectly compatible with the HF `LEDForConditionalGeneration`, but the differences are small. The main difference is that `LongformerEncoderDecoderForConditionalGeneration.config.attention_window` is half of `LEDForConditionalGeneration.config.attention_window`.
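A minimal sketch of what that factor-of-two difference implies when porting a saved config to the HF format (the paths below are placeholders, and this assumes the stored value is the one-sided window while the HF config expects the two-sided one):

```python
import json

# Placeholder paths: point these at the original and the ported checkpoint directories.
src = "longformer-encdec-large-16384/config.json"
dst = "led-compatible-checkpoint/config.json"

with open(src) as f:
    cfg = json.load(f)

# The original checkpoint stores a one-sided window per layer; the HF LED config
# expects the full (two-sided) window, so double it before reusing the config.
window = cfg["attention_window"]
cfg["attention_window"] = [w * 2 for w in window] if isinstance(window, list) else window * 2

with open(dst, "w") as f:
    json.dump(cfg, f, indent=2)
```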
The `convert_bart_to_longformerencoderdecoder.py` script converts a HF BART checkpoint to a HF LED checkpoint. @Dmitriuso, if you can post your updated script on GitHub and highlight the changes to make it more readable, I will be happy to review.
@ibeltagy thanks a lot for your reply! Of course, I'll be glad to share the updated conversion script. Here is what I used for my experiments with mBART and BARThez, trying to replace their `self_attn` in the encoder with `LEDEncoderAttention` + `LEDEncoderSelfAttention`, and then launch it with `LEDForConditionalGeneration`:
```python
import argparse
import logging
import os
import copy

from torch import nn
from transformers import AutoConfig, MBartConfig
from transformers import AutoTokenizer, MBartTokenizer, BarthezTokenizer
from transformers import AutoModelForSeq2SeqLM, MBartForConditionalGeneration, LEDForConditionalGeneration
from transformers.models.led.modeling_led import LEDEncoderAttention, LEDEncoderSelfAttention, LEDLearnedPositionalEmbedding
from transformers.models.mbart.modeling_mbart import MBartLearnedPositionalEmbedding

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

parser = argparse.ArgumentParser(
    description="Convert BART to LongBART. Replaces BART encoder's SelfAttention with LongformerSelfAttention")
parser.add_argument(
    '--base_model',
    type=str,
    default='facebook/mbart-large-cc25',
    help='The name or path of the base model you want to convert'
)
parser.add_argument(
    '--tokenizer_name_or_path',
    type=str,
    default='facebook/mbart-large-cc25',
    help='The name or path of the tokenizer'
)
parser.add_argument(
    '--save_model_to',
    type=str,
    required=True,
    default='./save',
    help='The path to save the converted model'
)
parser.add_argument(
    '--attention_window',
    type=int,
    default=1024,
    help='attention window size for longformer self attention (one sided)'
)
parser.add_argument(
    '--max_pos',
    type=int,
    default=4096 * 4,
    help='maximum encoder positions'
)
args = parser.parse_args()


class LongformerEncoderDecoderForConditionalGeneration(MBartForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        self.led.encoder.embed_positions = MBartLearnedPositionalEmbedding(args.max_pos, 768)
        self.led.config.attention_probs_dropout_prob = 0.0
        for i, layer in enumerate(self.led.encoder.layers):
            layer.self_attn = LEDEncoderAttention(config, layer_id=i)


def create_long_model(
    save_model_to,
    base_model,
    tokenizer_name_or_path,
    attention_window,
    max_pos
):
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
    config = AutoConfig.from_pretrained(base_model)
    config.attention_mode = "sliding_chunks"
    config.autoregressive = False
    model.config = config

    # in BART attention_probs_dropout_prob is attention_dropout, but LongformerSelfAttention
    # expects attention_probs_dropout_prob, so set it here
    config.attention_probs_dropout_prob = config.attention_dropout
    config.architectures = ["LEDForConditionalGeneration"]

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.model.encoder.embed_positions.weight.shape
    assert current_max_pos == config.max_position_embeddings + 2

    config.max_encoder_position_embeddings = max_pos
    config.max_decoder_position_embeddings = config.max_position_embeddings
    del config.max_position_embeddings
    max_pos += 2  # NOTE: BART has positions 0, 1 reserved, so embedding size is max position + 2
    assert max_pos >= current_max_pos

    # allocate a larger position embedding matrix for the encoder
    new_encoder_pos_embed = model.model.encoder.embed_positions.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_encoder_pos_embed[k:(k + step)] = model.model.encoder.embed_positions.weight[2:]
        k += step
    model.model.encoder.embed_positions.weight.data = new_encoder_pos_embed

    # allocate a larger position embedding matrix for the decoder
    # new_decoder_pos_embed = model.model.decoder.embed_positions.weight.new_empty(max_pos, embed_size)
    # # copy position embeddings over and over to initialize the new position embeddings
    # k = 2
    # step = current_max_pos - 2
    # while k < max_pos - 1:
    #     new_decoder_pos_embed[k:(k + step)] = model.model.decoder.embed_positions.weight[2:]
    #     k += step
    # model.model.decoder.embed_positions.weight.data = new_decoder_pos_embed

    # replace the `modeling_bart.SelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    config.attention_dilation = [1] * config.num_hidden_layers
    for i, layer in enumerate(model.model.encoder.layers):
        led_encoder_attention = LEDEncoderAttention(config, layer_id=i)
        led_encoder_attention.longformer_self_attn = LEDEncoderSelfAttention(config, layer_id=i)

        led_encoder_attention.longformer_self_attn.query = layer.self_attn.q_proj
        # led_encoder_attention.longformer_self_attn.query.weight = layer.self_attn.q_proj.weight
        # led_encoder_attention.longformer_self_attn.query.bias = layer.self_attn.q_proj.bias
        led_encoder_attention.longformer_self_attn.key = layer.self_attn.k_proj
        # led_encoder_attention.longformer_self_attn.key.weight = layer.self_attn.k_proj.weight
        # led_encoder_attention.longformer_self_attn.key.bias = layer.self_attn.k_proj.bias
        led_encoder_attention.longformer_self_attn.value = layer.self_attn.v_proj
        # led_encoder_attention.longformer_self_attn.value.weight = layer.self_attn.v_proj.weight
        # led_encoder_attention.longformer_self_attn.value.bias = layer.self_attn.v_proj.bias

        led_encoder_attention.longformer_self_attn.query_global = copy.deepcopy(layer.self_attn.q_proj)
        # led_encoder_attention.longformer_self_attn.query_global.weight = copy.deepcopy(layer.self_attn.q_proj.weight)
        # led_encoder_attention.longformer_self_attn.query_global.bias = copy.deepcopy(layer.self_attn.q_proj.bias)
        led_encoder_attention.longformer_self_attn.key_global = copy.deepcopy(layer.self_attn.k_proj)
        # led_encoder_attention.longformer_self_attn.key_global.weight = copy.deepcopy(layer.self_attn.k_proj.weight)
        # led_encoder_attention.longformer_self_attn.key_global.bias = copy.deepcopy(layer.self_attn.k_proj.bias)
        led_encoder_attention.longformer_self_attn.value_global = copy.deepcopy(layer.self_attn.v_proj)
        # led_encoder_attention.longformer_self_attn.value_global.weight = copy.deepcopy(layer.self_attn.v_proj.weight)
        # led_encoder_attention.longformer_self_attn.value_global.bias = copy.deepcopy(layer.self_attn.v_proj.bias)

        led_encoder_attention.output = layer.self_attn.out_proj
        # led_encoder_attention.output.weight = layer.self_attn.out_proj.weight
        # led_encoder_attention.output.bias = layer.self_attn.out_proj.bias

        layer.self_attn = led_encoder_attention

    model.led = model.model

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer


def main():
    if not os.path.exists(args.save_model_to):
        os.mkdir(args.save_model_to)
    create_long_model(
        save_model_to=args.save_model_to,
        base_model=args.base_model,
        tokenizer_name_or_path=args.tokenizer_name_or_path,
        attention_window=args.attention_window,
        max_pos=args.max_pos
    )
    tokenizer = AutoTokenizer.from_pretrained(args.save_model_to)
    TXT = "Mes amis <mask> bien, mais ils mangent trop d'oeufs."
    longmodel = LEDForConditionalGeneration.from_pretrained(args.save_model_to)


if __name__ == "__main__":
    main()
```
The main changes are in the `for i, layer in enumerate(model.model.encoder.layers):` part. At first I thought I should copy the weights and biases of the Q/K/V projections in order to keep the weights of the pre-trained model, but later I noticed that LED models name the model's parts differently, so I just introduced an alias, `model.led = model.model`. This way, and with the config here: https://pastebin.com/raw/f6Mvn5bi, the converted model is initialized with the pre-trained weights.
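An equivalent, more explicit variant of that trick would be to rename the saved weight keys instead of aliasing the module, since `MBartForConditionalGeneration` stores its base model under the `model.` prefix while `LEDForConditionalGeneration` expects `led.`. A rough sketch (the checkpoint path is a placeholder, and this is untested on the exact conversion above):

```python
import torch

# Placeholder path: the checkpoint saved by the conversion script above.
ckpt = "converted-mbart-led/pytorch_model.bin"
state_dict = torch.load(ckpt, map_location="cpu")

# MBart keys look like "model.encoder...."; LED expects "led.encoder....",
# so rewrite the prefix instead of relying on model.led = model.model.
renamed = {
    ("led." + k[len("model."):] if k.startswith("model.") else k): v
    for k, v in state_dict.items()
}
torch.save(renamed, ckpt)
```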
Still, the output is a disaster: it just repeats itself. It doesn't really get better when fine-tuned on custom data (I tried to fine-tune it for abstractive summarization). I wonder whether this approach (`LEDEncoderAttention` + `LEDEncoderSelfAttention`) is viable at all and what could be the cause of such output.
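One thing worth ruling out with an LED-style model is generating without a `global_attention_mask`. A minimal generation sketch with global attention on the first token (the checkpoint path is a placeholder; for an mBART-derived model the target-language start token may also need to be forced):

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

# Placeholder path to the converted checkpoint.
tokenizer = AutoTokenizer.from_pretrained("./converted-mbart-led")
model = LEDForConditionalGeneration.from_pretrained("./converted-mbart-led")

inputs = tokenizer("Un long document à résumer ...", return_tensors="pt")

# LED-style attention expects at least one token with global attention;
# the first token is the usual choice for summarization.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    num_beams=4,
    max_length=256,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```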
Hi @Dmitriuso. Thanks for posting your code! Did you have any luck in overcoming the problem of "the output is a disaster, it just repeats itself"?
I would like to be able to load a LED model into huggingface via e.g.

```python
led = LEDForConditionalGeneration.from_pretrained('PATH/longformer-encdec-large-16384', gradient_checkpointing=True, use_cache=False)
```

However, I receive an AttributeError: `Can't set attention_probs_dropout_prob with value 0.0 for LEDConfig`. This occurs with both my custom LED model that I converted via `convert_bart_to_longformerencoderdecoder.py` and with the LED model `longformer-encdec-large-16384` downloaded directly from this repo. This occurs with HuggingFace transformers==4.5.1. How can I use the LED models with the transformers library? I realize that there is a model `"allenai/led-large-16384"` in the repo already, but I would like to be able to use my custom LED model with the transformers codebase. Thanks!
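Since the error names the exact key, one pragmatic workaround (a sketch, not an official fix; the path is a placeholder) is to drop `attention_probs_dropout_prob` from the checkpoint's `config.json` before loading, as LED's config tracks attention dropout via `attention_dropout` instead:

```python
import json

config_path = "PATH/longformer-encdec-large-16384/config.json"  # placeholder path

with open(config_path) as f:
    cfg = json.load(f)

# LEDConfig in transformers 4.x refuses to set this legacy key; LED uses
# `attention_dropout` for the same purpose, so the key can simply be removed.
cfg.pop("attention_probs_dropout_prob", None)

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```

After that, `LEDForConditionalGeneration.from_pretrained(...)` should at least get past the config error; whether the weights and `attention_window` values line up is a separate question.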