allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

allenai `LongformerEncoderDecoderForConditionalGeneration` vs huggingface `LEDForConditionalGeneration` #192

Open EmilyAlsentzer opened 3 years ago

EmilyAlsentzer commented 3 years ago

I would like to be able to load a LED model into huggingface via e.g.

led = LEDForConditionalGeneration.from_pretrained('PATH/longformer-encdec-large-16384', gradient_checkpointing=True, use_cache=False)

However, I receive an AttributeError: Can't set attention_probs_dropout_prob with value 0.0 for LEDConfig. This occurs both with my custom LED model, converted via convert_bart_to_longformerencoderdecoder.py, and with the longformer-encdec-large-16384 model downloaded directly from this repo. I am using HuggingFace transformers==4.5.1.

How can I use the LED models with the transformers library? I realize that there is a model "allenai/led-large-16384" in the repo already, but I would like to be able to use my custom LED model with the transformers codebase. Thanks!
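
(A possible workaround, not something confirmed in this thread: the conversion script writes attention_probs_dropout_prob into config.json, and LEDConfig in transformers 4.5.x appears to expose that name only as a read-only alias of attention_dropout, so from_pretrained fails when it tries to set it. Stripping that key from config.json before loading may get past the error; the checkpoint path below just mirrors the call above.)

import json
import os

from transformers import LEDForConditionalGeneration

checkpoint = 'PATH/longformer-encdec-large-16384'  # same placeholder path as above

# remove the config key that LEDConfig cannot set
config_file = os.path.join(checkpoint, 'config.json')
with open(config_file) as f:
    config = json.load(f)
config.pop('attention_probs_dropout_prob', None)
with open(config_file, 'w') as f:
    json.dump(config, f, indent=2)

led = LEDForConditionalGeneration.from_pretrained(checkpoint, gradient_checkpointing=True, use_cache=False)

Even if the config error goes away, from_pretrained may still warn about missing or unexpected weight names, since the longformer-encdec checkpoints were not saved with LED's parameter naming; those warnings are worth checking before training.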

Dmitriuso commented 3 years ago

Hey guys, I think I've got more or less the same question: I have been following your research on extended attention mechanisms for transformer models, and I am very excited about the results your models (Longformer & LED) achieve in abstractive summarization.

Still, it is a pity that such models exist mostly for English for now. Like many others, I had the idea of reusing the weights of pre-trained models such as mBART or BARThez and replacing their attention mechanism with the Longformer attention mechanism.

In fact, the Longformer project offers a way to convert a model's attention into Longformer attention, but since the transformers library has changed quite a lot after the 3.1.0 version required by your project, I get an error when I try to fine-tune the converted model on new data (unknown argument hidden_state).

That is why I thought about replacing mBART's self-attention with the LEDEncoderAttention and LEDEncoderSelfAttention classes directly from modeling_led in transformers. The altered conversion script is here: https://pastebin.com/raw/Mw5ZQe3j. Converting this way, plus some tricks with the config, makes the model load with LEDForConditionalGeneration and keep the weights of the pre-trained model. Still, the output is disappointing: it just repeats itself. It looks like copying the positional embeddings is not enough. I wonder whether this approach is viable at all and what the cause of such output could be.

I also wondered whether you plan to publish multilingual models with Longformer self-attention.

ibeltagy commented 3 years ago

Dmitriuso commented 3 years ago

@ibeltagy thanks a lot for your reply! Of course, I'll be glad to share the updated conversion script. Here is what I used in my experiments with mBART and BARThez, replacing the encoder's self_attn with LEDEncoderAttention + LEDEncoderSelfAttention and then loading the result with LEDForConditionalGeneration:

import argparse
import logging
import os
import copy

from torch import nn
from transformers import AutoConfig, MBartConfig
from transformers import AutoTokenizer, MBartTokenizer, BarthezTokenizer
from transformers import AutoModelForSeq2SeqLM, MBartForConditionalGeneration, LEDForConditionalGeneration
from transformers.models.led.modeling_led import LEDEncoderAttention, LEDEncoderSelfAttention, LEDLearnedPositionalEmbedding
from transformers.models.mbart.modeling_mbart import MBartLearnedPositionalEmbedding

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

parser = argparse.ArgumentParser(
    description="Convert mBART to LED. Replaces the mBART encoder's self-attention with LEDEncoderSelfAttention")
parser.add_argument(
    '--base_model',
    type=str,
    default='facebook/mbart-large-cc25',
    help='The name or path of the base model you want to convert'
)
parser.add_argument(
    '--tokenizer_name_or_path',
    type=str,
    default='facebook/mbart-large-cc25',
    help='The name or path of the tokenizer'
)
parser.add_argument(
    '--save_model_to',
    type=str,
    required=True,
    default='./save',
    help='The path to save the converted model'
)
parser.add_argument(
    '--attention_window',
    type=int,
    default=1024,
    help='attention window size for longformer self attention (one sided)'
)
parser.add_argument(
    '--max_pos',
    type=int,
    default=4096 * 4,
    help='maximum encoder positions'
)

args = parser.parse_args()

# NOTE: this wrapper class is not referenced by create_long_model() below, and the
# embedding size is hard-coded to 768 (mbart-large-cc25 uses d_model=1024).
class LongformerEncoderDecoderForConditionalGeneration(MBartForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        self.led.encoder.embed_positions = MBartLearnedPositionalEmbedding(args.max_pos, 768)
        self.led.config.attention_probs_dropout_prob = 0.0
        for i, layer in enumerate(self.led.encoder.layers):
            layer.self_attn = LEDEncoderAttention(config, layer_id=i)

def create_long_model(
    save_model_to,
    base_model,
    tokenizer_name_or_path,
    attention_window,
    max_pos
):
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
    config = AutoConfig.from_pretrained(base_model)
    config.attention_mode = "sliding_chunks"
    config.autoregressive = False
    model.config = config

    # in (m)BART the attention dropout is called attention_dropout, but
    # LEDEncoderSelfAttention reads attention_probs_dropout_prob, so set it here
    config.attention_probs_dropout_prob = config.attention_dropout
    config.architectures = ["LEDForConditionalGeneration"]

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.model.encoder.embed_positions.weight.shape
    assert current_max_pos == config.max_position_embeddings + 2

    config.max_encoder_position_embeddings = max_pos
    config.max_decoder_position_embeddings = config.max_position_embeddings
    del config.max_position_embeddings
    max_pos += 2  # NOTE: BART has positions 0,1 reserved, so embedding size is max position + 2
    assert max_pos >= current_max_pos

    # allocate a larger position embedding matrix for the encoder
    new_encoder_pos_embed = model.model.encoder.embed_positions.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_encoder_pos_embed[k:(k + step)] = model.model.encoder.embed_positions.weight[2:]
        k += step
    model.model.encoder.embed_positions.weight.data = new_encoder_pos_embed

    # allocate a larger position embedding matrix for the decoder
    # new_decoder_pos_embed = model.model.decoder.embed_positions.weight.new_empty(max_pos, embed_size)
    # # copy position embeddings over and over to initialize the new position embeddings
    # k = 2
    # step = current_max_pos - 2
    # while k < max_pos - 1:
    #     new_decoder_pos_embed[k:(k + step)] = model.model.decoder.embed_positions.weight[2:]
    #     k += step
    # model.model.decoder.embed_positions.weight.data = new_decoder_pos_embed

    # replace each encoder layer's MBartAttention with LEDEncoderAttention (which wraps LEDEncoderSelfAttention)
    config.attention_window = [attention_window] * config.num_hidden_layers
    config.attention_dilation = [1] * config.num_hidden_layers

    for i, layer in enumerate(model.model.encoder.layers):
        led_encoder_attention = LEDEncoderAttention(config, layer_id=i)
        led_encoder_attention.longformer_self_attn = LEDEncoderSelfAttention(config, layer_id=i)

        led_encoder_attention.longformer_self_attn.query = layer.self_attn.q_proj
        # led_encoder_attention.longformer_self_attn.query.weight = layer.self_attn.q_proj.weight
        # led_encoder_attention.longformer_self_attn.query.bias = layer.self_attn.q_proj.bias

        led_encoder_attention.longformer_self_attn.key = layer.self_attn.k_proj
        # led_encoder_attention.longformer_self_attn.key.weight = layer.self_attn.k_proj.weight
        # led_encoder_attention.longformer_self_attn.key.bias = layer.self_attn.k_proj.bias

        led_encoder_attention.longformer_self_attn.value = layer.self_attn.v_proj
        # led_encoder_attention.longformer_self_attn.value.weight = layer.self_attn.v_proj.weight
        # led_encoder_attention.longformer_self_attn.value.bias = layer.self_attn.v_proj.bias

        led_encoder_attention.longformer_self_attn.query_global = copy.deepcopy(layer.self_attn.q_proj)
        # led_encoder_attention.longformer_self_attn.query_global.weight = copy.deepcopy(layer.self_attn.q_proj.weight)
        # led_encoder_attention.longformer_self_attn.query_global.bias = copy.deepcopy(layer.self_attn.q_proj.bias)

        led_encoder_attention.longformer_self_attn.key_global = copy.deepcopy(layer.self_attn.k_proj)
        # led_encoder_attention.longformer_self_attn.key_global.weight = copy.deepcopy(layer.self_attn.k_proj.weight)
        # led_encoder_attention.longformer_self_attn.key_global.bias = copy.deepcopy(layer.self_attn.k_proj.bias)

        led_encoder_attention.longformer_self_attn.value_global = copy.deepcopy(layer.self_attn.v_proj)
        # led_encoder_attention.longformer_self_attn.value_global.weight = copy.deepcopy(layer.self_attn.v_proj.weight)
        # led_encoder_attention.longformer_self_attn.value_global.bias = copy.deepcopy(layer.self_attn.v_proj.bias)

        led_encoder_attention.output = layer.self_attn.out_proj
        # led_encoder_attention.output.weight = layer.self_attn.out_proj.weight
        # led_encoder_attention.output.bias = layer.self_attn.out_proj.bias

        layer.self_attn = led_encoder_attention

    # expose the base model under the attribute name LED uses ("led") so the saved
    # state_dict also contains the led.* parameter names LEDForConditionalGeneration expects
    model.led = model.model
    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

def main():
    if not os.path.exists(args.save_model_to):
        os.mkdir(args.save_model_to)

    create_long_model(
        save_model_to=args.save_model_to,
        base_model=args.base_model,
        tokenizer_name_or_path=args.tokenizer_name_or_path,
        attention_window=args.attention_window,
        max_pos=args.max_pos
    )

    # sanity check: reload the tokenizer and make sure the converted checkpoint loads
    # as an LED model; TXT is just a sample French input ("My friends <mask> well,
    # but they eat too many eggs.") for manual experiments
    tokenizer = AutoTokenizer.from_pretrained(args.save_model_to)
    TXT = "Mes amis <mask> bien, mais ils mangent trop d'oeufs."
    longmodel = LEDForConditionalGeneration.from_pretrained(args.save_model_to)

if __name__ == "__main__":
    main()
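
For completeness, a hypothetical way to run the script (the file name is made up here, and the values simply restate the argparse defaults above, apart from the save path, which is an assumption):

python convert_mbart_to_led.py \
    --base_model facebook/mbart-large-cc25 \
    --tokenizer_name_or_path facebook/mbart-large-cc25 \
    --save_model_to ./mbart-led-16384 \
    --attention_window 1024 \
    --max_pos 16384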

The main changes are in the for i, layer in enumerate(model.model.encoder.layers): part. At first I thought I should copy the weights and biases of the Q/K/V projections in order to keep the weights of the pre-trained model, but then I noticed that the LED model names its submodules differently, so I just introduced a new attribute with model.led = model.model. This way, and with the config here: https://pastebin.com/raw/f6Mvn5bi, the converted model is initialized with the pre-trained weights.
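
(As an aside, not something from the thread: an equivalent way to line the weights up with LED's naming would be to rename the keys in the saved state_dict from the model. prefix to the led. prefix that LEDForConditionalGeneration expects; the checkpoint path below is an assumption.)

import torch

path = './mbart-led-16384/pytorch_model.bin'  # assumed output of save_pretrained()
state_dict = torch.load(path, map_location='cpu')

# remap "model.*" parameter names to the "led.*" names used by LEDForConditionalGeneration
renamed = {
    ('led.' + key[len('model.'):] if key.startswith('model.') else key): value
    for key, value in state_dict.items()
}
torch.save(renamed, path)

Either way, checking which keys from_pretrained reports as missing or unexpected is a quick way to confirm the weights were actually carried over.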

Still, the output is a disaster: it just repeats itself. It doesn't really get better when fine-tuned on custom data (I tried fine-tuning it for abstractive summarization). I wonder whether this approach (LEDEncoderAttention + LEDEncoderSelfAttention) is viable at all and what the cause of such output could be.
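
(For anyone debugging the repetition itself: below is a minimal decoding sketch under assumptions not stated in the thread, namely the checkpoint path and a placeholder input. LED models are usually given global attention on at least the first token, and beam search with no_repeat_ngram_size can damp literal repetition, though neither will rescue a conversion whose weights did not transfer correctly.)

import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

checkpoint = './mbart-led-16384'  # assumed path of the converted model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LEDForConditionalGeneration.from_pretrained(checkpoint)

long_document = "..."  # placeholder for the text to summarize
inputs = tokenizer(long_document, return_tensors='pt', truncation=True, max_length=16384)

# give the first token global attention, as recommended for LED
global_attention_mask = torch.zeros_like(inputs['attention_mask'])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    global_attention_mask=global_attention_mask,
    num_beams=4,
    no_repeat_ngram_size=3,
    max_length=256,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))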

edgartanaka commented 3 years ago

Hi @Dmitriuso. Thanks for posting your code! Did you have any luck overcoming the problem of the output just repeating itself?