facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

Porting Huggingface transformers using xformers #763

Open caleb5312 opened 1 year ago

caleb5312 commented 1 year ago

❓ Questions and Help

Hi all, I'm trying to load a pretrained XLM-RoBERTa model from Hugging Face with xformers to examine the potential speed-up. To the best of my ability, I've defined a config that mimics the XLM-RoBERTa architecture, starting from the microGPT example script, and then attempted to manually copy all of the weights over from the pretrained XLM-RoBERTa model.

However, when I run inference on the same example, I get different results. Here's the code I've been using:

from transformers import AutoConfig, AutoModelForTokenClassification, AutoTokenizer, pipeline
import torch

pretrained = 'cardiffnlp/xlm-roberta-base-sentiment-multilingual'

model_from_pretrained = AutoModelForTokenClassification.from_pretrained(pretrained)
tokenizer = AutoTokenizer.from_pretrained(pretrained)

import math
import os
import pytorch_lightning as pl
import threading
import torch.nn as nn
from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.utilities import rank_zero_info
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset, RandomSampler

from xformers.factory.model_factory import xFormer, xFormerConfig

class SXLMRoberta(pl.LightningModule):
    def __init__(
        self,
        hf_model,
        vocab_size,
        weight_decay=0.1,
        betas=(0.9, 0.95),
        learning_rate=6e-4,
        n_embd=768,
        block_size=514,
        n_layer=12,
        n_head=8,
        resid_pdrop=0.1,
        attn_pdrop=0.1,
        mlp_pdrop=0.1,
        attention="scaled_dot_product",
        hidden_layer_multiplier=4
    ):
        super().__init__()

        # auto creates self.hparams from the method signature
        self.save_hyperparameters()

        # A list of the encoder or decoder blocks which constitute the Transformer.
        xformer_config = [
            {
                "reversible": False,  # Turn on to test the effect of using reversible layers
                "block_type": "encoder",
                "num_layers": self.hparams.n_layer,
                "dim_model": self.hparams.n_embd,
                "residual_norm_style": "post",
                "position_encoding_config": {
                    "name": "vocab",
                    "seq_len": self.hparams.block_size,
                    "vocab_size": self.hparams.vocab_size,
                },
                "multi_head_config": {
                    "num_heads": self.hparams.n_head,
                    "residual_dropout": self.hparams.resid_pdrop,
                    "use_rotary_embeddings": True,
                    "attention": {
                        "name": self.hparams.attention,
                        "dropout": self.hparams.attn_pdrop,
                        "causal": True,
                        "seq_len": self.hparams.block_size,
                        "num_rules": self.hparams.n_head,
                    },
                },
                "feedforward_config": {
                    "name": "FusedMLP",  # Use MLP if Triton is not available
                    "dropout": self.hparams.mlp_pdrop,
                    "activation": "gelu",
                    "hidden_layer_multiplier": self.hparams.hidden_layer_multiplier,
                },
            }
        ]

        config = xFormerConfig(xformer_config)
        config.weight_init = "small"
        self.model = xFormer.from_config(config)

        # Assign the weights
        for i in range(self.hparams.n_layer):
            self.model._modules['encoders'][i].wrap_att.norm.weight = hf_model.roberta.encoder.layer[i].attention.output.LayerNorm.weight
            self.model._modules['encoders'][i].wrap_att.norm.bias = hf_model.roberta.encoder.layer[i].attention.output.LayerNorm.bias

            # Attention key, value and query
            self.model._modules['encoders'][i].wrap_att.sublayer.layer.in_proj_container.q_proj.weight = hf_model.roberta.encoder.layer[i].attention.self.query.weight
            self.model._modules['encoders'][i].wrap_att.sublayer.layer.in_proj_container.q_proj.bias = hf_model.roberta.encoder.layer[i].attention.self.query.bias
            self.model._modules['encoders'][i].wrap_att.sublayer.layer.in_proj_container.k_proj.weight = hf_model.roberta.encoder.layer[i].attention.self.key.weight
            self.model._modules['encoders'][i].wrap_att.sublayer.layer.in_proj_container.k_proj.bias = hf_model.roberta.encoder.layer[i].attention.self.key.bias
            self.model._modules['encoders'][i].wrap_att.sublayer.layer.in_proj_container.v_proj.weight = hf_model.roberta.encoder.layer[i].attention.self.value.weight
            self.model._modules['encoders'][i].wrap_att.sublayer.layer.in_proj_container.v_proj.bias = hf_model.roberta.encoder.layer[i].attention.self.value.bias

            self.model._modules['encoders'][i].wrap_att.sublayer.layer.proj.weight = hf_model.roberta.encoder.layer[i].attention.output.dense.weight
            self.model._modules['encoders'][i].wrap_att.sublayer.layer.proj.bias = hf_model.roberta.encoder.layer[i].attention.output.dense.bias

            self.model._modules['encoders'][i].wrap_ff.norm.weight = hf_model.roberta.encoder.layer[i].output.LayerNorm.weight
            self.model._modules['encoders'][i].wrap_ff.norm.bias = hf_model.roberta.encoder.layer[i].output.LayerNorm.bias

            self.model._modules['encoders'][i].wrap_ff.sublayer.layer.mlp[0].weight = hf_model.roberta.encoder.layer[i].intermediate.dense.weight
            self.model._modules['encoders'][i].wrap_ff.sublayer.layer.mlp[1].bias = hf_model.roberta.encoder.layer[i].intermediate.dense.bias

            self.model._modules['encoders'][i].wrap_ff.sublayer.layer.mlp[2].weight = hf_model.roberta.encoder.layer[i].output.dense.weight
            self.model._modules['encoders'][i].wrap_ff.sublayer.layer.mlp[3].bias = hf_model.roberta.encoder.layer[i].output.dense.bias

        self.dense = hf_model.classifier

        self.block_size = self.hparams.block_size
        self.apply(self._init_weights)

        self._tokens_seen = 0

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

        # Reset the token counter
        self._tokens_seen = 0

    def get_block_size(self):
        return self.block_size

    def forward(self, src):
        prediction = self.model(src)
        sequence_output = prediction[:, 0, :]

        logits = self.dense(sequence_output)

        return logits

if __name__ == "__main__":
    model = SXLMRoberta(
        hf_model=model_from_pretrained,
        vocab_size=250002,
        attention="scaled_dot_product"
    )

    text = 'hi'
    tokenized = tokenizer([text])['input_ids']
    tokenized = torch.tensor(tokenized)
    model_output = model.forward(tokenized)

    softmax = nn.Softmax(dim=1)
    print('xformers:', softmax(model_output))

    hf_pipeline = pipeline(
        'text-classification',
        model=pretrained,
        tokenizer=tokenizer,
        device=0,
        top_k=None
    )
    print('pure hf:', hf_pipeline(text))

I'm using xformers==0.0.16, torch==1.13.1, transformers==4.27.4, triton==2.0.0.dev20221105.
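
For what it's worth, a rough way to quantify the difference beyond comparing the two softmax printouts (hypothetical snippet reusing the variables above; AutoModelForTokenClassification returns per-token logits, so only the first-token position used in forward() is compared):

model.eval()  # make sure dropout is disabled for the comparison
with torch.no_grad():
    hf_logits = model_from_pretrained(tokenized).logits[:, 0, :]
    xf_logits = model(tokenized)

# Elementwise difference between the two sets of logits for the same input
print("max abs diff:", (hf_logits - xf_logits).abs().max().item())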

danthe3rd commented 1 year ago

Hi,

from xformers.factory.model_factory import xFormer, xFormerConfig

At the moment we're not really maintaining this xFormer config/factory path, so it might not be the best thing to build on moving forward (plus it does not support memory-efficient attention).

Unfortunately I won't be able to help you with that (as I'm not familiar with this architecture, and it would take some time to reproduce it), but in general you can expect small numerical variations when switching between implementations.
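
If memory-efficient attention is what you're after, calling the op directly is one option. A minimal sketch (the shapes, dtype and device below are illustrative only, not taken from your model):

import torch
import xformers.ops as xops

# [batch, seq_len, n_heads, head_dim] layout expected by the op
q = torch.randn(1, 514, 8, 96, device="cuda", dtype=torch.float16)
k = torch.randn(1, 514, 8, 96, device="cuda", dtype=torch.float16)
v = torch.randn(1, 514, 8, 96, device="cuda", dtype=torch.float16)

# No attention bias and no dropout in this sketch
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([1, 514, 8, 96])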

Lhemamou commented 1 year ago

I am also trying to reproduce a RoBERTa model with xFormer. Did you check that the number of parameters matches what you get when you load the model through Hugging Face's from_pretrained function?
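
Something like this is what I mean (hypothetical check, reusing the names and attribute paths from your snippet):

# Compare total parameter counts
hf_params = sum(p.numel() for p in model_from_pretrained.parameters())
xf_params = sum(p.numel() for p in model.parameters())
print(f"HF:       {hf_params:,}")
print(f"xformers: {xf_params:,}")

# Spot-check one copied tensor against a freshly loaded copy of the checkpoint
fresh = AutoModelForTokenClassification.from_pretrained(pretrained)
hf_q = fresh.roberta.encoder.layer[0].attention.self.query.weight
xf_q = model.model._modules['encoders'][0].wrap_att.sublayer.layer.in_proj_container.q_proj.weight
print("layer 0 q_proj matches:", torch.equal(hf_q, xf_q))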

@danthe3rd " it does not support memory-efficient attention" -> Could you expand your answer on that, I am curious about it. Should we change the documentation to add a warning concerning xFormer Factory ?

Lhemamou commented 1 year ago

@caleb5312 Another possible problem is the configuration of your multi-head module:

use_rotary_embeddings": True, "attention": { "name": self.hparams.attention, "dropout": self.hparams.attn_pdrop, "causal": True, "seq_len": self.hparams.block_size, "num_rules": self.hparams.n_head, },

Are you sure about the use of rotary embeddings in XLM-RoBERTa and about the causal setup (it's presumably a bidirectional encoder)?
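
Something closer to this might be what XLM-RoBERTa needs (sketch only, with values hard-coded to mirror the hyperparameters in the original script):

multi_head_config = {
    "num_heads": 8,
    "residual_dropout": 0.1,
    "use_rotary_embeddings": False,  # XLM-RoBERTa uses learned absolute position embeddings, not rotary
    "attention": {
        "name": "scaled_dot_product",
        "dropout": 0.1,
        "causal": False,  # bidirectional encoder, so no causal mask
        "seq_len": 514,
    },
}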