caleb5312 opened this issue 1 year ago
Hi,
from xformers.factory.model_factory import xFormer, xFormerConfig
So at the moment we're not really maintaining this xFormerConfig / factory path, so it might not be the best thing to build on moving forward (plus it does not support memory-efficient attention).
Unfortunately I won't be able to help you with that (as I'm not familiar with this architecture, and it would take some time to reproduce it), but in general you can expect small numerical variations when switching between implementations.
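For context, the memory-efficient kernels are exposed directly through xformers.ops rather than through the factory blocks. A minimal sketch of calling the op directly (shapes, dtype and device are illustrative, not taken from the original report):

```python
import torch
from xformers.ops import memory_efficient_attention

# Illustrative shapes: batch=2, seq_len=128, heads=12, head_dim=64.
q = torch.randn(2, 128, 12, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 128, 12, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 128, 12, 64, device="cuda", dtype=torch.float16)

# Drop-in replacement for a standard scaled-dot-product attention call;
# the factory-built blocks do not route through this op.
out = memory_efficient_attention(q, k, v)  # -> [2, 128, 12, 64]
```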
I am also trying to reproduce a RoBERTa model with xFormer. Did you check that the number of parameters matches the one you get when loading the model through Hugging Face's from_pretrained function?
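A quick way to run that check (a sketch only; xformer_model stands in for the model built from the xFormer config):

```python
from transformers import XLMRobertaModel

def count_params(model):
    # Total number of parameters, trainable or not.
    return sum(p.numel() for p in model.parameters())

hf_model = XLMRobertaModel.from_pretrained("xlm-roberta-base")
# xformer_model = ...  # assumed: the model built from the xFormer factory config

print("HF model parameters:     ", count_params(hf_model))
print("xFormer model parameters:", count_params(xformer_model))
```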
@danthe3rd "it does not support memory-efficient attention" -> Could you expand on that? I am curious about it. Should we change the documentation to add a warning about the xFormer factory?
@caleb5312 Another possible problem is the configuration of your multi-head attention module:
"use_rotary_embeddings": True,
"attention": {
    "name": self.hparams.attention,
    "dropout": self.hparams.attn_pdrop,
    "causal": True,
    "seq_len": self.hparams.block_size,
    "num_rules": self.hparams.n_head,
},
Are you sure about the use of rotary embeddings in XLM-Roberta and about the causal setup (it is presumably a bidirectional encoder)?
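If that is the issue, the fragment would need to look more like the following for a bidirectional encoder (a sketch only; "scaled_dot_product" is assumed as the attention name, and the other fields mirror the snippet above):

```python
"use_rotary_embeddings": False,  # XLM-Roberta uses learned absolute position embeddings
"attention": {
    "name": "scaled_dot_product",
    "dropout": self.hparams.attn_pdrop,
    "causal": False,  # encoder: every token attends to the whole sequence
    "seq_len": self.hparams.block_size,
},
```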
❓ Questions and Help
Hi all, I'm trying to load a pretrained XLM-Roberta model from HuggingFace using xformers to examine the potential speed-up. To the best of my ability, I've defined a config that mimics the XLM-Roberta architecture, using the microGPT example script as a starting point. I then attempted to manually copy all weights over from the existing XLM-Roberta version.
However, when I run inference on the same example, I get different results. Here's the code I've used in my trials:
I'm using xformers==0.0.16, torch==1.13.1, transformers==4.27.4, triton==2.0.0.dev20221105.
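For reference, a rough sketch of the kind of factory config described above, with hyperparameters taken from xlm-roberta-base. This is an illustration based on the microGPT example, not the original script; field names (e.g. residual_norm_style) may differ between xformers versions, and the embedding side (HF's embedding LayerNorm and padding-offset position ids) would also need to be matched before copying weights.

```python
from xformers.factory.model_factory import xFormer, xFormerConfig

# Hyperparameters roughly matching xlm-roberta-base (assumed, for illustration).
xformer_config = [
    {
        "reversible": False,
        "block_type": "encoder",
        "num_layers": 12,
        "dim_model": 768,
        "residual_norm_style": "post",  # post-LN, as in BERT/RoBERTa
        "position_encoding_config": {
            "name": "vocab",  # learned absolute position embeddings
            "seq_len": 512,
            "vocab_size": 250002,  # xlm-roberta-base vocabulary size
        },
        "multi_head_config": {
            "num_heads": 12,
            "residual_dropout": 0.1,
            "use_rotary_embeddings": False,
            "attention": {
                "name": "scaled_dot_product",
                "dropout": 0.1,
                "causal": False,  # bidirectional encoder, no causal mask
                "seq_len": 512,
            },
        },
        "feedforward_config": {
            "name": "MLP",
            "dropout": 0.1,
            "activation": "gelu",
            "hidden_layer_multiplier": 4,  # 3072 / 768
        },
    }
]

model = xFormer.from_config(xFormerConfig(xformer_config))
# The HF checkpoint weights would then be copied into `model` layer by layer.
```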