erip opened this issue 1 year ago

xFormers is an optimized toolkit for highly configurable transformer-based encoder-decoder models. Adding support for inference through CTranslate2 would be very useful for deploying xFormers models.
Thanks for the request.
How are you using xFormers currently? Are you building full models using xFormer.from_config(config), as shown in the documentation?
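For reference, that factory usage looks roughly like the sketch below. The config keys (e.g. "residual_norm_style") and all dimensions are illustrative, and key names have changed across xFormers releases, so treat this as a sketch rather than exact usage.

```python
from xformers.factory.model_factory import xFormer, xFormerConfig

# One encoder stack; a decoder stack would be declared similarly with
# block_type "decoder" plus masked/cross attention configs, then appended
# to the same list. All sizes below are placeholder values.
encoder_block = {
    "block_type": "encoder",
    "num_layers": 6,
    "dim_model": 512,
    "residual_norm_style": "pre",  # pre-norm vs. post-norm
    "position_encoding_config": {
        "name": "vocab",  # also provides the word embedding table
        "seq_len": 1024,
        "vocab_size": 32000,
    },
    "multi_head_config": {
        "num_heads": 8,
        "residual_dropout": 0.1,
        "attention": {
            "name": "scaled_dot_product",
            "dropout": 0.1,
            "causal": False,
        },
    },
    "feedforward_config": {
        "name": "MLP",
        "dropout": 0.1,
        "activation": "gelu",
        "hidden_layer_multiplier": 4,
    },
}

model = xFormer.from_config(xFormerConfig([encoder_block]))
```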
Yes, though I think you could infer the architecture from the serialized checkpoints relatively easily as well. Some bits might not be directly convertible (specialized attention schemes, for instance), but otherwise I think it could be a pretty light lift!
Usually the checkpoint is not enough to fully resolve the model architecture. We also need to know the activation functions, the norm style (pre-norm vs. post-norm), etc. This information is usually not saved in the checkpoint.
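As a quick illustration, a plain PyTorch state dict only stores parameter tensors, so none of these architecture choices can be read back from the file alone (the path below is just a placeholder):

```python
import torch

# Load a saved state dict: it maps parameter names to tensors, nothing more.
state_dict = torch.load("xformers_model.pt", map_location="cpu")

# Weight names and shapes are visible, but the activation function or the
# pre-/post-norm choice is not recorded anywhere in this file.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))
```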
A more general issue is that xFormers does not implement the input and output layers of the model. This means we can't provide a ready-to-use converter, since the conversion also depends on user code we don't know about. However, we can still provide a template or helper functions to process the xFormers model itself and let the user register the remaining modules.
I think the only layer xFormers doesn't implement is the output layer. I'd need to double-check whether activations, etc. are in the checkpoint (I think they might be, but fused via Triton), but a template would be super useful, where a user can point the converter at the checkpoint keys for the output layer weights, etc.!
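Something along these lines, for example. This is only a hypothetical sketch: the key "output_layer.weight" is made up and would depend on the user's own wrapping module around the xFormers model.

```python
import torch

def load_output_projection(checkpoint_path, output_projection_key):
    """Pull the user-defined output layer out of the checkpoint by key."""
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    return state_dict[output_projection_key].numpy()

# The user supplies the key of the output projection weights, since that
# layer lives outside the xFormers model itself.
output_weight = load_output_projection("model.pt", "output_layer.weight")
```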
> I think the only layer xFormers doesn't implement is the output layer.
Right, the word embedding layer is implemented in the "vocab" position embedding.
I also found that they don't implement the final layer norm in the encoder/decoder when using the pre-norm residual style. We actually support that, but it's a difference from all other pre-norm implementations.
Anyway, here's a possible implementation of an xFormers converter: https://gist.github.com/guillaumekln/4761f65df1ce3e80f5969fd0f0a2c7f5
It gives an idea of how the conversion works and which encoder-decoder configurations are supported. Please have a look at the TODOs and see if you can make it work for your model.
Hi @erip, did you have a chance to try a conversion?