OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Support for Zephyr and other "StableLmForCausalLM" models? #1649

Open BBC-Esq opened 5 months ago

BBC-Esq commented 5 months ago

Any plans to support conversion of `StableLmForCausalLM` models? I've noticed that they're very good; for example, the new Zephyr model here:

https://huggingface.co/stabilityai/stablelm-zephyr-3b

Amazing performance for a 3B model, much better than Phi-2 IMHO. Support was added to Transformers in version 4.38.0:

https://github.com/huggingface/transformers/releases/tag/v4.38.0

Here's the link to a description of the model architecture to help:

https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/stablelm
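
For context, conversion for other decoder-only models goes through the Transformers converter, so I'd expect StableLM to follow the same path once a loader exists. A sketch only; today this should just refuse the architecture since nothing is registered for it:

```python
# Sketch only: the normal Transformers conversion path used for other
# decoder-only models. With no loader registered for StableLmConfig,
# the converter should currently reject this architecture.
import ctranslate2.converters

converter = ctranslate2.converters.TransformersConverter(
    "stabilityai/stablelm-zephyr-3b"
)
converter.convert("stablelm-zephyr-3b-ct2", quantization="int8")
```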

BBC-Esq commented 4 months ago

Here is yet another badass model @minhthuc2502. Would love to help create a converter but am not an expert. It's the 1.6B version of Zephyr:

https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b

It kicks ass for its size. The only other small model with a context size over 4,000 is Gemma, which, at least in my testing, royally sucks (referring to Gemma 2B, newest 1.1 version included).

Currently, the only reasonable option for building a chat application with CTranslate2 using a model smaller than 7B is Gemma. I say "reasonable" because the Phi converter is currently broken due to changes in Phi-2, and, at any rate, Phi-2 only has a context of 2048.

Zephyr 3B and Zephyr 1.6B are the best in their class, way better than Gemma 2B. Another viable option would be a converter for Qwen, which actually has a 0.5B model.

Here are tests for Gemma and others on a basic RAG question. Gemma 2B only got half the answer right no matter how many beams I used; the Zephyr 1.6B model, however, gave a 100% correct answer at a beam size of 1.

In short, Gemma 2B is fast but sucks, while Zephyr is only slightly less fast but IS ABSOLUTELY AWESOME.
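
For reference, the CTranslate2 side of my tests is just the standard generator API. A minimal sketch, with the converted model directory, tokenizer ID, and prompt as placeholders:

```python
import ctranslate2
import transformers

# Converted model directory and tokenizer ID are placeholders.
generator = ctranslate2.Generator("gemma-2b-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/gemma-2b-it")

prompt = "Answer using only the context below: ..."  # the RAG question
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# beam_size=1 is greedy decoding; the tests above just vary this value.
results = generator.generate_batch(
    [tokens],
    beam_size=1,
    max_length=256,
    include_prompt_in_result=False,
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```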

NOTE: The models in the legend with "ct2" in their name are obviously CTranslate2 models. The other models were tested using Transformers along with bitsandbytes (4-bit), just FYI.
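
For the non-ct2 models, that's the usual Transformers 4-bit setup, roughly like this (the model ID and generation settings here are illustrative, not my exact harness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "stabilityai/stablelm-2-zephyr-1_6b"  # illustrative
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer(
    "What is retrieval-augmented generation?", return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```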

Lastly, llama.cpp already supports Zephyr, Qwen, and others, but I'd rather not switch due to the additional dependency... Let me know @minhthuc2502 if you'll reconsider making this a higher priority. I know you're busy... thanks dude.

[chart: RAG test results at various beam sizes for the ct2 and transformers+bitsandbytes models]

BBC-Esq commented 4 months ago

To maybe save you a few minutes, I've gathered the following information for whoever picks this up:

1) The config.json states that the architecture is "StableLmForCausalLM"

2) I think this is the relevant documentation: https://huggingface.co/docs/transformers/v4.40.0/en/model_doc/stablelm

3) Additional info: https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo

Based on this snippet, hopefully it wouldn't be too complicated to create a converter for it...

[screenshot: StableLM architecture snippet from the report above]
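
And in case it helps kick things off, here's a very rough, unverified sketch of what the loader registration might look like, copied from the shape of the existing Llama/GPT-NeoX loaders; every spec argument is my guess from the HF `StableLmConfig` and would need checking against modeling_stablelm.py:

```python
# Very rough sketch of a loader for ctranslate2/converters/transformers.py,
# modeled on the existing Llama/GPT-NeoX loaders. All spec arguments are
# assumptions read off the HF StableLmConfig, not a verified implementation.
from ctranslate2.converters.transformers import ModelLoader, register_loader
from ctranslate2.specs import common_spec, transformer_spec


@register_loader("StableLmConfig")
class StableLmLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "StableLmForCausalLM"

    def get_model_spec(self, model):
        # StableLM uses partial rotary embeddings (partial_rotary_factor,
        # e.g. 0.25), a SiLU-gated feed-forward, and pre-norm LayerNorm.
        head_dim = model.config.hidden_size // model.config.num_attention_heads
        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            model.config.num_hidden_layers,
            model.config.num_attention_heads,
            activation=common_spec.Activation.SWISH,  # SiLU
            pre_norm=True,
            ffn_glu=True,
            rotary_dim=int(model.config.partial_rotary_factor * head_dim),
            rotary_interleave=False,
        )
        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def set_decoder(self, spec, module):
        # The actual weight mapping (embeddings, per-layer attention/FFN
        # projections, layer norms) would go here, following what
        # LlamaLoader.set_decoder does for its weight names.
        ...
```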