OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

ctranslate2 compared to ggml/gguf and gptq #1486

Closed: BBC-Esq closed this issue 1 year ago

BBC-Esq commented 1 year ago

Hello again, I'd like to start creating chatbots using ctranslate2 models that eventually work with embedding models (also converted to the ctranslate2 format).

Currently, the smallest quantization types ctranslate2 offers are int8, int8_float32, int8_float16, and int8_bfloat16. I'm aware of your comment that ctranslate2 does not offer anything lower than 8-bit quants: https://github.com/OpenNMT/CTranslate2/issues/1104#issuecomment-1462381140.
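For reference, the conversion I'm running looks roughly like this; just a sketch, and the model name and output directory are example values:

```python
# Sketch: convert a Hugging Face model to the CTranslate2 format with 8-bit weights.
# The model name and output directory are example values.
import ctranslate2

converter = ctranslate2.converters.TransformersConverter("meta-llama/Llama-2-7b-chat-hf")
converter.convert("llama2-7b-chat-ct2", quantization="int8_float16")
```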

Here you can see that I tried to compare ctranslate2 versus ggml/gguf: https://huggingface.co/ctranslate2-4you/Llama-2-7b-chat-hf-ct2-int8/resolve/main/comparison%20of%20ctranslate2%20and%20ggml.png. Ctranslate2 beats the equivalent ggml/gguf model in terms of VRAM/RAM usage. However, my graphic does not compare inference speed. Also, my limited understanding is that anytime quantization occurs "perplexity" is affected...and my graphic doesn't compare "perplexity" either.

My question is, firstly, am I in fact comparing apples to apples? My comparison equates ctranslate2's "int8" with ggml/gguf's "Q8" quant, but I'm unsure whether that's accurate. (Ignore my mistake of treating llama2 as originally being in float32, by the way...) As you can see, ctranslate2 destroys ggml/gguf...even ctranslate2's int8 quant uses less VRAM than ggml/gguf's Q3_K_S...wow.
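As a rough sanity check on the apples-to-apples question: at 8-bit, both formats store about one byte per weight plus some scale overhead, so the expected weight footprint for a 7B model should be similar. A back-of-the-envelope calculation (the Q8_0 block-scale overhead figure is my assumption):

```python
# Back-of-the-envelope weight sizes for a 7B-parameter model.
# The ~6% per-block scale overhead assumed for Q8_0 is an approximation.
params = 7e9

fp16_gb = params * 2 / 1024**3     # fp16 baseline: 2 bytes per weight
int8_gb = params * 1 / 1024**3     # ctranslate2 int8: ~1 byte per weight plus per-channel scales
q8_0_gb = params * 1.06 / 1024**3  # gguf Q8_0: 1 byte per weight plus per-block scales (assumed ~6%)

print(f"fp16 ~= {fp16_gb:.1f} GiB, int8 ~= {int8_gb:.1f} GiB, Q8_0 ~= {q8_0_gb:.1f} GiB")
```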

Anyway, assuming I am comparing apples to apples, are you aware of the speed and/or "perplexity" differences between comparable ggml/gguf and ctranslate2-converted models?
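If it helps, here is roughly how I would estimate "perplexity" for a ctranslate2 model from the per-token log-probabilities returned by its scoring API; just a sketch, assuming a converted Llama-2 directory and the Hugging Face tokenizer:

```python
# Sketch: estimate perplexity of a converted CTranslate2 model on a text sample.
# "llama2-7b-chat-ct2" is an example directory produced by the converter above.
import math
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
generator = ctranslate2.Generator("llama2-7b-chat-ct2", device="cuda")

text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

# score_batch returns per-token log-probabilities for each input sequence.
result = generator.score_batch([tokens])[0]
perplexity = math.exp(-sum(result.log_probs) / len(result.log_probs))
print(f"perplexity ~= {perplexity:.2f}")
```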

For example, I'm only aware of a rough comparison between ggml/gguf and GPTQ...See here: https://huggingface.co/TheBloke/wizardLM-7B-GGML/discussions/3.

My intuition tells me that ctranslate2 beats ggml/gguf on all three metrics: VRAM/RAM, speed, and "perplexity." I've been smitten by ctranslate2 since I saw it absolutely destroy the ggml implementation of Whisper (whisper.cpp). See here: https://github.com/guillaumekln/faster-whisper.

However, before I spend a lot of time (which I don't mind doing) I'm trying to get an accurate idea of how it compares to ggml/gguf (and gptq for that matter).

As it currently stands, assuming that a person uses a model having an architecture that ctranslate2 supports, it seems like they should always use ctranslate2 rather than ggml/gguf/gptq. It seems to just be an issue of ease of implementation...which I hope to do my part to address.
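To illustrate what I mean by ease of implementation, a bare-bones generation call with a converted model is already quite short; a sketch, with the model directory, prompt format, and sampling settings as placeholders:

```python
# Sketch: minimal text generation with a converted CTranslate2 model.
# Model directory, prompt template, and sampling parameters are example values.
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
generator = ctranslate2.Generator("llama2-7b-chat-ct2", device="cuda", compute_type="int8_float16")

prompt = "[INST] Explain quantization in one sentence. [/INST]"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=128,
    sampling_temperature=0.7,
    sampling_topk=40,
    include_prompt_in_result=False,
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```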

BBC-Esq commented 1 year ago

Hello guillaumekln, any information on my questions above by chance? Before I spend a lot of time on this I'm trying to gather some information, and you seem to be the primary person spearheading this project. Thanks!

guillaumekln commented 1 year ago

I don't know enough about GGML or GPTQ to answer.

The only related comparison I conducted was faster-whisper (CTranslate2) vs. whisper.cpp (GGML), but this is a particular case. The Whisper model uses beam search which is known to be poorly optimized in whisper.cpp. Most language models are not executed with beam search. However, whisper.cpp would typically be much faster on Macbooks.
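For reference, beam search is a decoding option that is set at transcription time in faster-whisper, so the speed comparison depends heavily on that setting; a rough sketch (model size, device, and audio path are placeholders):

```python
# Sketch: faster-whisper (CTranslate2) transcription with beam search enabled.
# The model size, device, and audio path are example values.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```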

Regarding CTranslate2 vs GPTQ, this paper reported that CTranslate2 int8 is better than GPTQ int8 on all metrics: memory usage, speed, human evaluation.

BBC-Esq commented 1 year ago

Thanks for the info, very informative.