27B model inference instability appears to be something unique to Gemma 2. We have tested the latest Transformers main, and even though there are reports of it being fixed, 27B still cannot pass our tests.
Make sure we re-check this inference bug with the latest Transformers Gemma 2 27B fixes in https://github.com/huggingface/transformers/releases/tag/v4.42.4
@Qubitium There is no GPTQ model for gemma-2-27b on HuggingFace. Could you help upload a version?
@maxin9966 We have quantized the 27B model, but inference did not pass our perplexity (PPL) quality test, so we did not upload it. There is something special/wrong with the 27B model that is breaking inference.
Are you using vLLM for inference? If so, it may be the attention backend; for Gemma 2 27B you need FlashInfer for attention.
@LRL-ModelCloud Re-test 27B inference PPL with vLLM with FlashInfer, and SGLang with FlashInfer (latest).
@Qubitium Thank you very much. I would like to know if the 27B GPTQ model has passed the test when using FlashInfer.
@maxin9966 It's the weekend so we will re-test next week.
Gemma 2 27B model download: https://huggingface.co/ModelCloud/gemma-2-27b-it-gptq-4bit
vLLM/SGLang have fixed Gemma 2 27B inference.
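For anyone re-testing, here is a minimal offline vLLM sketch with FlashInfer enabled (the prompt and sampling settings are illustrative; the environment variable must be set before vLLM initializes its attention backend):

```python
# Hedged sketch: offline vLLM inference for the GPTQ quant with the
# FlashInfer attention backend that Gemma 2's attention requires.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # set before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="ModelCloud/gemma-2-27b-it-gptq-4bit",
    quantization="gptq",
    dtype="float16",  # vLLM's GPTQ kernels are fp16-only (see the error later in this thread)
)
outputs = llm.generate(
    ["Why is the sky blue?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```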
Can we use GPTQModel to quantize our gemma-2-27B finetuned models? For example, if I fine-tune a model based on gemma-2-27b-it, then use GPTQModel to quantize it to 4-bit, and finally use vLLM for inference, is this OK?
@yechenzhi Absolutely. https://huggingface.co/ModelCloud/gemma-2-27b-it-gptq-4bit was quantized with GPTQModel, and you can run inference with GPTQModel by passing backend=BACKEND.VLLM to from_quantized for vLLM inference.
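A minimal sketch of that loading path, assuming BACKEND is importable from the top-level gptqmodel package (the generate call below is an assumption; check the GPTQModel README for the exact inference API of the vLLM backend):

```python
# Hedged sketch: load the quantized checkpoint with GPTQModel and route
# inference through vLLM by passing backend=BACKEND.VLLM to from_quantized.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.from_quantized(
    "ModelCloud/gemma-2-27b-it-gptq-4bit",
    backend=BACKEND.VLLM,  # dispatch the forward pass to vLLM
)

# Assumed generation call; the signature may differ between versions.
print(model.generate("Why is the sky blue?"))
```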
@Qubitium Thank you very much. 👍
I have a few more questions. Could you please guide me?
@maxin9966
Could you please share your script for quantizing Gemma 2 27B? I have been trying to quantize it, but I keep getting the error:
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite)
@king398 Do the following:
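For anyone else hitting this Cholesky error, a hedged sketch of a GPTQModel quantization script (assuming the from_pretrained / quantize / save_quantized API and a damp_percent knob in QuantizeConfig; raising damp_percent and using more calibration data are the usual remedies, not necessarily the exact steps recommended here):

```python
# Hedged sketch: quantize a Gemma 2 27B finetune with GPTQModel.
# Assumes QuantizeConfig exposes damp_percent; raising it (and adding more
# calibration rows) is the common fix for the
# "linalg.cholesky ... not positive-definite" failure.
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

model_id = "google/gemma-2-27b-it"
out_dir = "gemma-2-27b-it-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration_dataset = [
    tokenizer(text)
    for text in [
        "GPTQ calibrates against representative text.",
        "Use a few hundred varied samples in practice.",
    ]
]

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.05,  # assumed knob; bump further if Cholesky still fails
)

model = GPTQModel.from_pretrained(model_id, quant_config)
model.quantize(calibration_dataset)
model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)
```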
@Qubitium Why does running gemma-2-27b-it-gptq through vLLM produce all outputs like the following?
VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model ModelCloud/gemma-2-27b-it-gptq-4bit --gpu-memory-utilization 0.9 --quantization gptq --host 0.0.0.0 --port 1231 -tp 1 --dtype float16 --served-model-name gpt --trust-remote-code --enable-prefix-caching --enforce-eager
[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='
Usually when I see this, the first thing I would look at is your tokenizer. Check that you have the latest Transformers with the tokenizer fixes and the latest vLLM. Lastly, check that your bos, eos, and pad tokens are all correct for the Gemma 2 it model. If any of the bos/eos/pad tokens is set incorrectly, you get bad output.
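A quick way to sanity-check those special tokens (a sketch using the Hugging Face tokenizer shipped with the quant; the expected values should match the upstream google/gemma-2-27b-it tokenizer config):

```python
# Hedged sketch: print the bos/eos/pad special tokens the tokenizer reports
# so they can be compared against the upstream gemma-2-27b-it settings.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ModelCloud/gemma-2-27b-it-gptq-4bit")
for name in ("bos_token", "eos_token", "pad_token"):
    token = getattr(tok, name)
    token_id = getattr(tok, f"{name}_id")
    print(f"{name}: {token!r} (id={token_id})")
```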
It could also be an issue with the dtype used. The model was trained in bfloat16 by Google and is known to produce these outputs when run in any other dtype, such as float16.
Like @sparsh35 noted, if the model is bf16 and you run it in fp16, there is a conversion loss.
@Qubitium VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model ModelCloud/gemma-2-27b-it-gptq-4bit --gpu-memory-utilization 0.9 --quantization gptq --host 0.0.0.0 --port 1231 -tp 1 --dtype bfloat16 --served-model-name gpt --trust-remote-code --enable-prefix-caching --enforce-eager
error: ValueError: torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16]
@maxin9966 You're right. GPTQ models are only fp16-capable. We will check what's going on with vLLM and this gemma-2-27b quant.
PR https://github.com/ModelCloud/GPTQModel/pull/131 added Gemma 2 support, but in our testing only the 9B models are working. This ticket is to track the 27B post-quantization inference issue.