ModelCloud / GPTQModel

Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.

[BUG] Gemma 2 - 27B Regression #140

Closed - Qubitium closed this issue 3 months ago

Qubitium commented 4 months ago

PR https://github.com/ModelCloud/GPTQModel/pull/131 added Gemma 2 support, but in our testing only the 9B models are working. This ticket tracks the 27B post-quantization inference issue.

Qubitium commented 4 months ago

The 27B inference instability appears to be unique to Gemma 2. We have tested the latest Transformers main branch and, despite reports that it is fixed, 27B still cannot pass our tests.

Qubitium commented 3 months ago

Make sure we re-check this inference bug with the latest Transformers Gemma 2 27B fixes in https://github.com/huggingface/transformers/releases/tag/v4.42.4

maxin9966 commented 3 months ago

@Qubitium There is no GPTQ model for gemma-2-27b on HuggingFace. Could you help upload a version?

Qubitium commented 3 months ago

@maxin9966 We have quantized the 27B model, but inference did not pass our PPL (perplexity) quality test, so we did not upload it. There is something special/wrong with the 27B model that is breaking inference.

sparsh35 commented 3 months ago

Are you using vLLM for inference? If so, it may be the attention backend: for Gemma 2 27B you need FlashInfer for attention.

Qubitium commented 3 months ago

@LRL-ModelCloud Re-test 27B inference PPL with vLLM with FlashInfer, and SGLang with FlashInfer (latest).

maxin9966 commented 3 months ago

@Qubitium Thank you very much. I would like to know whether the 27B GPTQ model has passed the test when using FlashInfer.

Qubitium commented 3 months ago

@maxin9966 It's the weekend so we will re-test next week.

Qubitium commented 3 months ago

Gemma 2 27B model download: https://huggingface.co/ModelCloud/gemma-2-27b-it-gptq-4bit

vLLM/SGLang have fixed Gemma 2 27B inference.
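
For anyone who wants to try it quickly, here is a minimal vLLM offline-inference sketch for this checkpoint. The prompt and sampling settings are illustrative, and FlashInfer is enabled via the environment variable discussed above:

```python
from vllm import LLM, SamplingParams

# Run with VLLM_ATTENTION_BACKEND=FLASHINFER set in the environment;
# Gemma 2 needs the FlashInfer attention backend.
# GPTQ checkpoints run in fp16 under vLLM's gptq quantization path.
llm = LLM(
    model="ModelCloud/gemma-2-27b-it-gptq-4bit",
    quantization="gptq",
    dtype="float16",
)

outputs = llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```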

yechenzhi commented 3 months ago

Can we use GPTQModel to quantize our Gemma-2-27B fine-tuned models? For example, I fine-tune a model based on gemma-2-27b-it, then use GPTQModel to quantize it to 4-bit, and finally use vLLM for inference. Is this OK?

Qubitium commented 3 months ago

@yechenzhi Absolutely. https://huggingface.co/ModelCloud/gemma-2-27b-it-gptq-4bit was quantized with GPTQModel, and you can run inference using GPTQModel by passing backend=BACKEND.VLLM in from_quantized for vLLM inference.
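
A minimal sketch of that flow, assuming the from_quantized signature and BACKEND enum of a recent GPTQModel release; the prompt and the generate() call are illustrative and may differ by version/backend:

```python
from gptqmodel import BACKEND, GPTQModel

# Load the already-quantized checkpoint and route inference through vLLM
# via the backend flag mentioned above.
model = GPTQModel.from_quantized(
    "ModelCloud/gemma-2-27b-it-gptq-4bit",
    backend=BACKEND.VLLM,
)

# Illustrative generation call; the exact signature depends on the
# GPTQModel version and the selected backend.
print(model.generate("Why is the sky blue?"))
```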

maxin9966 commented 3 months ago

@Qubitium Thank you very much. 👍

I have a few more questions. Could you please guide me?

  1. Between sglang and vllm, which one is more suitable for a production environment? How much better is the throughput of sglang compared to vllm in general? Can sglang's stability support a production environment?
  2. Between flashinfer and flashattn2, which one performs better?

Qubitium commented 3 months ago

@maxin9966

  1. You have to benchmark these yourself. Both are fast, and depending on the model, one is faster than the other. They are very different when it comes to KV caching and re-use.
  2. Again, you have to benchmark this yourself.

king398 commented 3 months ago

Could you please share your script for quantizing Gemma 27B? I have been trying to quantize it, but I keep getting the error: torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite)

Qubitium commented 3 months ago

> Could you please share your script for quantizing Gemma 27B? I have been trying to quantize it, but I keep getting the error: torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite)

@king398 Do the following (see the sketch after this list):

  1. Increase the damp value.
  2. Make sure you have at least 512/1024 rows of data in calibration.
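
A minimal sketch of those two fixes with GPTQModel. Method and config names (QuantizeConfig, damp_percent, load/quantize/save) follow a recent release and may be from_pretrained/save_quantized in older versions; the C4 calibration slice is just an illustration:

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# 1024 calibration rows, at or above the suggested 512/1024 minimum.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    # Raised damp value; the "not positive-definite" Cholesky error usually
    # means damp is too low for the calibration data.
    damp_percent=0.05,
)

model = GPTQModel.load("google/gemma-2-27b-it", quant_config)
model.quantize(calibration)
model.save("gemma-2-27b-it-gptq-4bit")
```
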
maxin9966 commented 3 months ago

@Qubitium Why does running gemma-2-it-27b-gptq through vLLM produce all outputs as ? I have tested different versions of flashinfer and vllm, and the results are the same.

vllm:

VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model ModelCloud/gemma-2-27b-it-gptq-4bit --gpu-memory-utilization 0.9 --quantization gptq --host 0.0.0.0 --port 1231 -tp 1 --dtype float16 --served-model-name gpt --trust-remote-code --enable-prefix-caching --enforce-eager

output:

[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='

maxin9966 commented 3 months ago

[screenshot attached]

Qubitium commented 3 months ago

Usually when I see this, the first thing I would look at is your tokenizer. Check that you have the latest transformers with the tokenizer fixes and the latest vLLM. Lastly, check that your bos, eos, and pad tokens are all correct for the Gemma 2 it model. If any of the bos/eos/pad tokens is incorrectly set, you get bad output.
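
A quick way to check the tokens, sketched with the Hugging Face tokenizer for the quant repo; the expected special tokens come from the Gemma 2 tokenizer config and are worth verifying against the base model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ModelCloud/gemma-2-27b-it-gptq-4bit")

# Gemma 2 instruct models ship with <bos>/<eos>/<pad> special tokens; if any
# of these is None or points at the wrong id, generation quality collapses.
print("bos:", tok.bos_token, tok.bos_token_id)
print("eos:", tok.eos_token, tok.eos_token_id)
print("pad:", tok.pad_token, tok.pad_token_id)

# The chat template should insert <start_of_turn>/<end_of_turn> markers.
print(tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
))
```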

sparsh35 commented 3 months ago

It could also be an issue with the dtype used: the model was trained with bfloat16 by Google and is known to produce these outputs when using any other dtype, like float16.

Qubitium commented 3 months ago

As @sparsh35 noted, if the model is bf16 and you use fp16, there is a conversion loss.
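
A tiny illustration of why the downcast matters: bf16 keeps fp32's exponent range but has a coarse mantissa, while fp16 tops out around 65504, so bf16 values can overflow to inf when cast to fp16.

```python
import torch

# bf16: 8-bit exponent (fp32 range), 7-bit mantissa.
# fp16: more mantissa bits, but max finite value is 65504.
x = torch.tensor([70000.0], dtype=torch.bfloat16)
print(x)                    # ~70144 in bf16 (coarse mantissa)
print(x.to(torch.float16))  # inf: exceeds the fp16 range
```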

maxin9966 commented 3 months ago

@Qubitium VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model ModelCloud/gemma-2-27b-it-gptq-4bit --gpu-memory-utilization 0.9 --quantization gptq --host 0.0.0.0 --port 1231 -tp 1 --dtype bfloat16 --served-model-name gpt --trust-remote-code --enable-prefix-caching --enforce-eager

error: ValueError: torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16]

Qubitium commented 3 months ago

@maxin9966 You're right. GPTQ models are only fp16 capable. We will check what's going on with vLLM and this gemma-2-27b quant.