Closed — puppetm4st3r closed this issue 1 month ago
Please remove the --quantization awq part and try again.
Will try.
@AlpinDale nope, same stack trace. I also checked with an AWQ model that is not a Mixtral-style MoE and it works like a charm, including with --quantization awq plus --load-in-4bit. I noticed a significant increase in token generation speed compared to loading the AWQ model without the --load-in-4bit parameter.
How large is the speed increase when adding --load-in-4bit to AWQ models, and did you notice it on all models? Also, does it affect generation quality at all?
@SalomonKisters on a 4090 I would say the speed increase is noticeable to the naked eye (I haven't benchmarked it yet).
Okay, sounds nice. So you just use AWQ quantized models with "--quantization awq --load-in-4bit"?
Yes, but it is not working for MoEs; for MoEs, GPTQ is the best option right now, I think...
Ah, makes sense. But for GPTQ models it doesn't work combined with --load-in-4bit, right?
right
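For anyone skimming the thread, the combination discussed above boils down to an invocation along these lines (a minimal sketch; the model name is a placeholder, and the flags are the ones quoted in the full command at the bottom of this issue):

python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 3000 --model <your-awq-model> --dtype float16 --quantization awq --load-in-4bit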
Any update? I have the same error message.
As of v0.6.0, the --load-in-{4bit,8bit,smooth} args are removed. Please use -q fp8 instead.
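On those newer versions the launch would look something like this (a sketch assuming the same OpenAI-compatible entry point; the model name is a placeholder, and only the quantization flag changes):

python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 3000 --model <your-model> -q fp8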
Your current environment
That's the output of my host (I'm running the engine with the official Docker image).
🐛 Describe the bug
When I try to load an AWQ-quantized model with --load-in-4bit and the model is a Mixtral-style MoE, it throws the following stack trace:
Entry point command executed inside the Docker container:
python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 3000 --download-dir /data/hub --model macadeliccc/laser-dolphin-mixtral-4x7b-dpo-AWQ --dtype float16 --kv-cache-dtype fp8_e5m2 --max-model-len 12000 --tensor-parallel-size 2 --gpu-memory-utilization .98 --enforce-eager --block-size 8 --max-paddings 512 --port 3000 --swap-space 10 --chat-template /home/workspace/chat_templates/chat_ml.jinja --served-model-name dolf --max-context-len-to-capture 512 --max-num-batched-tokens 32000 --max-num-seqs 62 --quantization awq --load-in-4bit