Closed: francescotaioli closed this issue 1 week ago
Unfortunately, the --load-in-{4bit,8bit,smooth} quants are broken in the rc_054 branch. If you have an Ampere (sm_80) or newer GPU, you can use -q fp8 instead; it works the same but is much faster. Otherwise, you'll need to quantize the model with GPTQ or EETQ beforehand. Sorry for the inconvenience.
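As a sketch, the fp8 route could reuse the invocation from the bug report in this thread, with --load-in-8bit swapped for -q fp8 (an untested assumption based on the flags mentioned above; adjust the model name and parallelism to your setup):

```shell
# Hypothetical sketch: on-the-fly fp8 quantization on an sm_80+ GPU.
# Same command as the original report, but with --load-in-8bit
# replaced by -q fp8 as suggested.
CUDA_VISIBLE_DEVICES="0,1" aphrodite run mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype=float16 \
  --tensor-parallel-size=2 \
  --gpu-memory-utilization 0.6 \
  -q fp8
```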
Ah right, I forgot to mention: the rc_054 branch also supports other on-the-fly quant methods, but they're not as performant at the ~4-bit range yet.
For bitsandbytes (NF4):
--load-format bitsandbytes -q bitsandbytes
For deepspeedfp (4bit, 6bit, 8bit, 12bit):
-q deepspeedfp --deepspeed_fp_bits 6
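Putting those flags together with the command from the original report, full invocations might look like the following (untested sketch; the model name and parallelism settings are carried over from this thread, not verified against rc_054):

```shell
# Hypothetical sketch: bitsandbytes (NF4) on-the-fly quantization.
CUDA_VISIBLE_DEVICES="0,1" aphrodite run mistralai/Mistral-7B-Instruct-v0.3 \
  --tensor-parallel-size=2 --gpu-memory-utilization 0.6 \
  --load-format bitsandbytes -q bitsandbytes

# Hypothetical sketch: deepspeedfp at 6 bits.
CUDA_VISIBLE_DEVICES="0,1" aphrodite run mistralai/Mistral-7B-Instruct-v0.3 \
  --tensor-parallel-size=2 --gpu-memory-utilization 0.6 \
  -q deepspeedfp --deepspeed_fp_bits 6
```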
Thanks @AlpinDale for the answer.
Your current environment
🐛 Describe the bug
Installation from rc 054:
git clone -b rc_054 https://github.com/PygmalionAI/aphrodite-engine.git
Running
CUDA_VISIBLE_DEVICES="0,1" aphrodite run mistralai/Mistral-7B-Instruct-v0.3 --dtype=float16 --tensor-parallel-size=2 --gpu-memory-utilization 0.6 --load-in-8bit
causes AttributeError: 'QKVParallelLinear' object has no attribute 'state'
Same error with
CUDA_VISIBLE_DEVICES="0,1" aphrodite run mistralai/Mistral-7B-Instruct-v0.3 --dtype=float16 --tensor-parallel-size=1 --gpu-memory-utilization 0.6 --load-in-8bit
Same error with different models (e.g. Llama 3.1).