Ross-Fan opened this issue 2 weeks ago
Same problem here, using vLLM with kaitchup/Phi-3-medium-4k-instruct-awq-4bit.
Same issue. I pushed the quantized model here: https://huggingface.co/bjaidi/Phi-3-medium-128k-instruct-awq
Tested with vLLM 0.4.2. I also compared against a GPTQ quant on vLLM, and that worked fine: https://huggingface.co/Rakuto/Phi-3-medium-4k-instruct-gptq-4bit
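For reference, loading the AWQ quant in vLLM looks roughly like this (a minimal sketch; the sampling parameters and the max_model_len cap are illustrative assumptions, not the exact values used):

```python
from vllm import LLM, SamplingParams

# Load the AWQ quant linked above; max_model_len is capped here as an
# illustrative assumption so the 128k context doesn't exhaust KV-cache memory.
llm = LLM(
    model="bjaidi/Phi-3-medium-128k-instruct-awq",
    quantization="awq",
    trust_remote_code=True,
    max_model_len=4096,
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is AWQ quantization?"], sampling)
print(outputs[0].outputs[0].text)
```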
Phi-3-medium-128k-instruct was quantized with AutoAWQ using this quant config:
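This is the usual AutoAWQ 4-bit setup; the quant_config values shown here (zero_point=True, q_group_size=128, version "GEMM") are the common defaults and may differ from what was actually used:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "microsoft/Phi-3-medium-128k-instruct"
quant_path = "Phi-3-medium-128k-instruct-awq"

# Common AutoAWQ 4-bit defaults -- assumed, not necessarily the exact values used.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Runs AWQ calibration and quantizes the weights.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```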
Then, I ran generator.py roughly as follows:
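A minimal sketch following AutoAWQ's examples/generate.py pattern; the prompt and generation parameters are placeholders:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "Phi-3-medium-128k-instruct-awq"

# fuse_layers=False: with fused layers the model would not load on an A100 40GB.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "What is AWQ quantization?"  # placeholder prompt
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

model.generate(tokens, streamer=streamer, max_new_tokens=256)
```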
I set fuse_layers=False, because otherwise the GPU (A100 40GB) can't load the model.
So, any tips for this issue?