Example of TheBloke's model quantizations being outdated: https://github.com/vllm-project/vllm/issues/2422#issuecomment-1959439421
The configuration we pass to vLLM should not include `quantization`, as that prevents the automatic Marlin GPTQ (`gptq_marlin`) upgrade, which uses a different kernel for faster inference and lower memory usage. The quantization settings are already defined in each model's quantization_config.json.
Also, `trust_remote_code` refers to the code downloaded as part of the model download, so it can safely be turned on as long as we review the extra Python scripts that come with the model. These scripts usually just tell vLLM how to configure itself for inferencing the model architecture (e.g., Phi-3 GPTQ).
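A minimal sketch of how both settings would look in a vLLM engine setup; the model name is a placeholder, and the note that omitting `quantization` lets vLLM pick the Marlin kernel follows the reasoning above:

```python
from vllm import LLM

# Sketch: configure the vLLM engine without a quantization override.
llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",  # placeholder model
    # No `quantization=` argument: vLLM reads the quantization settings shipped
    # with the model and can auto-select the faster gptq_marlin kernel for
    # compatible GPTQ checkpoints.
    trust_remote_code=True,  # executes the repo's extra Python scripts; review them first
)
```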
The screenshots above show Phi-3-mini-128k-instruct (generally) outperforming all of the Mistral-7b-instruct variants.
Working on an outside spike to create a quantized version of Phi-3-mini-128k-instruct: https://github.com/justinthelaw/gptqmodel-pipeline
Describe what should be investigated or refactored
vLLM is currently not compatible with GPTQ models quantized in BFLOAT16 due to the pinned dependency version (0.4.2). This needs to be upgraded to the next patch version (0.4.3), or fully upgraded to a later minor version (0.5.2).
The following test model should work if this issue has been fixed (fits on RTX 4060 - 4090): https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
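A minimal smoke test, assuming the vLLM pin has been bumped to 0.4.3 or 0.5.2 (the prompt and sampling settings are arbitrary):

```python
# pip install "vllm>=0.4.3"  # or pin 0.5.2
from vllm import LLM, SamplingParams

# Load the BFLOAT16-quantized GPTQ test model linked above; this should succeed
# once the dependency upgrade lands (fits on an RTX 4060 - 4090).
llm = LLM(
    model="TheBloke/phi-2-orange-GPTQ",
    trust_remote_code=True,  # per the note above, review the repo's extra scripts first
)

outputs = llm.generate(["Say hello."], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```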
Links to any relevant code
Example model that wouldn't work, but should: https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
Issue related to the vLLM GPTQ BFLOAT16 PR: https://github.com/vllm-project/vllm/issues/2149
Additional context
This issue was confirmed when deploying Nous-Hermes-2-8x7b-DPO-GPTQ (8-bit, 128g group size, Act Order) to an H100 GPU. Changing the `torch_dtype` in `config.json` to `float16`, despite the loss of precision, allows the model to be inferenced by vLLM.
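A sketch of that workaround, assuming the model has already been downloaded locally (the path below is a placeholder):

```python
import json
from pathlib import Path

# Flip the checkpoint's declared dtype from bfloat16 to float16 so vLLM 0.4.2's
# GPTQ path will accept it. The path is a placeholder for the local model dir.
config_path = Path("/models/Nous-Hermes-2-8x7b-DPO-GPTQ/config.json")

config = json.loads(config_path.read_text())
config["torch_dtype"] = "float16"  # originally "bfloat16"
config_path.write_text(json.dumps(config, indent=2))
```

Passing `dtype="float16"` to the vLLM engine may achieve the same override without editing the file, though that has not been verified on 0.4.2.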