defenseunicorns / leapfrogai

Production-ready Generative AI for local, cloud native, airgap, and edge deployments.
https://leapfrog.ai
Apache License 2.0

feat(vllm): upgrade vllm and expose more params for bfloat16 quant compatibility #835

Closed · justinthelaw closed this issue 1 month ago

justinthelaw commented 3 months ago

Describe what should be investigated or refactored

vLLM is currently not compatible with all GPTQ BFLOAT16-quantized models due to the pinned dependency version (0.4.2). This needs to be upgraded to at least the next patch version (0.4.3), or fully upgraded to a later minor version (0.5.2).

The following test model should work once this issue is fixed (it fits on an RTX 4060 through 4090): https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
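
A minimal smoke test along these lines, assuming the vLLM Python API is available after the dependency bump (the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the GPTQ test model with its native bfloat16 dtype; on vllm 0.4.2 this
# fails for GPTQ checkpoints, and it should succeed after the upgrade.
llm = LLM(model="TheBloke/phi-2-orange-GPTQ", dtype="bfloat16")

outputs = llm.generate(["What is LeapfrogAI?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```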

Links to any relevant code

Example of a model that currently fails but should work: https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json

Issue related to the vLLM GPTQ BFLOAT16 PR: https://github.com/vllm-project/vllm/issues/2149

Additional context

This issue was confirmed when deploying Nous-Hermes-2-8x7b-DPO-GPTQ (8-bit, 128g group size, Act Order) to an H100 GPU. Changing the dtype in config.json to float16, despite the loss of precision, allows vLLM to serve the model.
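
Until the upgrade lands, a hedged sketch of that workaround, forcing the dtype at load time rather than editing config.json on disk (shown with the phi-2-orange test model linked above):

```python
from vllm import LLM

# Force float16 instead of the checkpoint's bfloat16 torch_dtype so the GPTQ
# path in vllm 0.4.2 accepts the model; this trades some precision for compatibility.
llm = LLM(model="TheBloke/phi-2-orange-GPTQ", dtype="float16")
```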

justinthelaw commented 3 months ago

Example of TheBloke's model quantizations being outdated: https://github.com/vllm-project/vllm/issues/2422#issuecomment-1959439421

justinthelaw commented 3 months ago

The configuration we pass to vLLM should not set quantization explicitly, as that prevents vLLM from automatically selecting the Marlin GPTQ kernel, which uses a different algorithm for faster inference and lower memory usage. The quantization settings are already defined in each model's quantization_config.json.
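
As a sketch of the intended behavior (parameter names per the vLLM Python API; the model id is just the test model from above):

```python
from vllm import LLM

# Leaving `quantization` unset lets vLLM read the checkpoint's quantization
# config and upgrade to the faster Marlin GPTQ kernel when the GPU supports it.
llm = LLM(model="TheBloke/phi-2-orange-GPTQ", dtype="float16")

# By contrast, passing quantization="gptq" explicitly pins the reference GPTQ
# kernel and skips the Marlin fast path:
# llm = LLM(model="TheBloke/phi-2-orange-GPTQ", dtype="float16", quantization="gptq")
```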

Also, trust_remote_code refers to code downloaded alongside the model weights, so it can safely be turned on as long as we review the extra Python scripts that come with the download. These scripts usually just describe how to configure inference for the model's architecture (e.g., Phi-3 GPTQ).
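
One illustrative way to do that review, assuming huggingface_hub is available (the repo id is just an example of a model that ships remote code):

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Pull only the repo's Python files so the custom modeling/configuration code
# can be inspected before enabling trust_remote_code.
local_dir = snapshot_download("microsoft/Phi-3-mini-128k-instruct", allow_patterns=["*.py"])
for script in sorted(Path(local_dir).glob("*.py")):
    print(script.name)
```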

justinthelaw commented 3 months ago

Screenshots (2024-07-30) of benchmark comparisons: Phi-3-mini-128k-instruct generally outperforms all of the Mistral-7b-instruct variants.

Working on an outside spike to create a quantized version of Phi-3-mini-128k-instruct: https://github.com/justinthelaw/gptqmodel-pipeline
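
For reference, a rough sketch of that kind of quantization pass using AutoGPTQ (the linked spike may use a different library and calibration data; the settings and calibration text here are illustrative only):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "microsoft/Phi-3-mini-128k-instruct"
out_dir = "Phi-3-mini-128k-instruct-GPTQ"

# Common GPTQ settings: 4-bit weights, 128 group size, activation reordering.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, trust_remote_code=True)

# A real run would use a representative calibration dataset, not a single sentence.
examples = [tokenizer("LeapfrogAI is a self-hosted generative AI platform.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)
```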