Just a curious question I suppose!
GPTQ 4bit - https://github.com/qwopqwop200/GPTQ-for-LLaMa
Suppose someone eventually fine-tunes the 175B OPT model (with LoRAs or regular fine-tuning), or perhaps the BLOOM or BLOOMZ model. Would running inference with GPTQ make it possible to run the model on 4 GB VRAM and 50 GB DRAM?
The compression in FlexGen has computation overhead, so it is not always better to turn it on. For large models like 175B, which involve disk swap, it is usually better to turn on both weight and cache compression.
GPTQ 4bit has not been implemented in FlexGen.
Even if you use 4-bit quantization, the weights of a 175B model still occupy ~90 GB of memory. 4 GB VRAM and 50 GB DRAM is not sufficient.
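For a rough sense of where the ~90 GB figure comes from, here is a back-of-the-envelope estimate (my own sketch, not FlexGen or GPTQ code; it ignores activations, the KV cache, and quantization metadata such as scales and zero-points, which all add more on top):

```python
# Rough weight-memory estimate for a 175B-parameter model at 4 bits/param.
# Assumption: every parameter is stored at 4 bits; metadata is ignored.
params = 175e9
bits_per_param = 4
weight_bytes = params * bits_per_param / 8
print(f"~{weight_bytes / 1e9:.1f} GB just for the 4-bit weights")  # ~87.5 GB
```

So the weights alone already exceed 50 GB of DRAM plus 4 GB of VRAM, before counting the KV cache.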