OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Quantize Llama-2-Chat Models with Weight and Activation Quantization #52

Closed: DRXD1000 closed this issue 6 months ago

DRXD1000 commented 6 months ago

Hi, I would be very grateful if there was a tutorial on how to perform weight and activation quantization on the Llama-2-Chat models and save the resulting models. The code I have used so far does not seem to work, and I cannot find an explanation of how to replicate the results.

ChenMnZ commented 6 months ago

The following command quantizes the model and saves the real (packed) quantized model:

```sh
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model /PATH/TO/MODELS \
  --epochs 20 --output_dir /PATH/TO/LOGS \
  --eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
  --real_quant --save_dir /PATH/TO/SAVE
```
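Here, `--wbits 3 --abits 16` performs 3-bit weight-only quantization (activations stay in 16-bit), `--group_size 128` quantizes weights in groups of 128, and `--lwc` enables learnable weight clipping. `--real_quant` stores the packed low-bit weights instead of fake-quantized FP16 weights, and `--save_dir` sets where the quantized model is written.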
ChenMnZ commented 6 months ago

Additionally, the code only supports real quantization for weight-only quantization. For weight-activation quantization, we only use fake quantization.
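For completeness, a weight-activation run (e.g. W4A4) therefore omits `--real_quant`/`--save_dir` and evaluates with fake quantization. A minimal sketch, assuming the same `main.py` CLI shown above and that this checkout exposes a `--let` flag (learnable equivalent transformation) alongside `--lwc`:

```sh
# Sketch: W4A4 weight-activation quantization (fake quant, evaluation only).
# Assumes the same main.py CLI as above; --let (learnable equivalent
# transformation) may be named differently in your checkout.
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model /PATH/TO/MODELS \
  --epochs 20 --output_dir /PATH/TO/LOGS \
  --eval_ppl --wbits 4 --abits 4 --lwc --let
```

Because this path is fake quantization, the weights on disk stay in FP16; the low-bit behavior is simulated for evaluation rather than saved as a packed model.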