OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Quantize LLAMA-2-7b-chat to W4A4 #37

Open nmyuchen opened 7 months ago

nmyuchen commented 7 months ago

Hello, and thank you for your efforts! I ran into an issue while attempting to quantize the Llama-2-7b-chat model to W4A4. I used the command below.

CUDA_VISIBLE_DEVICES=0 python main.py \
--model meta-llama/Llama-2-7b-chat-hf --eval_ppl \
--epochs 20 --output_dir ./log/Llama-2-7b-chat-w4a4 \
--wbits 4 --abits 4 --lwc --let \
--let_lr 1e-3 --alpha 0.75

However, the outcome was not as expected. The perplexity (PPL) on WikiText-2 came out at 37, which is unsatisfactory. Additional results are provided below.

INFO load calibration from ./cache/testloader_Llama_wikitext2_all.cache
INFO wikitext2 : 37.00777053833008
INFO load calibration from ./cache/testloader_Llama_ptb_all.cache
INFO ptb : 150.4561767578125
INFO load calibration from ./cache/testloader_Llama_c4_all.cache
INFO c4 : 46.19054412841797
INFO load calibration from ./cache/testloader_Llama_ptb-new_all.cache
INFO ptb-new : 572.8397216796875
INFO load calibration from ./cache/testloader_Llama_c4-new_all.cache
INFO c4-new : 50.049354553222656

Could you please offer some guidance on adjusting the hyperparameters for Llama-2-7b-chat so that I can achieve results comparable to your Llama-2-7b W4A4 model? Your assistance would be greatly appreciated. Thank you.

ChenMnZ commented 7 months ago

@nmyuchen You can try setting --epochs to 40, which can significantly improve the performance of LLaMA-2-7B W4A4.

Also, I will try to find appropriate hyperparameters for training with 20 epochs.
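
For reference, the full command would then look something like this; only --epochs changes, and the other flags and values are simply carried over from your original command rather than tuned settings:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model meta-llama/Llama-2-7b-chat-hf --eval_ppl \
--epochs 40 --output_dir ./log/Llama-2-7b-chat-w4a4 \
--wbits 4 --abits 4 --lwc --let \
--let_lr 1e-3 --alpha 0.75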

nmyuchen commented 7 months ago

@ChenMnZ Thank you for your response. Could you also recommend hyperparameter settings for W8A8 quantization of "Llama-2-7b-chat"?

ChenMnZ commented 7 months ago

For Llama-2-7b-chat with W8A8 quantization, 10 epochs is enough.

For learning rate, try 1e-3 or 2e-3.

For alpha, try 0.5 or 0.75.
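
Putting these together, a W8A8 run could look something like the following. The --lwc/--let flags and the remaining values are just carried over from the W4A4 command above, 1e-3 / 0.5 is one of the suggested combinations, and the output directory name is only an example:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model meta-llama/Llama-2-7b-chat-hf --eval_ppl \
--epochs 10 --output_dir ./log/Llama-2-7b-chat-w8a8 \
--wbits 8 --abits 8 --lwc --let \
--let_lr 1e-3 --alpha 0.5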

Shunbrea commented 7 months ago

@ChenMnZ It appears that you have implemented a W8A8 model. Would you be willing to share the pre-trained model on Hugging Face? Thank you!