It is weird. I will give it a try.
According to this comment https://github.com/OpenGVLab/OmniQuant/issues/25#issuecomment-1770278455, it might be caused by `--let`, so I will run a comparative experiment that leaves it out. (The issue is still peculiar, though, because w4a4 also uses `--let`.)
Sorry for the confusion. Actually, the command you used is right. For LLaMa weight-only quantization, we only use `--lwc`. For LLaMa weight-activation quantization, we use both `--lwc` and `--let`.
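For reference, here is a minimal sketch of what the two kinds of runs look like (only `--lwc` and `--let` come from this thread; the script name, model path, epochs, and bit-width flags are placeholders for illustration):

```bash
# Weight-only quantization (e.g. W4A16): learnable weight clipping only.
python main.py \
    --model /path/to/llama-7b \
    --wbits 4 --abits 16 \
    --lwc

# Weight-activation quantization (e.g. W4A4): additionally enable the
# learnable equivalent transformation.
python main.py \
    --model /path/to/llama-7b \
    --wbits 4 --abits 4 \
    --lwc --let
```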
Adding the parameters `--let_lr 1e-3` and `--alpha 0.75` resolved the issue for the W4A8 configuration.
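As a sketch, the corrected W4A8 run would look something like this (only `--lwc`, `--let`, `--let_lr 1e-3`, and `--alpha 0.75` are taken from this thread; the other arguments are placeholders):

```bash
# W4A8: lowering the LET learning rate and setting alpha to 0.75
# is what resolved the nan PPL in this case.
python main.py \
    --model /path/to/Llama-2-7b-chat \
    --wbits 4 --abits 8 \
    --lwc --let \
    --let_lr 1e-3 --alpha 0.75
```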
When I perform w4a4 quantization and w4a8 quantization separately on the Llama-2-7B-chat model, w4a8 yields a significantly lower loss than w4a4. However, the PPL of w4a8 is `nan`, while the PPL of w4a4 is 23.7.
Please see the scripts and logs I used to quantize the model:
- w4a4
- w4a8