OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Results Errors #25

Closed yileijin closed 10 months ago

yileijin commented 10 months ago

Could you please tell me how you got your perplexity (ppl) results on WikiText2 for the LLaMA models? I reused your checkpoints but found some disparities.

ChenMnZ commented 10 months ago

You can reproduce the results by:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 0 --output_dir ./log/test \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/Pretrained/Parameters 

Maybe you can provide more details about the command you used and the results you got.

yileijin commented 10 months ago

parser.add_argument("--model", default="huggyllama/llama-7b", type=str, help="model name of model path") parser.add_argument("--cache_dir", default="./cache", type=str, help="cache dir of dataset, leading to faster debug") parser.add_argument("--output_dir", default="./log/llama7b_w4a16g128", type=str, help="direction of logging file") parser.add_argument("--save_dir", default='./ckpt/llama7b_test', type=str, help="direction for saving fake quantization model") parser.add_argument("--real_quant", default=False, action="store_true",) parser.add_argument("--calib_dataset",type=str,default="wikitext2", choices=["wikitext2", "ptb", "c4", "mix","pile"], help="Where to extract calibration data from.", )

parser.add_argument("--omni_resume", type=str, default='./llama-7b-w4a16g128.pth)

parser.add_argument("--nsamples", type=int, default=128, help="Number of calibration data samples.") parser.add_argument("--batch_size", type=int, default=16, help="batch size.") parser.add_argument("--epochs", type=int, default=0)

parser.add_argument("--seed", type=int, default=2, help="Seed for sampling the calibration data.") parser.add_argument("--tasks", default='') parser.add_argument("--eval_ppl", default=True, action="store_true") parser.add_argument("--num_fewshot", type=int, default=0)

parser.add_argument("--wbits", type=int, default=4) parser.add_argument("--abits", type=int, default=16) parser.add_argument("--group_size", type=int, default=128)

parser.add_argument("--alpha", type=float, default=0.5) parser.add_argument("--let_lr", type=float, default=5e-3) parser.add_argument("--lwc_lr", type=float, default=1e-2) parser.add_argument("--wd", type=float, default=0)

parser.add_argument("--let",default=True, action="store_true",help="activate learnable equivalent transformation") parser.add_argument("--lwc",default=True, action="store_true",help="activate learnable weight clipping") parser.add_argument("--aug_loss", default=False, action="store_true", help="calculate additional loss with same input") parser.add_argument("--symmetric", default=False, action="store_true", help="symmetric quantization")

You see, the final result of your released checkpoint on Hugging Face is:

[2023-10-11 00:50:31 root] (main.py 157): INFO {'results': {'wikitext': {'word_perplexity': 22.08906210357998, 'byte_perplexity': 1.8170466512610957, 'bits_per_byte': 0.8615954601218602}}, 'versions': {'wikitext': 1}, 'config': {'model': <models.LMClass.LMClass object at 0x7fb87d2a5250>, 'model_args': None, 'num_fewshot': 0, 'limit': None, 'bootstrap_iters': 100000, 'description_dict': None}}

I have tested this several times, and I'm doing work inspired by yours, so I need to figure out exactly what is going on.

yileijin commented 10 months ago

Sorry, not that result; this is on the whole WikiText2. The w4a16 (without grouping) result is:

[2023-10-09 22:45:41 root] (main.py 146): INFO wikitext2 : 6.04078483581543

yileijin commented 10 months ago

And the w3a16g128 result:

[2023-10-10 02:02:05 root] (main.py 146): INFO wikitext2 : 6.360328197479248

The w3a16 results (I even tested twice):

[2023-10-10 19:14:14 root] (main.py 146): INFO wikitext2 : 9.61421012878418
[2023-10-10 19:34:51 root] (main.py 146): INFO wikitext2 : 9.61421012878418

ChenMnZ commented 10 months ago

I have tested the checkpoints on this codebase with the following command just now:

CUDA_VISIBLE_DEVICES=2 python main.py \
--model /cpfs01/user/chenmengzhao/llama_quantization/llama-hf/llama-7b  \
--output_dir ./log/test \
--epochs 0 --nsamples 128 \
--wbits 3 --abits 16 --group_size 128 --lwc --aug_loss --eval_ppl \
--lwc_lr 0 \
--resume /cpfs01/user/chenmengzhao/prompt_quantization/OmniQuant/huggingface/OmniQuant/llama-7b-w3a16g128.pth

and the result is:

[2023-10-19 05:29:03 root] (main.py 146): INFO wikitext2 : 6.15703010559082

So I'm not sure what the problem is on your side. Perhaps you have changed some code that leads to this problem.

yileijin commented 10 months ago

Thanks, I will re-download the project and try again.

yileijin commented 10 months ago

Sorry, here are the results for w4a16g128. I just re-cloned your project without any changes, re-downloaded the checkpoint from Hugging Face, and then tested your checkpoint:

[2023-10-19 15:48:05 root] (main.py 262): INFO Namespace(model='huggyllama/llama-7b', cache_dir='./cache', output_dir='./log/llama7b_w4a16g128', save_dir='./ckpt/llama7b_test', real_quant=False, calib_dataset='wikitext2', resume='/root/autodl-fs/llama-7b-w4a16g128.pth', nsamples=128, batch_size=8, epochs=0, seed=2, tasks='', eval_ppl=True, num_fewshot=0, wbits=4, abits=16, group_size=128, alpha=0.5, let_lr=0.008, lwc_lr=0.01, wd=0, let=True, lwc=True, aug_loss=False, symmetric=False, a_dynamic_method='per_token', w_dynamic_method='per_channel', limit=-1, multigpu=False, deactive_amp=False, net=None, act_scales='/root/autodl-fs/OmniQuant/act_scales/llama-7b.pt', act_shifts='/root/autodl-fs/OmniQuant/act_shifts/llama-7b.pt')
[2023-10-19 15:48:15 huggingface_hub.utils._http] (_http.py 271): WARNING '(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /huggyllama/llama-7b/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f22eb4f76d0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 56ae5dab-f771-40c2-8fde-aeefada63b9b)')' thrown while requesting HEAD https://huggingface.co/huggyllama/llama-7b/resolve/main/config.json
(four more identical connection-timeout warnings for tokenizer_config.json, adapter_config.json, and generation_config.json omitted)
[2023-10-19 15:48:56 root] (main.py 327): INFO === start quantization ===
[2023-10-19 15:48:56 root] (main.py 333): INFO load calibration from ./cache/dataloader_llama_wikitext2_128.cache
[2023-10-19 15:48:57 root] (omniquant.py 30): INFO Starting ...
[2023-10-19 15:49:02 root] (omniquant.py 158): INFO === Start quantize layer 0 ===
(layers 1 through 30 quantized likewise)
[2023-10-19 15:49:21 root] (omniquant.py 158): INFO === Start quantize layer 31 ===
[2023-10-19 15:49:22 root] (main.py 356): INFO 25.400123119354248
[2023-10-19 15:50:02 root] (main.py 102): INFO load calibration from ./cache/testloader_llama_wikitext2_all.cache
[2023-10-19 15:51:48 root] (main.py 146): INFO wikitext2 : 5.820586681365967

One thing is that the LLaMA model has been updated with a new modeling_llama.py. My paper will be published later; I will report my test results in the paper and invite you to check them.

ChenMnZ commented 10 months ago

From the above command, I think I have found your problem.

You should not use --let in the command. For the quantization of LLaMA, we only use --lwc.
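
For example, following the same pattern as the reproduce command above, evaluating the w4a16g128 checkpoint with LWC only would look like this (paths are placeholders):

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b \
--epochs 0 --output_dir ./log/test \
--eval_ppl --wbits 4 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/llama-7b-w4a16g128.pth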

You will obtain the correct results once you discard --let.
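
Also note that in the settings you posted, --let is declared with default=True together with action="store_true", so LET stays enabled even when the flag is omitted on the command line. A minimal standalone sketch of this argparse behavior (flag names as in your snippet):

import argparse

# With action="store_true", passing the flag can only set the value to True;
# a default of True therefore makes the option impossible to disable from the CLI.
parser = argparse.ArgumentParser()
parser.add_argument("--let", default=True, action="store_true")
parser.add_argument("--lwc", default=True, action="store_true")

print(parser.parse_args([]))         # Namespace(let=True, lwc=True)
print(parser.parse_args(["--lwc"]))  # Namespace(let=True, lwc=True)

# Keeping the upstream default of False and passing --lwc explicitly
# restores the intended opt-in behavior.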

Thanks for following our work. You can add me on WeChat (id: chenmnz1) for further discussion.

brisker commented 10 months ago

@ChenMnZ In the paper you mentioned that LWC is better than LSQ and PACT, but in my opinion there seems to be no difference among these methods. If weights are symmetrically quantized, using a learnable quantization step size is equivalent to using a learnable clipping threshold, right? In that case, LWC is just like LSQ or PACT. The results in Table A3 of your paper are somewhat confusing to me. (You mentioned that the LLM-QAT paper has similar findings, but LLM-QAT just shows that simple min-max clipping is enough, without the need for learnable clipping, right? Their findings do not seem to have much correlation with the Table A3 results in your paper.)
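
(For concreteness, under the usual signed symmetric convention: a clipping threshold alpha and a step size s are tied by alpha = s * (2^(N-1) - 1) for N-bit quantization, so learning one is just a reparameterization of learning the other.)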

ChenMnZ commented 10 months ago

@brisker In our paper, we used both LWC and LET. LET continually changes the magnitude of the weights during training, so directly learning a step size or a clipping threshold is unstable. Therefore we designed LWC, which learns the clipping threshold as a learnable scale on the maximum and minimum of the weights after LET, which benefits training stability.
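
For reference, a minimal per-tensor sketch of this idea (function and parameter names are illustrative; the actual code operates per group of weights):

import torch

def lwc_fake_quant(W, gamma_logit, beta_logit, n_bits=3):
    # Learnable weight clipping: the clipping range is a learnable
    # fraction (via sigmoid) of the weight min/max after LET,
    # rather than a free-standing step size as in LSQ.
    gamma = torch.sigmoid(gamma_logit)  # upper clipping strength in (0, 1)
    beta = torch.sigmoid(beta_logit)    # lower clipping strength in (0, 1)
    qmax = 2 ** n_bits - 1
    h = (gamma * W.max() - beta * W.min()) / qmax       # quantization step
    z = -torch.round(beta * W.min() / h)                # zero point
    W_q = torch.clamp(torch.round(W / h) + z, 0, qmax)  # quantize
    return (W_q - z) * h                                # dequantized weights

W = torch.randn(4096, 4096)
W_dq = lwc_fake_quant(W, torch.tensor(4.0), torch.tensor(4.0))

Because gamma and beta are bounded in (0, 1) by the sigmoid, the learned range can only shrink relative to the current min/max, which keeps the optimization stable even as LET reshapes the weights.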