HandH1998 / QQQ

QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
https://arxiv.org/pdf/2406.09904

Question on rotation #13

Open cli99 opened 3 months ago

cli99 commented 3 months ago

Nice to see the newly added rotation support!

I saw the comment at https://github.com/HandH1998/QQQ/blob/bafdb00fd901cd13f668f4028d7891276ded3bb2/examples/quant_model.py#L70. Can you share any model-quality results for the rotation, as well as any insights into the observation in that comment? Thanks.

HandH1998 commented 3 months ago
@cli99 We share the wikitext2 PPL results below.

| Granularity | Method | Llama2-7b | Llama2-13b | Llama3-8b |
|---|---|---|---|---|
| per-channel | smooth + GPTQ | 5.9683 | 5.2091 | 7.4474 |
| per-channel | rotation + GPTQ | 5.6872 | 5.0380 | 6.6940 |
| per-group | smooth + GPTQ | 5.7118 | 5.0103 | 6.6769 |
| per-group | rotation + GPTQ | 5.6726 | 5.0275 | 6.6134 |

The rotation method achieves better PPL than the smooth method, especially for per-channel quantization. We are currently evaluating the rotation method on accuracy datasets. Though the evaluation is not finished, we have observed a strange phenomenon: the per-group rotation method gets worse results on some datasets, such as MMLU and C-Eval. We are not sure whether this is due to an error on our part or whether the rotation method is simply not effective there. We will share the results later, and would appreciate any observations you might have. Thanks.
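
For readers unfamiliar with the rotation trick, here is a minimal, self-contained sketch of the idea (illustrative only, not the QQQ implementation; a random orthogonal matrix stands in for the Hadamard-style rotations actually used). Fusing an orthogonal R into the weights leaves the layer output unchanged while spreading outlier channels, which typically lowers per-channel quantization error:

```python
# Minimal sketch of rotation-before-quantization (illustrative, not the QQQ code).
# For y = x W^T, an orthogonal R satisfies x W^T = (x R)(W R)^T, so rotating both
# activations and weights leaves the layer output unchanged while spreading
# outlier channels across all channels.
import torch

def random_orthogonal(n: int) -> torch.Tensor:
    # QR decomposition of a random Gaussian matrix yields an orthogonal Q.
    q, _ = torch.linalg.qr(torch.randn(n, n))
    return q

def fake_quant_per_channel(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric per-output-channel fake quantization.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

torch.manual_seed(0)
w = torch.randn(512, 512)
w[:, :4] *= 20.0                       # simulate a few outlier input channels

r = random_orthogonal(w.shape[1])
w_rot = w @ r                          # rotated weights (activations would get x @ r)

err_plain = (fake_quant_per_channel(w) - w).pow(2).mean().item()
err_rot = (fake_quant_per_channel(w_rot) - w_rot).pow(2).mean().item()
print(f"per-channel quant MSE without rotation: {err_plain:.4f}")
print(f"per-channel quant MSE with rotation:    {err_rot:.4f}")
```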

RanchiZhao commented 2 months ago

I tried applying the QoQ method to MiniCPM3-4B and measured the score drops on the BBH, MMLU, C-Eval, CMMLU, HumanEval, MBPP, GSM8K, and MATH benchmarks. C-Eval, HumanEval, MBPP, and GSM8K dropped by roughly 10 percentage points, while the other benchmarks all stayed within 3 points. Additionally, the tests were conducted under fake quantization.

HandH1998 commented 2 months ago

@RanchiZhao Could you share the detailed test results? What tool did you use to run the tests? Are you using the rotation + GPTQ method?

RanchiZhao commented 2 months ago

@HandH1998 Yes, it is rotation + GPTQ, with evaluation based on transformers + UltraEval. Specifically, I applied lmquant (W4A8KV4, group size 32, adding only the R1-type rotation matrices described in SpinQuant, no smoothing involved) to an internal version of MiniCPM3-4B (a chat model). The calibration set was our SFT data, only 128 samples, with the other parameters left at their defaults. Because inference is very slow, I set the limit to 5 on evaluations like MMLU that involve multiple subsets, and to 50 on other tasks.
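
As a point of reference, "fake quantization" here means the weights are rounded to the 4-bit grid but kept in floating point, so accuracy can be measured without int4 kernels. A minimal sketch of per-group W4 fake quantization with group size 32 (illustrative only, not the lmquant implementation):

```python
# Illustrative per-group W4 fake quantization, group size 32 (not the lmquant code).
import torch

def fake_quant_w4_per_group(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    qmax = 7                                            # signed 4-bit range: [-8, 7]
    g = w.reshape(out_features, in_features // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True) / qmax   # one scale per 32-element group
    q = torch.clamp(torch.round(g / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)

w = torch.randn(128, 256)
w_fq = fake_quant_w4_per_group(w)
print("mean abs quantization error:", (w - w_fq).abs().mean().item())
```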

HandH1998 commented 2 months ago

@RanchiZhao @cli99 I found that the 'mse' option in GPTQ significantly impacts the quantization result. While enabling 'mse' can yield excellent wikitext2 PPL, it may overfit the calibration data, which can hurt performance on other datasets such as MMLU and C-Eval. I hope this information is helpful.
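
For context, the 'mse' option refers to searching over shrunken clipping ranges and keeping the scale that minimizes reconstruction error on the calibration tensor, instead of using the plain absmax range. Below is a simplified sketch of that idea (not the exact GPTQ code); because the clipping is fitted to the calibration data, it can overfit and hurt held-out benchmarks:

```python
# Simplified sketch of an MSE-based clipping search (not the exact GPTQ 'mse' option).
# The full absmax range is shrunk step by step; for each candidate scale we
# measure the reconstruction error and keep the best scale per output channel.
import torch

def search_mse_scale(w: torch.Tensor, bits: int = 4, steps: int = 20) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    absmax = w.abs().amax(dim=1, keepdim=True)            # per-output-channel range
    best_scale = absmax / qmax
    best_err = torch.full_like(absmax, float("inf"))
    for i in range(steps):
        shrink = 1.0 - 0.5 * i / steps                    # candidate ranges: 100% down to ~50%
        scale = absmax * shrink / qmax
        deq = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        err = (deq - w).pow(2).mean(dim=1, keepdim=True)
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_scale = torch.where(better, scale, best_scale)
    return best_scale

w = torch.randn(64, 512)
print(search_mse_scale(w).squeeze()[:4])                  # per-channel scales after the search
```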

Andy0422 commented 1 month ago

@HandH1998 Were these PPL results obtained with the same calibration dataset? For example, was smooth + GPTQ also calibrated on wikitext2?

HandH1998 commented 1 month ago

@Andy0422 The Pile for the smooth calibration, and wikitext2 for GPTQ.