cli99 opened this issue 3 months ago
@cli99 We share the wikitext2 PPL results below.

| Granularity | Method | Llama2-7b | Llama2-13b | Llama3-8b |
|---|---|---|---|---|
| per-channel | smooth + GPTQ | 5.9683 | 5.2091 | 7.4474 |
| per-channel | rotation + GPTQ | 5.6872 | 5.0380 | 6.6940 |
| per-group | smooth + GPTQ | 5.7118 | 5.0103 | 6.6769 |
| per-group | rotation + GPTQ | 5.6726 | 5.0275 | 6.6134 |
The rotation method gets better PPL than the smooth method, especially for per-channel quantization. We are currently evaluating the rotation method on accuracy datasets. Although the evaluation is not finished, we have observed a strange phenomenon: the per-group rotation method gets worse results on some datasets, such as MMLU and C-Eval. We are not sure whether this is due to an error on our part or whether the rotation method is simply not effective there. We will share the results later. We would appreciate any observations you might have. Thanks.
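For readers less familiar with the granularity column in the table, here is a minimal, hedged sketch of symmetric INT4 weight fake-quantization at per-channel vs. per-group granularity. This is not the QQQ implementation; function names and the group size of 128 are illustrative.

```python
import torch

def fake_quant_per_channel(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # w: [out_features, in_features]; one symmetric scale per output channel
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def fake_quant_per_group(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    # one symmetric scale per contiguous group of `group_size` input channels
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"
    qmax = 2 ** (n_bits - 1) - 1
    wg = w.reshape(out_f, in_f // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return ((wg / scale).round().clamp(-qmax - 1, qmax) * scale).reshape(out_f, in_f)
```

Per-group quantization uses many more scales than per-channel, so it can track local weight ranges more closely, which is consistent with the smaller PPL gap between smooth and rotation in the per-group rows.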
I tried applying the QoQ method to MiniCPM3-4B and measured the accuracy drop on the BBH/MMLU/C-Eval/CMMLU/HumanEval/MBPP/GSM8K/MATH benchmarks. C-Eval, HumanEval, MBPP, and GSM8K dropped by about 10 percentage points, while the other benchmarks stayed within three points. Additionally, the tests were conducted under fake quantization.
@RanchiZhao Could you share the detailed test results? What tool did you use to run the tests? Are you using the rotation + GPTQ method?
@HandH1998 Yes, it is rotation + GPTQ, with evaluation based on transformers + UltraEval. Specifically, I applied lmquant (w4a8kv4, group size = 32, only adding the R1-type rotation matrices described in SpinQuant, no smoothing involved) to an internal version of MiniCPM3-4B (a chat model). The calibration set used our SFT data, only 128 samples, with other parameters left at their defaults. Because inference is very slow, I set the limit to 5 for evaluations like MMLU that involve multiple subsets, and a limit of 50 for other tasks.
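For context, the R1 rotation in SpinQuant/QuaRot relies on computational invariance: the residual-stream hidden states can be rotated by an orthogonal matrix, and that rotation folded into the weights on either side, without changing the network's outputs (assuming the RMSNorm scale has already been folded into the adjacent linear layers). A rough sketch of that weight fusion; names are illustrative and this is not lmquant's actual code:

```python
import torch

def random_orthogonal(dim: int, dtype=torch.float64) -> torch.Tensor:
    # QR decomposition of a random Gaussian matrix yields an orthogonal Q
    q, _ = torch.linalg.qr(torch.randn(dim, dim, dtype=dtype))
    return q

@torch.no_grad()
def fuse_r1(w_in: torch.Tensor, w_out: torch.Tensor, r: torch.Tensor):
    """w_in:  weight of a layer that reads the hidden state, shape [out, hidden]
       w_out: weight of a layer that writes the hidden state, shape [hidden, in]
       If the hidden state is rotated as h -> h @ R, rewriting the weights as
       below keeps the layer outputs mathematically unchanged."""
    w_in_new = (w_in.double() @ r).to(w_in.dtype)         # reads the rotated h
    w_out_new = (r.t() @ w_out.double()).to(w_out.dtype)  # writes the rotated h
    return w_in_new, w_out_new
```

In the full R1 scheme the embedding table and the output head are rotated the same way, and a (random) Hadamard matrix is commonly used instead of a QR-derived one because it is cheap to apply and tends to spread outlier channels more evenly.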
@RanchiZhao @cli99 I discovered that the 'mse' option in GPTQ significantly impacts the quantization result. While enabling 'mse' can yield excellent wikitext2 PPL, it might cause overfitting, which could negatively affect performance on other datasets, such as MMLU and C-Eval. I hope this information is helpful.
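For anyone reproducing this: an 'mse'-style option typically means grid-searching a clipping ratio that minimizes weight reconstruction error instead of using the plain absmax range. A hedged sketch of the general idea (illustrative names and step counts, not GPTQ's exact code):

```python
import torch

def search_clip_scale(w: torch.Tensor, n_bits: int = 4, steps: int = 80) -> torch.Tensor:
    # w: [out_features, in_features], symmetric per-channel quantization
    qmax = 2 ** (n_bits - 1) - 1
    absmax = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    best_scale = absmax / qmax
    best_err = torch.full_like(absmax, float("inf"))
    for i in range(steps):
        ratio = 1.0 - i / steps * 0.8  # shrink the range from 1.0x down to ~0.2x
        scale = absmax * ratio / qmax
        w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
        err = (w_q - w).pow(2).mean(dim=1, keepdim=True)
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_scale = torch.where(better, scale, best_scale)
    return best_scale
```

Shrinking the range reduces rounding error for the bulk of the weights but clips outliers, which is one way such a search can end up tuned to the calibration/PPL setting while hurting other benchmarks.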
@HandH1998 Were these PPL results obtained with the same calibration dataset, say, both smooth + GPTQ calibrated on wikitext2?
@Andy0422 Pile for smooth and wikitext2 for GPTQ.
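In case it helps anyone reproduce the setup, this is roughly how a wikitext2 calibration set is usually built for GPTQ-style quantizers. It is a hedged sketch; the sample count, sequence length, and function name are illustrative, not QQQ's exact pipeline.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_wikitext2_calib(model_id: str, n_samples: int = 128, seq_len: int = 2048) -> torch.Tensor:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Concatenate the whole split, then cut random windows of seq_len tokens
    ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids
    samples = []
    for _ in range(n_samples):
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)  # [n_samples, seq_len]
```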
Nice to see the newly added rotation support!
I saw the comment below in the code: https://github.com/HandH1998/QQQ/blob/bafdb00fd901cd13f668f4028d7891276ded3bb2/examples/quant_model.py#L70 Can you share any model quality results for the rotation method, as well as any insights into the observation in that comment? Thanks.