PaddlePaddle / Paddle

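PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)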
http://www.paddlepaddle.org/
Apache License 2.0

[PHI] add int4 weight only quant kernel, add int4 weight only permute kernel #64091

Closed yinfan98 closed 1 week ago

yinfan98 commented 1 week ago

PR Category

Others

PR Types

New features

Description

Adds to Paddle an int4 quantization kernel and a kernel that performs the permute for int4 quantization.

int4 quantization kernel

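int4 quantization is implemented by packing two int4 values into one int8 value. In the code, two vertically adjacent rows are combined into one int8 value, so the packing is done column by column.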

int4 permute kernel

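For int4 quantization, the input data has to be rearranged to fit cutlass's fast dequantization kernel. Looking at the int4 dequantization side, the output that is finally required is: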

0   1   8   9  16  17  24  25   2   3  10  11  18  19  26  27
4   5  12  13  20  21  28  29   6   7  14  15  22  23  30  31

From this we can work backwards to the format the quantized data must have. Fast dequantization maps

0 2 4 6 1 3 5 7 -> 0 1 2 3 4 5 6 7

so the data we need before fast dequantization is

//  0   8  16  24   1   9  17  25   2  10  18  26   3  11  19  27
//  4  12  20  28   5  13  21  29   6  14  22  30   7  15  23  31
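The backward step can be checked with a short sketch. It reads the mapping `0 2 4 6 1 3 5 7 -> 0 1 2 3 4 5 6 7` as "each group of 8 stored int4 values holds logical elements 0 2 4 6 1 3 5 7 in that order". The function name `LayoutBeforeFastDequant` is hypothetical and does not appear in the PR.

```cpp
#include <array>

// Hypothetical helper: given the order wanted after fast dequantization,
// returns the order that must be stored before it.
// Assumption: fast dequantization works on independent groups of 8 int4
// values and turns stored order (e0 e2 e4 e6 e1 e3 e5 e7) into (e0 .. e7).
std::array<int, 32> LayoutBeforeFastDequant(const std::array<int, 32>& wanted) {
  constexpr int kStoredOrder[8] = {0, 2, 4, 6, 1, 3, 5, 7};
  std::array<int, 32> stored{};
  for (int g = 0; g < 4; ++g) {
    for (int j = 0; j < 8; ++j) {
      // Stored slot j of the group carries the element wanted at slot
      // kStoredOrder[j] of the same group.
      stored[g * 8 + j] = wanted[g * 8 + kStoredOrder[j]];
    }
  }
  return stored;
}
```

Feeding in the required output layout listed earlier reproduces the two rows shown just above.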

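This sequence looks like it has no pattern at all, but with a small adjustment it can be brought into the form below; getting back from that form takes only a few simple bit operations.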

// 0 1 16 17 8 9 24 25 2 3 18 19 10 11 26 27
// 4 5 20 21 12 13 28 29 6 7 22 23 14 15 30 31

Since two int4 values are packed into one int8, the numbers above can also be rewritten as int8 indices:

0 8 4 12 1 9 5 13 2 10 6 14 3 11 7 15
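A small sketch of this conversion, assuming every int8 holds the int4 pair (2i, 2i + 1) so that its index is simply the first int4 index divided by 2. `ToInt8Indices` is a hypothetical name used only here.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper: collapses a layout of 32 int4 indices, in which
// neighbours form pairs (2i, 2i + 1), into the 16 int8 indices that hold them.
std::array<int, 16> ToInt8Indices(const std::array<int, 32>& int4_layout) {
  std::array<int, 16> int8_layout{};
  for (int i = 0; i < 16; ++i) {
    const int first = int4_layout[2 * i];
    const int second = int4_layout[2 * i + 1];
    // Each pair must be an even index followed by its successor.
    assert(first % 2 == 0 && second == first + 1);
    int8_layout[i] = first / 2;
  }
  return int8_layout;
}
```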

Then, for the mapping

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 -> 0 8 4 12 1 9 5 13 2 10 6 14 3 11 7 15

the coordinates (the position at which each source index ends up) are

0 4 8 12 2 6 10 14 1 5 9 13 3 7 11 15
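These coordinates are the inverse of the permutation on the right-hand side of the arrow. A minimal sketch of that inversion, with a hypothetical helper name:

```cpp
#include <array>

// Hypothetical helper: layout[pos] says which source index sits at position
// pos. The result says, for each source index, at which position it lands.
std::array<int, 16> InvertPermutation(const std::array<int, 16>& layout) {
  std::array<int, 16> position{};
  for (int pos = 0; pos < 16; ++pos) {
    position[layout[pos]] = pos;
  }
  return position;
}
```

For example, source index 1 sits at position 4 of the target layout, so the second coordinate is 4.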

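This gives the new permute_kk (a variable in the code that describes the permute between columns). It can be obtained by making a small change to the int8 permute_kk.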

How to convert the int8 permute into the int4 permute

The int8 permute is

0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15

It can be rearranged into

0 2 4 6 1 3 5 7 8 10 12 14 9 11 13 15

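Apply % 8 * 2 to each entry:

0 4 8 12 2 6 10 14 0 4 8 12 2 6 10 14

Then add 1 to the second half, shown in brackets: 0 4 8 12 2 6 10 14 [0 4 8 12 2 6 10 14]. The result is 0 4 8 12 2 6 10 14 1 5 9 13 3 7 11 15, which is the int4 permute_kk derived above.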
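The three steps can be sketched as follows. The rearrangement is read here as swapping the two middle groups of four entries, which is an interpretation of the two sequences above rather than something the PR states. The function name is hypothetical.

```cpp
#include <array>

// Hypothetical helper: derives the 16-entry int4 permute from the 16-entry
// int8 permute using the steps described in the text.
std::array<int, 16> Int4PermuteFromInt8(const std::array<int, 16>& int8_perm) {
  std::array<int, 16> int4_perm{};
  for (int i = 0; i < 16; ++i) {
    // Step 1 (assumed reading): swap the second and third groups of four.
    const int group = i / 4;
    const int src_group = (group == 1) ? 2 : (group == 2) ? 1 : group;
    const int rearranged = int8_perm[src_group * 4 + i % 4];
    // Step 2: apply % 8 * 2.
    // Step 3: add 1 for the second half of the sequence.
    int4_perm[i] = rearranged % 8 * 2 + (i >= 8 ? 1 : 0);
  }
  return int4_perm;
}
```

With the int8 permute `0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15` as input, this yields `0 4 8 12 2 6 10 14 1 5 9 13 3 7 11 15`.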

Simple bit operations

// (0 1) (16 17) (8 9) (24 25) (2 3) (18 19) (10 11) (26 27)
// (4 5) (20 21) (12 13) (28 29) (6 7) (22 23) (14 15) (30 31)

//  0   8  16  24   1   9  17  25   2  10  18  26   3  11  19  27
//  4  12  20  28   5  13  21  29   6  14  22  30   7  15  23  31

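We can take the int8 values in groups of four and, within each group, exchange low and high 4 bits between elements 0 and 2 and between elements 1 and 3. For example, the pairs (0 1) and (8 9) become (0 8) and (1 9).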
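A minimal sketch of this nibble exchange on one group of four bytes. It assumes the first int4 of each pair lives in the low 4 bits of its byte. That ordering, and the name `SwapNibblesInGroup`, are assumptions for illustration and are not taken from the PR's kernel.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: for bytes (b0, b1, b2, b3), exchanges the high nibble
// of b0 with the low nibble of b2, and the high nibble of b1 with the low
// nibble of b3. Assumption: the first int4 of a pair is in the low nibble.
std::array<uint8_t, 4> SwapNibblesInGroup(std::array<uint8_t, 4> b) {
  for (int i = 0; i < 2; ++i) {
    const unsigned first = b[i];
    const unsigned second = b[i + 2];
    // New first byte: low nibbles of both bytes.
    b[i] = static_cast<uint8_t>((first & 0x0Fu) | ((second & 0x0Fu) << 4));
    // New second byte: high nibbles of both bytes.
    b[i + 2] = static_cast<uint8_t>((first >> 4) | (second & 0xF0u));
  }
  return b;
}
```

If the eight nibbles of a group are labelled 0 to 7 in storage order, the bytes `0x10 0x32 0x54 0x76` become `0x40 0x62 0x51 0x73`: pairs (0 1) and (4 5) turn into (0 4) and (1 5), which mirrors (0 1), (8 9) becoming (0 8), (1 9) in the layouts above.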

int4 row interval

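In the int8 case, the code interleaves every 64 elements across two adjacent rows. In the int4 case, it instead interleaves every 32 elements across four adjacent rows. The permute step is therefore written as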

int permute_index = permute_kk % 32 + permute_kk / 32 * 128 +
                        32 * (n_id % 4) + total_k * 4 * (n_id / 4);

This also matches the expected layout.

paddle-bot[bot] commented 1 week ago

Your PR has been submitted. Thanks for your contribution to this open source project! Please wait for the results of the automated CI tests first. See the Paddle CI Manual for details.

CLAassistant commented 1 week ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

:white_check_mark: yinfan98
:x: Your Name


Your Name seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
