bytedance / ABQ-LLM

An acceleration library that supports arbitrary bit-width combinatorial quantization operations
Apache License 2.0

About W2A16 weight only matmul #10

Open goddice opened 1 month ago

goddice commented 1 month ago

Hi, if I have a linear layer whose weights only take the values {0, 1, -1}, is it possible to use your kernel for weight compression and inference speed-up? My current weights are in bfloat16 format.

For example, if I have this code:

import torch

# Baseline: ternary weights stored as bfloat16, plain bf16 matmul.
input = torch.randn(64, 1024, dtype=torch.bfloat16).cuda()
weights = torch.randint(-1, 2, (1024, 1024), dtype=torch.int8)
weights_bf16 = weights.bfloat16().cuda()
output = torch.nn.functional.linear(input, weights_bf16, None)

How can I use ABQ's kernel to optimize this computation? Thanks!

lswzjuer commented 1 month ago

Thanks for your interest in our work. Matrix multiplication between int weights and float activations is not supported, but in our model-optimization experience, int16 activations are basically aligned with float16 in accuracy (for both SD and LLM models).

So I suggest you try W2Aint16; in that case you can use our operator directly for acceleration. Our operator is well suited to W2 scenarios.
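
For concreteness, here is a minimal sketch of what a W2Aint16 data layout implies for ternary weights. It only shows the quantization side in plain PyTorch and emulates the integer GEMM with a regular matmul; the call into ABQ-LLM's actual CUDA kernel is not shown, since its exact API is not covered in this thread.

import torch

# Sketch only: ternary bf16 weights map losslessly onto 2-bit signed codes
# with a unit scale; activations are quantized to symmetric int16.
torch.manual_seed(0)

x = torch.randn(64, 1024, dtype=torch.bfloat16).cuda()
w_bf16 = torch.randint(-1, 2, (1024, 1024), dtype=torch.int8).bfloat16().cuda()

# 2-bit weight codes: the values are exactly {-1, 0, 1}, so scale 1.0 is lossless.
w_int2 = w_bf16.to(torch.int8)
w_scale = 1.0

# Symmetric per-tensor int16 quantization of the activations.
a_scale = x.abs().max().float() / 32767.0
x_int16 = (x.float() / a_scale).round().clamp(-32768, 32767).to(torch.int16)

# The integer GEMM is emulated in fp64 here for clarity; this is the part a
# W2Aint16 kernel would replace with a fused low-bit CUDA matmul.
acc = x_int16.double() @ w_int2.double().t()
y = (acc * a_scale.double() * w_scale).bfloat16()

# Compare with the plain bfloat16 reference.
y_ref = torch.nn.functional.linear(x, w_bf16, None)
print((y.float() - y_ref.float()).abs().max())

The residual printed at the end comes only from the int16 activation quantization; the ternary weights themselves are represented exactly by the 2-bit codes.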

goddice commented 1 month ago

Thanks for the reply. If all of the linear layers in my model have bfloat16 weights with values {0, 1, -1}, and the inputs and outputs are bfloat16, what steps should I take to use your W2Aint16 operator? Thanks!