Thanks for your attention to our work. Matrix multiplication between int and float operands is not supported, but in our experience with model optimization, int16 activations give results that are essentially aligned with float16 (for both SD and LLM models).
So I suggest you try W2Aint16. In that case you can directly use our operator for acceleration; it is designed for W2 scenarios.
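To make that suggestion concrete, here is a minimal sketch (plain PyTorch, not ABQ-LLM's actual API) of symmetric per-token int16 activation quantization; it illustrates why Aint16 is usually near-lossless relative to float16. The tensor names and shapes are illustrative assumptions.

import torch

# Example bfloat16 activations, upcast to float32 for the quantization math.
x = torch.randn(64, 1024, dtype=torch.bfloat16).float()

# Per-token (per-row) symmetric scale so each row maps into [-32767, 32767].
scale = x.abs().amax(dim=-1, keepdim=True) / 32767.0
x_int16 = torch.clamp(torch.round(x / scale), -32767, 32767).to(torch.int16)

# Dequantize and measure the error introduced by int16 activation quantization.
x_deq = x_int16.float() * scale
rel_err = ((x - x_deq).abs().max() / x.abs().max()).item()
print(f"max relative error after int16 quantization: {rel_err:.2e}")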
Thanks for the reply. So if every linear layer in my model has bfloat16 weights with values in {0, 1, -1}, and the inputs and outputs are bfloat16, what steps should I follow to use your W2Aint16 operator? Thanks!
Hi, if I have a linear layer whose weights only take values in {0, 1, -1}, is it possible to use your kernel for weight compression and inference speed-up? My current weights are in bfloat16 format.
For example, if I have this code:
import torch

# Random bfloat16 activations and a ternary weight matrix with values in {-1, 0, 1}.
input = torch.randn(64, 1024, dtype=torch.bfloat16).cuda()
weights = torch.randint(-1, 2, (1024, 1024), dtype=torch.int8)
weights_bf16 = weights.bfloat16().cuda()
# Standard bfloat16 linear: output = input @ weights_bf16.T
output = torch.nn.functional.linear(input, weights_bf16, None)
How can I use ABQ's kernel to optimize this computation? Thanks!
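For reference, a minimal sketch (plain PyTorch only, assuming nothing about ABQ-LLM's real kernel API) of how ternary bfloat16 weights like these could be prepared for a W2Aint16 path: pack the {-1, 0, 1} values into 2-bit codes, quantize the activations to int16, and compute the equivalent matmul as a reference. The helper pack_w2 and all tensor names are hypothetical; the actual acceleration would come from the project's CUDA GEMM operating on the packed data.

import torch

input = torch.randn(64, 1024, dtype=torch.bfloat16).cuda()
weights = torch.randint(-1, 2, (1024, 1024), dtype=torch.int8).cuda()

def pack_w2(w_int8: torch.Tensor) -> torch.Tensor:
    # Map {-1, 0, 1} -> {0, 1, 2} and store four 2-bit codes per byte.
    codes = (w_int8 + 1).to(torch.uint8)
    codes = codes.reshape(w_int8.shape[0], -1, 4)
    return codes[..., 0] | (codes[..., 1] << 2) | (codes[..., 2] << 4) | (codes[..., 3] << 6)

packed = pack_w2(weights)  # shape (1024, 256), uint8: 4x smaller than int8 storage

# Aint16: symmetric per-token quantization of the activations.
a_scale = input.float().abs().amax(dim=-1, keepdim=True) / 32767.0
a_int16 = torch.clamp(torch.round(input.float() / a_scale), -32767, 32767).to(torch.int16)

# Reference computation equivalent to the bfloat16 linear above
# (a W2Aint16 kernel would consume the packed weights and int16 activations directly).
ref = (a_int16.float() @ weights.float().t()) * a_scale
print(ref.shape)  # torch.Size([64, 1024])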