bytedance / decoupleQ

A quantization algorithm for LLMs
Apache License 2.0
91 stars, 5 forks

LinearA16 and LinearW2A16 output does not match #12

Open XLuoxing opened 1 month ago

XLuoxing commented 1 month ago

I used decoupleQ to compress llama-3-8b, and the results computed with fake_quant (LinearA16) do not match the results computed with true_quant (LinearW2A16). In the screenshot, the left side is the LinearW2A16 output and the right side is the LinearA16 output; the input dtype is bfloat16.

What could be the possible cause of this?

[screenshot: LinearW2A16 output (left) vs. LinearA16 output (right)]
XLuoxing commented 1 month ago

Running the same input repeatedly, the output seems to be slightly different every time, but all elements of the output are very close to 0.

[screenshot: repeated runs on the same input, outputs all near zero]

Very occasionally I also get this error: *** RuntimeError: [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal

It looks as if the output tensor is simply returned zero-initialized and no real computation is performed. How would you suggest debugging or fixing this?
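One generic way to make an intermittent CUDA kernel failure like this surface deterministically is to force synchronous kernel launches and synchronize right after the suspect call. A minimal sketch, not specific to decoupleQ; `layer` and `x` stand for the LinearW2A16 module and input from the failing run:

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

out = layer(x)                 # the LinearW2A16 call under suspicion (assumed defined)
torch.cuda.synchronize()       # any pending kernel-launch error is raised here
print(out.abs().max().item())  # a near-zero max suggests the GEMM never actually ran
```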

GuoYi0 commented 1 month ago

@XLuoxing Thanks for your interest in our work. Could you check whether fake_quant.pth and true_quant.pth are consistent with each other? You can take the weights, scale, and zero from true_quant.pth, dequantize them, and check whether the result equals the corresponding element values in fake_quant.pth.
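A minimal sketch of that consistency check, assuming a grouped layout like the one in the test program later in this thread; the key names ("weight", "scale", "zero") and the group_size are illustrative, not the repo's exact checkpoint layout:

```python
import torch

fake = torch.load("fake_quant.pth", map_location="cpu")
true = torch.load("true_quant.pth", map_location="cpu")

def dequantize(w_int, scale, zero, group_size):
    # Expand each per-group scale/zero over its group of input rows,
    # then dequantize: w_fp = w_int * scale + zero
    scale = scale.repeat_interleave(group_size, dim=0)
    zero = zero.repeat_interleave(group_size, dim=0)
    return w_int.to(scale.dtype) * scale + zero

# Illustrative key names; adapt to the actual checkpoint layout.
w_deq = dequantize(true["weight"], true["scale"], true["zero"], group_size=64)
print(torch.allclose(w_deq.to(torch.bfloat16), fake["weight"].to(torch.bfloat16)))
```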

gavinchen430 commented 1 month ago

from decoupleQ.linear_w2a16 import LinearA16
from decoupleQ.linear_w2a16 import LinearW2A16
import torch

batch = 2
in_features = 5120
out_features = 5120
group_size = 64
with_bias = False

def expand_group(x, group_size):
    # Repeat each per-group scale/zero value group_size times along in_features.
    return x.transpose(0, 1).unsqueeze(-1).repeat(1, 1, group_size).view(x.size(1), -1).transpose(0, 1).contiguous()

linear_a16 = LinearA16(in_features, out_features, with_bias, group_size)
linear_w2a16 = LinearW2A16(in_features, out_features, with_bias, group_size)

weight_w2 = torch.randint(-2, 2, (in_features, out_features), dtype=torch.int8).cuda()
weight_scale = torch.rand((in_features // group_size, out_features), dtype=torch.bfloat16).cuda()
weight_zp = torch.rand((in_features // group_size, out_features), dtype=torch.bfloat16).cuda()

# Dequantize the int2 weights so LinearA16 computes the bf16 reference result.
weight_a16 = weight_w2.to(torch.bfloat16) * expand_group(weight_scale, group_size) + expand_group(weight_zp, group_size)

linear_a16.weight = weight_a16
linear_w2a16.weight = weight_w2
linear_w2a16.scale = weight_scale
linear_w2a16.zp = weight_zp

input = torch.rand((batch, in_features), dtype=torch.bfloat16).cuda()

out_a16 = linear_a16(input)
out_w2a16 = linear_w2a16(input)

print(out_a16)
print(out_w2a16)

You can run the test program above with the shapes used in your model and check whether the results are correct.

[screenshot: output of the test program]
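To quantify a mismatch rather than eyeballing the printed tensors, one could append something like the following to the test program above; the tolerances are a rough guess for bfloat16 accumulation, not values from the repo:

```python
# Compare the two outputs numerically instead of by inspection.
diff = (out_a16.float() - out_w2a16.float()).abs()
print("max abs diff:", diff.max().item())
print("allclose:", torch.allclose(out_a16.float(), out_w2a16.float(), rtol=1e-2, atol=1e-1))
```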

XLuoxing commented 1 month ago

@GuoYi0 I checked that: dequantizing true_quant (weight * scale + zero) and converting to bfloat16 does match the corresponding values in fake_quant.

XLuoxing commented 1 month ago

@gavinchen430 I ran your test program with my model's shapes: the shapes of the w2a16 and a16 outputs match, but the values do not.

[screenshot: test program output, values do not match]

Build log:

-- Found Torch: **/python3.11/site-packages/torch/lib/libtorch.so
f"python3 found" f"torch found" -- Configuring done (11.4s) -- Generating done (0.1s) -- Build files have been written to: ****/decoupleQ/csrc/build

Remark: The warnings can be suppressed with "-diag-suppress "

****/decoupleQ/csrc/w2a16.cu(60): warning #177-D: variable "res_ptr" was declared but never referenced
    const T* res_ptr = nullptr;
             ^
detected during:
    instantiation of "TrtllmFpAIntBGemm<T, WeightType, QuantOp>::TrtllmFpAIntBGemm() [with T=half, WeightType=cutlass::uint2b_t, QuantOp=cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS]" at line 962 of /usr/include/c++/11/bits/unique_ptr.h
    instantiation of "std::_MakeUniq<_Tp>::__single_object std::make_unique<_Tp,_Args...>(_Args &&...) [with _Tp=TrtllmFpAIntBGemm<half, cutlass::uint2b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS>, _Args=<>]" at line 104

****/decoupleQ/csrc/w2a16.cu(153): warning #177-D: variable "quant_range_scale" was declared but never referenced
    const float quant_range_scale = 1.f / float(1 << (bits_in_type - 1));
                ^

Remark: The warnings can be suppressed with "-diag-suppress "

****/decoupleQ/csrc/w2a16.cu(60): warning #177-D: variable "res_ptr" was declared but never referenced
    const T* res_ptr = nullptr;
             ^
detected during:
    instantiation of "TrtllmFpAIntBGemm<T, WeightType, QuantOp>::TrtllmFpAIntBGemm() [with T=half, WeightType=cutlass::uint2b_t, QuantOp=cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS]" at line 962 of /usr/include/c++/11/bits/unique_ptr.h
    instantiation of "std::_MakeUniq<_Tp>::__single_object std::make_unique<_Tp,_Args...>(_Args &&...) [with _Tp=TrtllmFpAIntBGemm<half, cutlass::uint2b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS>, _Args=<>]" at line 104

[ 91%] Linking CUDA device code CMakeFiles/decoupleQ_kernels.dir/cmake_device_link.o
[100%] Linking CXX shared library libdecoupleQ_kernels.so
[100%] Built target decoupleQ_kernels

gavinchen430 commented 1 month ago

Are you testing on an A10? From the build log, the code was compiled for the sm_86 architecture and everything looks normal. On my side the test also passes on an A10. Build environment: cmake 3.26.3, nvcc 12.2, torch 2.1.0, GPU driver 510.47.03.
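A quick way to check which device and compute capability the runtime actually sees, to compare against the sm_86 code the build produced; a generic PyTorch snippet, not decoupleQ-specific:

```python
import torch

# An A10 should report compute capability (8, 6), i.e. sm_86.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
```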