Open XLuoxing opened 1 month ago
Running the same input through the model repeatedly seems to produce slightly different outputs each time, and the output elements are all very close to 0.
Very occasionally I get this error: `*** RuntimeError: [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal`
It looks as if the zero-initialized output buffer is being returned without any real computation happening. What is the best way to debug or fix this?
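One thing worth ruling out first (a generic debugging sketch, not part of the decoupleQ API; `checked_forward` is a hypothetical helper name): CUDA kernel launches are asynchronous, so a GEMM that fails inside the extension can leave the zero-initialized output buffer untouched while the Python call appears to succeed. Forcing synchronous launches, or synchronizing right after the layer call, makes the error surface at the real failure point instead of intermittently later.

```python
import os

# Must be set before CUDA is initialized: makes every kernel launch
# synchronous, so a failing kernel raises at its call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checked_forward(layer, x):
    """Run a layer and synchronize, so asynchronous CUDA errors surface
    immediately instead of corrupting a later, unrelated call."""
    out = layer(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out
```

Wrapping the suspect `LinearW2A16` call in such a helper would tell you whether the cutlass error actually fires on every call and is just being swallowed asynchronously.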
@XLuoxing Thanks for your interest in our work. Could you check whether fake_quant.pth and true_quant.pth are consistent? Take the weights, scale, and zero from true_quant.pth, dequantize them, and check whether the result equals the corresponding elements in fake_quant.pth.
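That check could be sketched like this (synthetic stand-in tensors; in practice, load `weight`/`scale`/`zero` from true_quant.pth and the bf16 weight from fake_quant.pth — the actual key names depend on how the checkpoints were exported):

```python
import torch

in_features, out_features, group_size = 128, 16, 64

# Stand-ins for the tensors stored in true_quant.pth: 2-bit codes kept in
# int8, plus one scale/zero row per group of `group_size` input channels.
w_int2 = torch.randint(-2, 2, (in_features, out_features), dtype=torch.int8)
scale = torch.rand(in_features // group_size, out_features)
zero = torch.rand(in_features // group_size, out_features)

# Per-group dequantization: broadcast each group's scale/zero over its
# `group_size` rows, then w_bf16 = w_int2 * scale + zero.
w_dequant = (
    w_int2.float() * scale.repeat_interleave(group_size, dim=0)
    + zero.repeat_interleave(group_size, dim=0)
).to(torch.bfloat16)

# w_dequant should match the corresponding bf16 weight from fake_quant.pth
# elementwise, e.g. via torch.equal(w_dequant, fake_weight).
print(w_dequant.shape)
```

If the two tensors disagree, the export step is the problem; if they agree, the mismatch is inside the W2A16 kernel.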
```python
import torch
from decoupleQ.linear_w2a16 import LinearA16
from decoupleQ.linear_w2a16 import LinearW2A16

batch = 2
in_features = 5120
out_features = 5120
group_size = 64
with_bias = False

def expand_group(x, group_size):
    # Broadcast per-group scale/zero along in_features: (in/g, out) -> (in, out).
    return x.transpose(0, 1).unsqueeze(-1).repeat(1, 1, group_size).view(x.size(1), -1).transpose(0, 1).contiguous()

linear_a16 = LinearA16(in_features, out_features, with_bias, group_size)
linear_w2a16 = LinearW2A16(in_features, out_features, with_bias, group_size)

weight_w2 = torch.randint(-2, 2, (in_features, out_features), dtype=torch.int8).cuda()
weight_scale = torch.rand((in_features // group_size, out_features), dtype=torch.bfloat16).cuda()
weight_zp = torch.rand((in_features // group_size, out_features), dtype=torch.bfloat16).cuda()

# Reference bf16 weight: dequantize the int2 codes with per-group scale/zero.
weight_a16 = weight_w2.to(torch.bfloat16) * expand_group(weight_scale, group_size) + expand_group(weight_zp, group_size)

linear_a16.weight = weight_a16
linear_w2a16.weight = weight_w2
linear_w2a16.scale = weight_scale
linear_w2a16.zp = weight_zp

input = torch.rand((batch, in_features), dtype=torch.bfloat16).cuda()

# The two outputs should match (up to bf16 rounding) if the kernel is correct.
out_a16 = linear_a16(input)
out_w2a16 = linear_w2a16(input)

print(out_a16)
print(out_w2a16)
```
You can run the test program above with your model's shapes to see whether the kernel behaves correctly.
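To go beyond eyeballing the two printouts, a small helper (my own sketch, not part of decoupleQ) can quantify the mismatch between the two outputs:

```python
import torch

def compare(a: torch.Tensor, b: torch.Tensor) -> dict:
    """Summarize how far two output tensors are from each other."""
    a32, b32 = a.float(), b.float()
    return {
        "max_abs_diff": (a32 - b32).abs().max().item(),
        "cosine_sim": torch.nn.functional.cosine_similarity(
            a32.flatten(), b32.flatten(), dim=0
        ).item(),
    }

# Synthetic tensors here; in the test program above, pass out_a16 and
# out_w2a16. A healthy kernel should give cosine_sim very close to 1.
x = torch.rand(2, 5120, dtype=torch.bfloat16)
print(compare(x, x))
```

Near-zero cosine similarity would point to an untouched or garbage output buffer rather than mere bf16 rounding noise.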
> @XLuoxing Thanks for your interest in our work. Could you check whether fake_quant.pth and true_quant.pth are consistent? Take the weights, scale, and zero from true_quant.pth, dequantize them, and check whether the result equals the corresponding elements in fake_quant.pth.
I checked, and they do match: taking weight * scale + zero from true_quant and converting it to bfloat16 reproduces the values in fake_quant.
> You can run the test program above with your model's shapes to see whether the kernel behaves correctly.
I ran it: the shapes of the w2a16 and a16 outputs match, but the values do not.
build log:
```
-- Found Torch: **/python3.11/site-packages/torch/lib/libtorch.so
f"python3 found"
f"torch found"
-- Configuring done (11.4s)
-- Generating done (0.1s)
-- Build files have been written to: ****/decoupleQ/csrc/build
Remark: The warnings can be suppressed with "-diag-suppress
****/decoupleQ/csrc/w2a16.cu(60): warning #177-D: variable "res_ptr" was declared but never referenced
      const T* res_ptr = nullptr;
               ^
          detected during:
            instantiation of "TrtllmFpAIntBGemm<T, WeightType, QuantOp>::TrtllmFpAIntBGemm() [with T=half, WeightType=cutlass::uint2b_t, QuantOp=cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS]" at line 962 of /usr/include/c++/11/bits/unique_ptr.h
            instantiation of "std::_MakeUniq<_Tp>::__single_object std::make_unique<_Tp,_Args...>(_Args &&...) [with _Tp=TrtllmFpAIntBGemm<half, cutlass::uint2b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS>, _Args=<>]" at line 104

****/decoupleQ/csrc/w2a16.cu(153): warning #177-D: variable "quant_range_scale" was declared but never referenced
      const float quant_range_scale = 1.f / float(1 << (bits_in_type - 1));
                                            ^
[ 91%] Linking CUDA device code CMakeFiles/decoupleQ_kernels.dir/cmake_device_link.o
[100%] Linking CXX shared library libdecoupleQ_kernels.so
[100%] Built target decoupleQ_kernels
```
Are you testing on an A10? The build compiles for the sm_86 architecture, and the log looks normal. The test also passes on an A10 on my side. My build environment: cmake 3.26.3, nvcc 12.2, torch 2.1.0, GPU driver 510.47.03.
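If there is any doubt about the architecture match, the device's compute capability can be compared against the arch the extension was built for (a sketch; `sm_tag` is a hypothetical helper, and an A10 reports capability `(8, 6)`, i.e. sm_86):

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability as the nvcc-style sm_XY tag."""
    return f"sm_{major}{minor}"

try:
    import torch
    if torch.cuda.is_available():
        # Running kernels built only for sm_86 on a different architecture
        # can cause launch failures or untouched output buffers.
        print(sm_tag(*torch.cuda.get_device_capability(0)))
    else:
        print("CUDA not available")
except ImportError:
    print("torch not installed")
```

A mismatch here (e.g. sm_80 on an A100) would be consistent with the intermittent cutlass "Error Internal" and near-zero outputs.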
After compressing llama-3-8b with decoupleQ, the result of fake_quant + LinearA16 does not match the result of true_quant + LinearW2A16. In the screenshot, the left side is the LinearW2A16 output and the right side is the LinearA16 output; the input dtype is bfloat16.
What could be causing this?