bytedance / decoupleQ

A quantization algorithm for LLMs
Apache License 2.0
91 stars, 5 forks

LinearA16 and LinearW2A16 output does not match #12

Open XLuoxing opened 1 month ago

XLuoxing commented 1 month ago

I used decoupleQ to compress llama-3-8b, and the results computed with fake_quant (LinearA16) do not match the results computed with true_quant (LinearW2A16). In the screenshot, the left side is the LinearW2A16 output and the right side is the LinearA16 output; the input dtype is bfloat16.

What could be the possible cause of this?

[screenshot: LinearW2A16 output (left) vs. LinearA16 output (right)]
XLuoxing commented 1 month ago

Running the same input repeatedly, the output seems to be slightly different every time, but all elements of the output are very close to 0.

[screenshot: repeated runs on the same input, outputs all near zero]

Very occasionally I also get this error: *** RuntimeError: [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal

It looks as if the output tensor is simply returned zero-initialized and no real computation is performed. How would you suggest debugging or fixing this?
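One generic way to make an intermittent CUDA kernel failure like this surface deterministically is to force synchronous kernel launches and synchronize right after the suspect call. A minimal sketch, not specific to decoupleQ; `layer` and `x` stand for the LinearW2A16 module and input from the failing run:

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

out = layer(x)                 # the LinearW2A16 call under suspicion (assumed defined)
torch.cuda.synchronize()       # any pending kernel-launch error is raised here
print(out.abs().max().item())  # a near-zero max suggests the GEMM never actually ran
```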

GuoYi0 commented 1 month ago

@XLuoxing Thanks for your interest in our work. Could you check whether fake_quant.pth and true_quant.pth are consistent with each other? You can take the weights, scale, and zero from true_quant.pth, dequantize them, and check whether the result equals the corresponding element values in fake_quant.pth.
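A minimal sketch of that consistency check, assuming a grouped layout like the one in the test program later in this thread; the key names ("weight", "scale", "zero") and the group_size are illustrative, not the repo's exact checkpoint layout:

```python
import torch

fake = torch.load("fake_quant.pth", map_location="cpu")
true = torch.load("true_quant.pth", map_location="cpu")

def dequantize(w_int, scale, zero, group_size):
    # Expand each per-group scale/zero over its group of input rows,
    # then dequantize: w_fp = w_int * scale + zero
    scale = scale.repeat_interleave(group_size, dim=0)
    zero = zero.repeat_interleave(group_size, dim=0)
    return w_int.to(scale.dtype) * scale + zero

# Illustrative key names; adapt to the actual checkpoint layout.
w_deq = dequantize(true["weight"], true["scale"], true["zero"], group_size=64)
print(torch.allclose(w_deq.to(torch.bfloat16), fake["weight"].to(torch.bfloat16)))
```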

gavinchen430 commented 1 month ago

from decoupleQ.linear_w2a16 import LinearA16
from decoupleQ.linear_w2a16 import LinearW2A16
import torch

batch = 2
in_features = 5120
out_features = 5120
group_size = 64
with_bias = False

def expand_group(x, group_size):
    # Repeat each per-group scale/zero value group_size times along in_features.
    return x.transpose(0, 1).unsqueeze(-1).repeat(1, 1, group_size).view(x.size(1), -1).transpose(0, 1).contiguous()

linear_a16 = LinearA16(in_features, out_features, with_bias, group_size)
linear_w2a16 = LinearW2A16(in_features, out_features, with_bias, group_size)

weight_w2 = torch.randint(-2, 2, (in_features, out_features), dtype=torch.int8).cuda()
weight_scale = torch.rand((in_features // group_size, out_features), dtype=torch.bfloat16).cuda()
weight_zp = torch.rand((in_features // group_size, out_features), dtype=torch.bfloat16).cuda()

# Dequantize the int2 weights so LinearA16 computes the bf16 reference result.
weight_a16 = weight_w2.to(torch.bfloat16) * expand_group(weight_scale, group_size) + expand_group(weight_zp, group_size)

linear_a16.weight = weight_a16
linear_w2a16.weight = weight_w2
linear_w2a16.scale = weight_scale
linear_w2a16.zp = weight_zp

input = torch.rand((batch, in_features), dtype=torch.bfloat16).cuda()

out_a16 = linear_a16(input)
out_w2a16 = linear_w2a16(input)

print(out_a16)
print(out_w2a16)

You can run the test program above with the shapes used in your model and check whether the results are correct.

[screenshot: output of the test program]
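To quantify a mismatch rather than eyeballing the printed tensors, one could append something like the following to the test program above; the tolerances are a rough guess for bfloat16 accumulation, not values from the repo:

```python
# Compare the two outputs numerically instead of by inspection.
diff = (out_a16.float() - out_w2a16.float()).abs()
print("max abs diff:", diff.max().item())
print("allclose:", torch.allclose(out_a16.float(), out_w2a16.float(), rtol=1e-2, atol=1e-1))
```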

XLuoxing commented 1 month ago

@GuoYi0 I checked that: dequantizing true_quant (weight * scale + zero) and converting to bfloat16 does match the corresponding values in fake_quant.

XLuoxing commented 1 month ago

@gavinchen430 I ran your test program with my model's shapes: the shapes of the w2a16 and a16 outputs match, but the values do not.

[screenshot: test program output, values do not match]

Build log:

-- Found Torch: **/python3.11/site-packages/torch/lib/libtorch.so
f"python3 found" f"torch found" -- Configuring done (11.4s) -- Generating done (0.1s) -- Build files have been written to: ****/decoupleQ/csrc/build

Remark: The warnings can be suppressed with "-diag-suppress "

****/decoupleQ/csrc/w2a16.cu(60): warning #177-D: variable "res_ptr" was declared but never referenced
    const T* res_ptr = nullptr;
             ^
detected during:
    instantiation of "TrtllmFpAIntBGemm<T, WeightType, QuantOp>::TrtllmFpAIntBGemm() [with T=half, WeightType=cutlass::uint2b_t, QuantOp=cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS]" at line 962 of /usr/include/c++/11/bits/unique_ptr.h
    instantiation of "std::_MakeUniq<_Tp>::__single_object std::make_unique<_Tp,_Args...>(_Args &&...) [with _Tp=TrtllmFpAIntBGemm<half, cutlass::uint2b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS>, _Args=<>]" at line 104

****/decoupleQ/csrc/w2a16.cu(153): warning #177-D: variable "quant_range_scale" was declared but never referenced
    const float quant_range_scale = 1.f / float(1 << (bits_in_type - 1));
                ^

Remark: The warnings can be suppressed with "-diag-suppress "

****/decoupleQ/csrc/w2a16.cu(60): warning #177-D: variable "res_ptr" was declared but never referenced
    const T* res_ptr = nullptr;
             ^
detected during:
    instantiation of "TrtllmFpAIntBGemm<T, WeightType, QuantOp>::TrtllmFpAIntBGemm() [with T=half, WeightType=cutlass::uint2b_t, QuantOp=cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS]" at line 962 of /usr/include/c++/11/bits/unique_ptr.h
    instantiation of "std::_MakeUniq<_Tp>::__single_object std::make_unique<_Tp,_Args...>(_Args &&...) [with _Tp=TrtllmFpAIntBGemm<half, cutlass::uint2b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS>, _Args=<>]" at line 104

[ 91%] Linking CUDA device code CMakeFiles/decoupleQ_kernels.dir/cmake_device_link.o
[100%] Linking CXX shared library libdecoupleQ_kernels.so
[100%] Built target decoupleQ_kernels

gavinchen430 commented 1 month ago

Are you testing on an A10? From the build log, the code was compiled for the sm_86 architecture and everything looks normal. On my side the test also passes on an A10. Build environment: cmake 3.26.3, nvcc 12.2, torch 2.1.0, GPU driver 510.47.03.
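A quick way to check which device and compute capability the runtime actually sees, to compare against the sm_86 code the build produced; a generic PyTorch snippet, not decoupleQ-specific:

```python
import torch

# An A10 should report compute capability (8, 6), i.e. sm_86.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
```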