NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Error: MOE-FP8 quantize Integer divide-by-zero in H20 (llama-70B fp8 quantize is fine) #1980

Open joerong666 opened 1 month ago

joerong666 commented 1 month ago
> closed, confirmed that it was fixed in 0.11.0.dev2024060400

Hi @hijkzzz, I met the same problem with MoE (both 8x22B and 8x7B) FP8 quantization on H20, even after upgrading to 0.11.0.dev2024060400 (and also with v0.12.0.dev2024070900). However, llama-70B FP8 quantization works fine. Here is my environment:

NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4 

pip list |grep tensor
cutlass_library          3.5.0                /app/tensorrt_llm/3rdparty/cutlass/python
safetensors              0.4.3
tensorrt                 10.0.1
tensorrt_llm             0.11.0.dev2024060400

pip list |grep cuda
cuda-python              12.5.0
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
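
Since the backtrace below points into the cuBLAS libraries bundled with the torch wheel (libcublas.so.12 / libcublasLt.so.12 under torch/../nvidia/cublas) rather than into TensorRT-LLM itself, it is also worth recording which cuBLAS build PyTorch pulls in. A quick way to check, assuming the standard pip wheel layout:

pip list |grep cublas
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"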

Here is my quantization command:

python3 /app/tensorrt_llm/examples/quantization/quantize.py --model_dir /online/corefile/llm-benchmark/data/raw/Mixtral-8x22B-v0.1 --output_dir /tmp/llm/tensorrt/tmp/trt-build/trt_checkpoint/open-trt-Mixtral-8x22B-v0.1-fp8-A8W8C8/v0.11.0.dev2024060400/1/tp8 --dtype float16 --qformat fp8 --calib_size 512 --tp_size 8 --pp_size 1 --kv_cache_dtype fp8

Any suggestion is appreciated!

Below is the detailed error:

> Caught signal 8 (Floating point exception: integer divide by zero)
>  backtrace (tid:    157) 
>  0 0x0000000000042520 __sigaction()  ???:0
>  1 0x0000000000a0bc59 cublasLt_for_cublas_ZZZ()  ???:0
>  2 0x0000000000814383 cublasLt_for_cublas_ZZZ()  ???:0
>  3 0x00000000006ace72 cublasLtLegacyGemmUtilizationZZZ()  ???:0
>  4 0x00000000007aa087 cublasLtMatmulAlgoCheck()  ???:0
>  5 0x00000000007ab055 cublasLtMatmulAlgoCheck()  ???:0
>  6 0x00000000007abd2e cublasLtMatmulAlgoCheck()  ???:0
>  7 0x00000000007bd046 cublasLtHSHMatmulAlgoGetHeuristic()  ???:0
>  8 0x000000000085d43a cublasXerbla()  ???:0
>  9 0x000000000085deec cublasXerbla()  ???:0
> 10 0x0000000000860122 cublasXerbla()  ???:0
> 11 0x00000000008432ef cublasXerbla()  ???:0
> 12 0x0000000000ac7ecf cublasUint8gemmBias()  ???:0
> 13 0x0000000000ac83d8 cublasUint8gemmBias()  ???:0
> 14 0x00000000003e1c7d cublasGemmEx()  ???:0
> 15 0x0000000003593c91 at::cuda::blas::gemm_internal<c10::Half>()  :0
> 16 0x000000000359c6e7 at::cuda::blas::gemm<c10::Half>()  :0
> 17 0x00000000035fffa4 at::native::(anonymous namespace)::addmm_out_cuda_impl()  Blas.cpp:0
> 18 0x000000000360048a at::native::structured_mm_out_cuda::impl()  ???:0
> 19 0x000000000334a222 at::(anonymous namespace)::wrapper_CUDA_mm()  RegisterCUDA.cpp:0
> 20 0x000000000334a2e0 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CUDA_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call()  RegisterCUDA.cpp:0
> 21 0x00000000029c95fe at::_ops::mm::redispatch()  ???:0
> 22 0x00000000047eb2d3 torch::autograd::VariableType::(anonymous namespace)::mm()  VariableType_3.cpp:0
> 23 0x00000000047ebeb3 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call()  VariableType_3.cpp:0
> 24 0x0000000002a1b2be at::_ops::mm::call()  ???:0
> 25 0x0000000001d8c020 at::native::_matmul_impl()  LinearAlgebra.cpp:0
> 26 0x0000000001d94d09 at::native::matmul()  ???:0
> 27 0x0000000002fdd5e0 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__matmul>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call()  RegisterCompositeImplicitAutograd.cpp:0
> 28 0x0000000002b4abde at::_ops::matmul::call()  ???:0
> 29 0x0000000001d7b9c3 at::native::linear()  ???:0
> 30 0x0000000002fdd373 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__linear>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&> >, at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&)>::call()  RegisterCompositeImplicitAutograd.cpp:0
> 31 0x000000000255af6c at::_ops::linear::call()  ???:0
> 32 0x00000000006fe555 torch::autograd::THPVariable_linear()  python_nn_functions.cpp:0
> 33 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
> 34 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
> 35 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
> 36 0x000000000016893e PyMethod_New()  ???:0
> 37 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
> 38 0x000000000016893e PyMethod_New()  ???:0
> 39 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
> 40 0x000000000016893e PyMethod_New()  ???:0
> 41 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
> 42 0x00000000001560d8 _PyType_LookupId()  ???:0
> 43 0x0000000000276ad8 PyEval_GetLocals()  ???:0
> 44 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
> 45 0x000000000016893e PyMethod_New()  ???:0
> 46 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
> 47 0x000000000014fc14 _PyObject_FastCallDictTstate()  ???:0
> 48 0x000000000016586c _PyObject_Call_Prepend()  ???:0
> 49 0x0000000000280700 PyInit__datetime()  ???:0
> 50 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
> 51 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
> 52 0x000000000016893e PyMethod_New()  ???:0
> 53 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
> 54 0x00000000001560d8 _PyType_LookupId()  ???:0
> 55 0x0000000000276ad8 PyEval_GetLocals()  ???:0
> 56 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0

> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] *** Process received signal ***
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] Signal: Floating point exception (8)
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] Signal code:  (-6)
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] Failing at address: 0x9d
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f258a4b6520]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 1] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0xa0bc59)[0x7f24eea0bc59]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 2] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x814383)[0x7f24ee814383]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 3] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x6ace72)[0x7f24ee6ace72]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 4] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7aa087)[0x7f24ee7aa087]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 5] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab055)[0x7f24ee7ab055]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 6] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7abd2e)[0x7f24ee7abd2e]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 7] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(cublasLtHSHMatmulAlgoGetHeuristic+0x516)[0x7f24ee7bd046]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 8] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12(+0x85d43a)[0x7f251085d43a]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [ 9] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12(+0x85deec)[0x7f251085deec]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [10] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12(+0x860122)[0x7f2510860122]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [11] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12(+0x8432ef)[0x7f25108432ef]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [12] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12(+0xac7ecf)[0x7f2510ac7ecf]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [13] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12(+0xac83d8)[0x7f2510ac83d8]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [14] /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12(cublasGemmEx+0x13d)[0x7f25103e1c7d]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [15] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0x3593c91)[0x7f253fa38c91]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [16] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0x359c6e7)[0x7f253fa416e7]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [17] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0x35fffa4)[0x7f253faa4fa4]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [18] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(_ZN2at6native22structured_mm_out_cuda4implERKNS_6TensorES4_S4_+0x4a)[0x7f253faa548a]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [19] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0x334a222)[0x7f253f7ef222]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [20] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0x334a2e0)[0x7f253f7ef2e0]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [21] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN2at4_ops2mm10redispatchEN3c1014DispatchKeySetERKNS_6TensorES6_+0x6e)[0x7f2572d1e5fe]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [22] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x47eb2d3)[0x7f2574b402d3]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [23] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x47ebeb3)[0x7f2574b40eb3]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [24] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN2at4_ops2mm4callERKNS_6TensorES4_+0x15e)[0x7f2572d702be]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [25] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x1d8c020)[0x7f25720e1020]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [26] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN2at6native6matmulERKNS_6TensorES3_+0x49)[0x7f25720e9d09]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [27] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x2fdd5e0)[0x7f25733325e0]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [28] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN2at4_ops6matmul4callERKNS_6TensorES4_+0x15e)[0x7f2572e9fbde]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] [29] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN2at6native6linearERKNS_6TensorES3_RKSt8optionalIS1_E+0x283)[0x7f25720d09c3]
> [llm-bm-infer-service-prod-1721282032835mehv-3447516585:00157] *** End of error message ***
> /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
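
Judging from frames 7 and 15–32 of the backtrace, the crash does not happen in a TensorRT-LLM kernel: a plain fp16 torch.nn.functional.linear during calibration dies inside the cublasLt heuristic (cublasLtHSHMatmulAlgoGetHeuristic). If that reading is right, a minimal script along these lines (shapes invented for illustration; it should only crash on an affected H20 + cuBLAS combination) would hit the same code path without going through quantize.py:

python3 - <<'EOF'
# Minimal sketch: drive an fp16 linear through the same
# torch -> cublasLt heuristic path shown in the backtrace.
import torch

x = torch.randn(8, 6144, dtype=torch.float16, device="cuda")     # activations (invented shape)
w = torch.randn(6144, 6144, dtype=torch.float16, device="cuda")  # weight (invented shape)
y = torch.nn.functional.linear(x, w)  # frames 29-32: linear -> matmul -> mm -> cublas gemm
torch.cuda.synchronize()
print(y.shape)
EOF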

Originally posted by @joerong666 in https://github.com/NVIDIA/TensorRT-LLM/issues/1645#issuecomment-2235833820

Malfurionzz commented 1 month ago

Hi there, I met the same problem. Building PyTorch from source (CUDA 12.4) worked for me. You could give it a try~
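
For reference, a rough sketch of such a build (paths and the exact install step are assumptions to adapt to your image; H20 is Hopper, so sm_90):

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
pip install -r requirements.txt
export CUDA_HOME=/usr/local/cuda-12.4   # assumes the CUDA 12.4 toolkit is installed here
export TORCH_CUDA_ARCH_LIST="9.0"       # H20 (Hopper)
python setup.py develop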

zxs789 commented 1 month ago

> Hi there, I met the same problem. Building PyTorch from source (CUDA 12.4) worked for me. You could give it a try~

Hi, have you met this error with cuda_graph (#1948)?