NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Failed to run convert_checkpoint.py with int8 weight-only quantization for Qwen2-72B-Instruct model #1833

Open frontword opened 3 days ago

frontword commented 3 days ago

System Info

CPU Architecture: x86_64
CPU/Host memory size: 1024 GiB (1.0 TiB)
GPU properties:
GPU name: NVIDIA GeForce RTX 4090
GPU memory size: 24 GB x 8 (192 GB)

Libraries:
TensorRT-LLM branch: 0.11.0.dev2024060400
TensorRT: 10.0.1
Transformers: 4.40.2
CUDA Version: 12.2
Driver Version: 535.146.02
OS: Ubuntu 22.04.4 LTS

Container used: built from the tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend (commit 39ba55a745266bbc50cf19af0f5dfcad1c939c12)

Who can help?

@Tracin

Reproduction

  1. step1: DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

  2. step2:
     docker run \
     -d \
     --name triton-tensorrt-llm \
     --net host \
     --ipc=host \
     --shm-size=128g \
     --ulimit memlock=-1 \
     --ulimit stack=67108864 \
     --gpus all \
     -v /home/app/sharedir/nlp/models:/nlp_models \
     -v /data1/workspace:/workspace \
     triton_trt_llm:latest sleep 8640000

  3. step3: docker exec -it triton-tensorrt-llm bash

  4. step4: cd /app/tensorrt_llm/examples/qwen

  5. step5:
     python3 convert_checkpoint.py \
     --model_dir /nlp_models/Qwen2-72B-Instruct \
     --dtype float16 \
     --qwen_type qwen2 \
     --tp_size 8 \
     --use_weight_only \
     --weight_only_precision int8 \
     --output_dir /workspace/models/trt_models/Qwen2-72B-Instruct/int8/8-gpu/

Expected behavior

I expect the HF model to be converted to a TensorRT-LLM checkpoint with int8 weight-only quantization successfully.

Actual behavior

root@l117-11-p-ga:/workspace/code/new/new/llm-server# cd /app/tensorrt_llm/examples/qwen
root@l117-11-p-ga:/app/tensorrt_llm/examples/qwen# python3 convert_checkpoint.py \
--model_dir /nlp_models/Qwen2-72B-Instruct \
--dtype float16 \
--qwen_type qwen2 \
--tp_size 8 \
--use_weight_only \
--weight_only_precision int8 \
--output_dir /workspace/models/trt_models/Qwen2-72B-Instruct/int8/8-gpu/

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
0.11.0.dev2024060400
Loading checkpoint shards: 100%|██████████| 37/37 [01:13<00:00, 1.98s/it]
[06/25/2024-10:29:57] We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map. Please make sure to update your driver to the latest version which resolves this.
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 376, in <module>
    main()
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 368, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 330, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 336, in execute
    f(args, rank)
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 316, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1124, in from_hugging_face
    weights = load_weights_from_hf(config=config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1232, in load_weights_from_hf
    weights = convert_hf_qwen(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 881, in convert_hf_qwen
    get_tllm_linear_weight(split_v, tllm_prex + 'mlp.gate.', None,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 486, in get_tllm_linear_weight
    torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix(
  File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 854, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 8192 and num_col_bytes = 3696.
(/app/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278) 1 0x7fc8c4e4d55a tensorrt_llm::common::throwRuntimeError(char const, int, std::string const&) + 102 2 0x7fcb7dc22fad /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so(+0x7afad) [0x7fcb7dc22fad] 3 0x7fcb7dc24dbd tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char, signed char const, std::vector<unsigned long, std::allocator > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 877 4 0x7fcb7dc2b427 void tensorrt_llm::kernels::cutlass_kernels::symmetric_quantize<half, half>(signed char, signed char, half*, half const, std::vector<unsigned long, std::allocator > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 1319 5 0x7fcb7dc023b5 torch_ext::symmetric_quantize_helper(at::Tensor, c10::ScalarType, bool) + 2293 6 0x7fcb7dc02621 torch_ext::symmetric_quantize_last_axis_of_batched_matrix(at::Tensor, c10::ScalarType) + 65 7 0x7fcb7dc0975d c10::impl::make_boxed_from_unboxedfunctor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor<std::vector<at::Tensor, std::allocator > ()(at::Tensor, c10::ScalarType), std::vector<at::Tensor, std::allocator >, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType> >, true>::call(c10::OperatorKernel, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator >) + 141 8 0x7fc9f1f54028 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator >*) const + 568 9 0x7fc9f1ce78d1 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr, std::allocator<std::shared_ptr > > const&, pybind11::args, pybind11::kwargs const&, std::optional) + 449 10 0x7fc9f1ce7fb1 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr, std::allocator<std::shared_ptr > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional) + 1329 11 0x7fc9f1bccb63 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x8d1b63) [0x7fc9f1bccb63] 12 0x7fc9f1778e04 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x47de04) [0x7fc9f1778e04] 13 0x56409b3cf10e python3(+0x15a10e) [0x56409b3cf10e] 14 0x56409b3de42b PyObject_Call + 187 15 0x56409b3ba5d7 _PyEval_EvalFrameDefault + 10791 16 0x56409b3c4c14 _PyObject_FastCallDictTstate + 196 17 0x56409b3da86c _PyObject_Call_Prepend + 92 18 0x56409b4f5700 python3(+0x280700) [0x56409b4f5700] 19 0x56409b3c5a7b _PyObject_MakeTpCall + 603 20 0x56409b3be629 _PyEval_EvalFrameDefault + 27257 21 0x56409b3cf9fc _PyFunction_Vectorcall + 124 22 0x56409b3b826d _PyEval_EvalFrameDefault + 1725 23 0x56409b3cf9fc _PyFunction_Vectorcall + 124 24 0x56409b3de492 PyObject_Call + 290 25 0x56409b3ba5d7 _PyEval_EvalFrameDefault + 10791 26 0x56409b3cf9fc _PyFunction_Vectorcall + 124 27 0x56409b3b953c _PyEval_EvalFrameDefault + 6540 28 0x56409b3cf9fc _PyFunction_Vectorcall + 124 29 0x56409b3b953c _PyEval_EvalFrameDefault + 6540 30 0x56409b3cf9fc _PyFunction_Vectorcall + 124 31 0x56409b3b826d _PyEval_EvalFrameDefault + 1725 32 0x56409b3cf9fc _PyFunction_Vectorcall + 124 33 0x56409b3b826d _PyEval_EvalFrameDefault + 1725 34 0x56409b3cf9fc _PyFunction_Vectorcall + 124 35 0x56409b3b826d _PyEval_EvalFrameDefault + 1725 36 0x56409b3cf9fc _PyFunction_Vectorcall + 124 37 0x56409b3b826d _PyEval_EvalFrameDefault + 1725 38 0x56409b3b49c6 python3(+0x13f9c6) [0x56409b3b49c6] 39 0x56409b4aa256 PyEval_EvalCode + 134 40 0x56409b4d5108 python3(+0x260108) [0x56409b4d5108] 41 
0x56409b4ce9cb python3(+0x2599cb) [0x56409b4ce9cb] 42 0x56409b4d4e55 python3(+0x25fe55) [0x56409b4d4e55] 43 0x56409b4d4338 _PyRun_SimpleFileObject + 424 44 0x56409b4d3f83 _PyRun_AnyFileObject + 67 45 0x56409b4c6a5e Py_RunMain + 702 46 0x56409b49d02d Py_BytesMain + 45 47 0x7fcb884a9d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fcb884a9d90] 48 0x7fcb884a9e40 __libc_start_main + 128 49 0x56409b49cf25 _start + 37 root@l117-11-p-ga:/app/tensorrt_llm/examples/qwen#

Additional notes

There is no problem for the Qwen2-7B-Instruct model when running convert_checkpoint.py with int8 weight-only quantization.

nv-guomingz commented 3 days ago

@Barry-Delaney Would you please take a look at this issue?

Barry-Delaney commented 2 days ago

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 8192 and num_col_bytes = 3696.

@frontword thanks for the feedback. This is because the quantize OP requires the per-rank intermediate_size to be a multiple of 32, but the GEMM shape after splitting by TP8 (29568 / 8 = 3696 = 115.5 * 32) cannot satisfy it. Currently, the maximum supported TP for this model is TP4. We are going to add padding logic to solve similar issues in the future.
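For reference, a minimal Python sketch of that constraint (illustrative only, not TRT-LLM code; the helper names are made up, and the intermediate_size values are assumed from the Qwen2 HF configs):

# Illustration of the shape check behind the assertion above. With int8
# weight-only quantization each element is one byte, so the per-rank
# column count equals num_col_bytes in the error message.

def per_rank_cols(intermediate_size: int, tp_size: int) -> float:
    # Columns each tensor-parallel rank receives after splitting the MLP weight.
    return intermediate_size / tp_size

def tp_is_supported(intermediate_size: int, tp_size: int, align: int = 32) -> bool:
    # The quantize OP needs an integer column count that is a multiple of 32.
    cols = intermediate_size / tp_size
    return cols.is_integer() and int(cols) % align == 0

print(per_rank_cols(29568, 8))    # Qwen2-72B: 3696.0 -> matches num_col_bytes = 3696, not a multiple of 32
print(tp_is_supported(29568, 8))  # False -> TP8 hits the assertion
print(tp_is_supported(29568, 4))  # True  -> 29568 / 4 = 7392 = 231 * 32, so TP4 works
print(tp_is_supported(18944, 8))  # True  -> consistent with Qwen2-7B-Instruct (assuming its config's 18944) converting fine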

frontword commented 2 days ago

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 8192 and num_col_bytes = 3696.

@frontword thanks for the feedback. This is because the quantize OP requires the per-rank intermediate_size to be a multiple of 32, but the GEMM shape after splitting by TP8 (29568 / 8 = 3696 = 115.5 * 32) cannot satisfy it. Currently, the maximum supported TP for this model is TP4. We are going to add padding logic to solve similar issues in the future.

@Barry-Delaney thank you for your answer. I can quantize the model with int8 weight-only precision successfully using quantize.py with the command below. Which method is recommended, quantize.py or convert_checkpoint.py?

python3 ../quantization/quantize.py \
  --model_dir /nlp_models/Qwen2-72B-Instruct \
  --dtype float16 \
  --qformat int8_wo \
  --kv_cache_dtype int8 \
  --output_dir /workspace/models/trt_models/Qwen2-72B-Instruct/int8/8-gpu \
  --tp_size 8

Although the above command runs successfully, the same error appears when running trtllm-build:

trtllm-build \
  --checkpoint_dir /workspace/models/trt_models/Qwen2-72B-Instruct/int8/8-gpu \
  --output_dir /workspace/models/trt_engine/Qwen2-72B-Instruct/int8/8-gpu \
  --gemm_plugin float16 \
  --max_input_len 4096 \
  --max_output_len 1024 \
  --max_batch_size 4

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 489, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 368, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 327, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 299, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 409, in from_checkpoint
    preprocess_weights(weights,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1122, in preprocess_weights
    weights = weight_only_quantize_dict(weights=weights,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/convert_utils.py", line 77, in weight_only_quantize_dict
    quant_weight, quant_scale = weight_only_quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/convert_utils.py", line 57, in weight_only_quantize
    torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix(
  File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 854, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 8192 and num_col_bytes = 3696. (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
1 0x7f493624d55a tensorrt_llm::common::throwRuntimeError(char const, int, std::string const&) + 102
2 0x7f4beb021fad /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so(+0x7afad) [0x7f4beb021fad]
3 0x7f4beb023dbd tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char, signed char const, std::vector<unsigned long, std::allocator > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 877

Barry-Delaney commented 1 day ago

@frontword convert_checkpoint.py uses TRT-LLM's built-in conversion logic, while quantize.py calls ModelOpt for quantization. If you are using an INT8 KV cache, the first one won't work, as calibration is required. So for your case, quantize.py is recommended.

The conversion phase with ModelOpt does not check the tensors' shapes, which is why you hit the same assertion in the build phase. For now, to build the engine successfully, you still need to reduce the TP size or try padding the intermediate_size.
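Until official padding logic lands, here is a rough sketch of the idea (a hypothetical helper, not TRT-LLM's API; it assumes the usual HF Linear layout where gate_proj/up_proj weights are [intermediate_size, hidden_size] and down_proj is [hidden_size, intermediate_size]):

import torch.nn.functional as F

def round_up(x: int, multiple: int) -> int:
    # Round x up to the next multiple of `multiple`.
    return ((x + multiple - 1) // multiple) * multiple

def pad_mlp_weights(gate_w, up_w, down_w, tp_size: int, align: int = 32):
    # Hypothetical helper: zero-pad the intermediate dimension so every TP rank
    # ends up with a multiple of `align` columns.
    inter = gate_w.shape[0]
    padded = round_up(inter, tp_size * align)
    extra = padded - inter
    if extra == 0:
        return gate_w, up_w, down_w, inter
    gate_w = F.pad(gate_w, (0, 0, 0, extra))  # pad rows of [inter, hidden]
    up_w = F.pad(up_w, (0, 0, 0, extra))      # pad rows of [inter, hidden]
    down_w = F.pad(down_w, (0, extra))        # pad cols of [hidden, inter]
    return gate_w, up_w, down_w, padded

print(round_up(29568, 8 * 32))  # 29696: what Qwen2-72B's intermediate_size would pad to for TP8

Zero rows in gate_proj/up_proj produce zero activations and the matching zero columns in down_proj ignore them, so the padded MLP computes the same output; the intermediate_size recorded in the checkpoint config would also have to be updated to the padded value.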

frontword commented 1 day ago

After reducing the TP size to 4, both convert_checkpoint.py and quantize.py still fail. I need to investigate how to pad the intermediate_size.

Barry-Delaney commented 1 day ago

both convert_checkpoint.py and quantize.py still fail

Could you please provide the error log? Thx!