NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Error Converting checkpoint after INT4AWQ quantization #13

Closed. christian-ci closed this issue 1 month ago.

christian-ci commented 1 month ago

Hi. I ran the command:

export HF_PATH="mistralai/Mixtral-8x7B-Instruct-v0.1"
scripts/huggingface_example.sh --type llama --model $HF_PATH --quant int4_awq --tp 4

on a node with 8 A100 GPUs. I set TP to 4 because I want to build the engine for 4 GPUs further down the pipeline. The model was quantized successfully, but when the script starts converting the checkpoint to a TensorRT-LLM engine it throws this error:

Quantized model exported to :/workspace/examples/llm_ptq/saved_models_Mixtral-8x7B-Instruct-v0_dense_int4_awq_tp4_pp1. Total time used 551.7474279403687s
+ '[' llama == mixtral ']'
+ '[' llama == llava ']'
+ echo 'Building tensorrt_llm engine from Model Optimizer-quantized model...'
Building tensorrt_llm engine from Model Optimizer-quantized model...
+ python modelopt_to_tensorrt_llm.py --model_config=/workspace/examples/llm_ptq/saved_models_Mixtral-8x7B-Instruct-v0_dense_int4_awq_tp4_pp1/config.json --engine_dir=/workspace/examples/llm_ptq/saved_models_Mixtral-8x7B-Instruct-v0_dense_int4_awq_tp4_pp1/llama_4x1xNVIDIA_A100-SXM4-80GB_input2048_output512_batch2_engine --tokenizer=mistralai/Mixtral-8x7B-Instruct-v0.1 --max_input_len=2048 --max_output_len=512 --max_batch_size=2 --num_build_workers=4 --enable_sparsity=false
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py:912: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
  weights[name] = preprocessor(param.T.contiguous(),
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py:912: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
  weights[name] = preprocessor(param.T.contiguous(),
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py:912: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
  weights[name] = preprocessor(param.T.contiguous(),
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/deploy/llm/model_config_trt.py", line 285, in _build_tensorrt_llm_rank
    success = build_and_save(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 268, in build_model
    model = load_model(rank_config, ckpt_dir, model_cls)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1040, in load_model
    preprocess_weights(weights, model_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 912, in preprocess_weights
    weights[name] = preprocessor(param.T.contiguous(),
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 1792 and num_col_bytes = 8. (/home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:258)
1       0x7fa4f8759dfb tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2       0x7fa65ad97a8f void tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose_impl<(tensorrt_llm::kernels::cutlass_kernels::QuantType)1>(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&) + 1039
3       0x7fa65ad970bd tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 797
4       0x7fa65ad7b4e1 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType) + 561
5       0x7fa65ad816a8 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 104
6       0x7fa70cf5e818 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
7       0x7fa70ccef4f3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 451
8       0x7fa70ccefd41 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
9       0x7fa70cbd3833 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x848833) [0x7fa70cbd3833]
10      0x7fa70c79eea4 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x413ea4) [0x7fa70c79eea4]
11      0x564076d1710e /usr/bin/python(+0x15a10e) [0x564076d1710e]
12      0x564076d2642b PyObject_Call + 187
13      0x564076d025d7 _PyEval_EvalFrameDefault + 10791
14      0x564076d0cc14 _PyObject_FastCallDictTstate + 196
15      0x564076d2286c _PyObject_Call_Prepend + 92
16      0x564076e3d700 /usr/bin/python(+0x280700) [0x564076e3d700]
17      0x564076d0da7b _PyObject_MakeTpCall + 603
18      0x564076d06096 _PyEval_EvalFrameDefault + 25830
19      0x564076d179fc _PyFunction_Vectorcall + 124
20      0x564076d0026d _PyEval_EvalFrameDefault + 1725
21      0x564076d179fc _PyFunction_Vectorcall + 124
22      0x564076d0026d _PyEval_EvalFrameDefault + 1725
23      0x564076d179fc _PyFunction_Vectorcall + 124
24      0x564076d26492 PyObject_Call + 290
25      0x564076d025d7 _PyEval_EvalFrameDefault + 10791
26      0x564076d179fc _PyFunction_Vectorcall + 124
27      0x564076d26492 PyObject_Call + 290
28      0x564076d025d7 _PyEval_EvalFrameDefault + 10791
29      0x564076d179fc _PyFunction_Vectorcall + 124
30      0x564076d26492 PyObject_Call + 290
31      0x564076d025d7 _PyEval_EvalFrameDefault + 10791
32      0x564076d179fc _PyFunction_Vectorcall + 124
33      0x564076d025d7 _PyEval_EvalFrameDefault + 10791
34      0x564076d179fc _PyFunction_Vectorcall + 124
35      0x564076d0045c _PyEval_EvalFrameDefault + 2220
36      0x564076d179fc _PyFunction_Vectorcall + 124
37      0x564076d0045c _PyEval_EvalFrameDefault + 2220
38      0x564076d179fc _PyFunction_Vectorcall + 124
39      0x564076d0026d _PyEval_EvalFrameDefault + 1725
40      0x564076d179fc _PyFunction_Vectorcall + 124
41      0x564076d0153c _PyEval_EvalFrameDefault + 6540
42      0x564076cfc9c6 /usr/bin/python(+0x13f9c6) [0x564076cfc9c6]
43      0x564076df2256 PyEval_EvalCode + 134
44      0x564076e1d108 /usr/bin/python(+0x260108) [0x564076e1d108]
45      0x564076e169cb /usr/bin/python(+0x2599cb) [0x564076e169cb]
46      0x564076e0fab1 PyRun_StringFlags + 129
47      0x564076e0f961 PyRun_SimpleStringFlags + 65
48      0x564076e0eb15 Py_RunMain + 885
49      0x564076de502d Py_BytesMain + 45
50      0x7fa70eb56d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fa70eb56d90]
51      0x7fa70eb56e40 __libc_start_main + 128
52      0x564076de4f25 _start + 37
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/deploy/llm/model_config_trt.py", line 120, in build_tensorrt_llm
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 1792 and num_col_bytes = 8. (/home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:258)

This error is thrown no matter which machine I use. I quantized on a node with 8x A100-80GB and it fails there, and building the engine on an A10 fails with the same error.
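
For reference, here is my reading of what the assertion checks, written out as a minimal Python sketch based only on the numbers in the error message, not on TRT-LLM internals: the CUTLASS preprocessor packs two INT4 values per byte and requires both byte dimensions of a weight to be multiples of 32. The reported num_col_bytes = 8 corresponds to a dimension of only 16 INT4 values, which does not match any per-rank GEMM shape of this model at TP 4, so it looks like some exported tensor is in a layout the llama preprocessing path does not expect.

# Minimal sketch of the alignment rule quoted in the assertion:
# two int4 values are packed per byte, and both byte counts must be
# multiples of 32.
def int4_packed_bytes(num_elements: int) -> int:
    return num_elements // 2

def passes_alignment(rows: int, cols: int) -> bool:
    num_rows_bytes = int4_packed_bytes(rows)
    num_col_bytes = int4_packed_bytes(cols)
    return num_rows_bytes % 32 == 0 and num_col_bytes % 32 == 0

# Values from the error: 1792 row bytes (3584 int4 values, which happens
# to equal intermediate_size 14336 / tp 4) and 8 col bytes (16 int4 values).
print(passes_alignment(3584, 16))          # False: 8 % 32 != 0
print(passes_alignment(14336 // 4, 4096))  # True: a regular per-rank GEMM shape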

christian-ci commented 1 month ago

Update: I thought the mistake was passing model type llama instead of mixtral, but the only difference is that for mixtral the library does not build the engine itself and instead directs you to use the TensorRT-LLM build command. I also tried:

trtllm-build --checkpoint_dir /shared/mixtral-8x7b-awq-instruct-base-trt-tp4 \
                 --output_dir /shared/mixtral-8x7b-awq-instruct-engine-trt-tp4 \
                 --gemm_plugin float16
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[05/21/2024-19:55:47] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set gemm_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set lookup_plugin to None.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set lora_plugin to None.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set moe_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set context_fmha to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set remove_input_padding to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set multi_block_mode to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set enable_xqa to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set tokens_per_block to 64.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set multiple_profiles to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set paged_state to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set streamingllm to False.
[05/21/2024-19:55:47] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/21/2024-19:55:47] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py:1013: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
  weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 496, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 377, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 336, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_model
    model = load_model(rank_config, ckpt_dir, model_cls)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1150, in load_model
    preprocess_weights(weights,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1013, in preprocess_weights
    weights[name] = preprocessor(param.T.contiguous(),
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 1792 and num_col_bytes = 8. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)

Same error. So the engine cannot be built from the checkpoint produced by this library's quantization. Here is the generated config.json:

{
    "producer": {
        "name": "modelopt",
        "version": "0.11.2"
    },
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
    "norm_epsilon": 1e-05,
    "vocab_size": 32000,
    "max_position_embeddings": 32768,
    "hidden_act": "swiglu",
    "use_parallel_embedding": true,
    "embedding_sharding_dim": 0,
    "quantization": {
        "quant_algo": "W4A16_AWQ",
        "kv_cache_quant_algo": "FP8",
        "group_size": 128,
        "has_zero_point": false,
        "pre_quant_scale": true,
        "exclude_modules": [
            "lm_head"
        ]
    },
    "mapping": {
        "world_size": 4,
        "tp_size": 4,
        "pp_size": 1
    },
    "head_size": 128,
    "intermediate_size": 14336,
    "position_embedding_type": "rope_gpt_neox",
    "share_embedding_table": false,
    "residual_mlp": false,
    "bias": false,
    "rotary_pct": 1.0,
    "rank": 0,
    "decoder": "llama",
    "rmsnorm": true,
    "lm_head_bias": false,
    "moe_num_experts": 8,
    "moe_top_k": 2
}
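
To narrow down which exported tensor has the 16-wide dimension behind num_col_bytes = 8, dumping the per-rank tensor shapes from the checkpoint should help. A rough sketch (the rank0.safetensors file name is an assumption based on the usual TRT-LLM checkpoint layout next to config.json):

# Hedged diagnostic sketch: list every tensor name, dtype and shape in one
# rank of the exported checkpoint so unexpected shapes can be spotted.
from safetensors import safe_open

CKPT = "/shared/mixtral-8x7b-awq-instruct-base-trt-tp4/rank0.safetensors"

with safe_open(CKPT, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(f"{name:80s} {str(tensor.dtype):12s} {tuple(tensor.shape)}")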
cjluo-omniml commented 1 month ago

I believe int4 awq for Mixtral is not yet supported in the public TRT LLM release. In the coming TRT LLM release, we will support fp8 first and maybe int4 awq later.

christian-ci commented 1 month ago

> I believe int4 awq for Mixtral is not yet supported in the public TRT LLM release. In the coming TRT LLM release, we will support fp8 first and maybe int4 awq later.

@cjluo-omniml OK, thanks for the info. But why support FP8 first, which is only supported by the Ada and Hopper architectures (much harder to get anywhere and more expensive), rather than int4_awq, which can run on any Ampere card? Those are far more available, like A10G on AWS or any A100.

cjluo-omniml commented 1 month ago

For LLM serving, especially enterprise LLM serving, large batch throughput is the focus for optimization. int4_awq is usually good at low batch size but the performance gain diminishes compared with FP8 when the batch size increases.

christian-ci commented 1 month ago

> For LLM serving, especially enterprise LLM serving, large batch throughput is the focus for optimization. int4_awq is usually good at low batch size but the performance gain diminishes compared with FP8 when the batch size increases.

@cjluo-omniml Yes, we are aware of this, but it is incredibly hard to get FP8-capable cards or nodes from most cloud providers. In our case that is AWS, where A10G inference nodes can actually be obtained, but getting an H100 node for inference workloads is almost impossible. I guarantee there are other companies like us in this situation. We are part of the NVIDIA Inception program, if that helps. Do you have any suggestions for using Tensor Parallelism and Flash Attention on A10Gs or A100s?

cjluo-omniml commented 1 month ago

For A10G, we recommend you try int8 smoothquant or int4 awq. For Mixtral support, so far the quantization development is focused on FP8 and int4 awq.

As to whether you can use Tensor Parallelism and Flash Attention on A10Gs or A100s: absolutely. I believe you can also try FP16 Mixtral with TRT LLM if you have enough GPU memory on a single node.
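
If it helps, the int8 SmoothQuant run should presumably be the same example-script invocation from earlier in this thread with the quant flag switched, assuming int8_sq is accepted for this model type (a sketch, not a verified command):

export HF_PATH="mistralai/Mixtral-8x7B-Instruct-v0.1"
scripts/huggingface_example.sh --type llama --model $HF_PATH --quant int8_sq --tp 4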

christian-ci commented 1 month ago

@cjluo-omniml Thanks. We will try FP16 because we are very sensitive to accuracy, and I guess we will wait for the int4_awq release and support so we can reduce the node size. We would really appreciate it if it is at least released on main, or in tandem with FP8 in the cut releases. Thanks again!