NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Fail to build int4_awq on Mixtral 8x7b #1580

Open gloritygithub11 opened 2 months ago

gloritygithub11 commented 2 months ago

System Info

ubuntu 20.04 tensorrt 10.0.1 tensorrt-cu12 10.0.1 tensorrt-cu12-bindings 10.0.1 tensorrt-cu12-libs 10.0.1 tensorrt-llm 0.10.0.dev2024050700

Who can help?

@Tracin

Information

Tasks

Reproduction


set -e

export MODEL_DIR=/mnt/memory
export MODEL_NAME=Mixtral-8x7B-Instruct-v0.1
export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/tensorrt/bin:$PATH
export QUANTIZE=int4_awq
export DTYPE=bfloat16
export PYTHONPATH=/app/tensorrt-llm:$PYTHONPATH

python ../quantization/quantize.py \
     --model_dir $MODEL_DIR/${MODEL_NAME} \
     --output_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
     --dtype $DTYPE \
     --qformat $QUANTIZE \
     --calib_size 256 \
     --batch_size 8

# export CUDA_VISIBLE_DEVICES=0

trtllm-build \
    --checkpoint_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
    --output_dir $MODEL_DIR/tmp/trt_engines/${MODEL_NAME}/$QUANTIZE/1-gpu \
    --gemm_plugin $DTYPE \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 1024
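
Before running trtllm-build, it can help to confirm the quantized checkpoint was written as expected; a minimal sanity-check sketch (the config.json filename and keys are assumptions about the checkpoint layout, not verified against this version):

import json, os

# $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu from the script above
ckpt_dir = "/mnt/memory/tmp/trt_models/Mixtral-8x7B-Instruct-v0.1/int4_awq/1-gpu"
with open(os.path.join(ckpt_dir, "config.json")) as f:
    cfg = json.load(f)
# Print the quantization block (e.g. quant_algo, group_size) if the key exists.
print(cfg.get("quantization"))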

Expected behavior

trtllm-build should complete successfully.

Actual behavior

trtllm-build failed with the following error:

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/12/2024-03:05:39] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lookup_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lora_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set moe_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set remove_input_padding to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multi_block_mode to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set enable_xqa to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multiple_profiles to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_state to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set streamingllm to False.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py:964: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
  weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
  File "/app/venv_dev/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 486, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 370, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 329, in build_and_save
    engine = build_model(build_config,
  File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 305, in build_model
    model = load_model(rank_config, ckpt_dir, model_cls)
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 1100, in load_model
    preprocess_weights(weights, model_config)
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 964, in preprocess_weights
    weights[name] = preprocessor(param.T.contiguous(),
  File "/app/venv_dev/lib/python3.10/site-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 7168 and num_col_bytes = 8. (/app/tensorrt-llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
1       0x7f597e9b665a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f5ba6d945dd void tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose_impl<(tensorrt_llm::kernels::cutlass_kernels::QuantType)1>(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&) + 1085
3       0x7f5ba6d93735 tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType) + 101
4       0x7f5ba6d93a4a tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 714
5       0x7f5ba6d6d7f4 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 596
6       0x7f5ba6d7940a c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 138
7       0x7f5b08bcb818 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
8       0x7f5b0895c4f3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 451
9       0x7f5b0895cd41 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
10      0x7f5b08840833 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x848833) [0x7f5b08840833]
11      0x7f5b0840bea4 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x413ea4) [0x7f5b0840bea4]
12      0x53bd79 /app/venv_dev/bin/python3() [0x53bd79]
13      0x628a7b PyObject_Call + 491
14      0x5afa8e _PyEval_EvalFrameDefault + 24958
15      0x628d60 _PyFunction_Vectorcall + 592
16      0x62b899 _PyObject_FastCallDictTstate + 89
17      0x62b9ca _PyObject_Call_Prepend + 90
18      0x6e8da7 /app/venv_dev/bin/python3() [0x6e8da7]
19      0x629d24 _PyObject_MakeTpCall + 356
20      0x5ae9e9 _PyEval_EvalFrameDefault + 20697
21      0x628d60 _PyFunction_Vectorcall + 592
22      0x5a9c1b _PyEval_EvalFrameDefault + 779
23      0x628d60 _PyFunction_Vectorcall + 592
24      0x5a9c1b _PyEval_EvalFrameDefault + 779
25      0x628d60 _PyFunction_Vectorcall + 592
26      0x62893c PyObject_Call + 172
27      0x5ac51b _PyEval_EvalFrameDefault + 11275
28      0x628d60 _PyFunction_Vectorcall + 592
29      0x62893c PyObject_Call + 172
30      0x5ac51b _PyEval_EvalFrameDefault + 11275
31      0x628d60 _PyFunction_Vectorcall + 592
32      0x62893c PyObject_Call + 172
33      0x5ac51b _PyEval_EvalFrameDefault + 11275
34      0x628d60 _PyFunction_Vectorcall + 592
35      0x5a9c1b _PyEval_EvalFrameDefault + 779
36      0x5a8bf1 /app/venv_dev/bin/python3() [0x5a8bf1]
37      0x6d77cf PyEval_EvalCode + 127
38      0x6bb91b /app/venv_dev/bin/python3() [0x6bb91b]
39      0x6bb9a4 /app/venv_dev/bin/python3() [0x6bb9a4]
40      0x6bbde6 /app/venv_dev/bin/python3() [0x6bbde6]
41      0x6c0c84 _PyRun_SimpleFileObject + 404
42      0x6c0d57 _PyRun_AnyFileObject + 71
43      0x7042dd Py_RunMain + 877
44      0x7044bd Py_BytesMain + 45
45      0x7f5bab4e4083 __libc_start_main + 243
46      0x62ff4e _start + 46
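
For reference, the assertion checks byte-level alignment of the packed int4 weights; a minimal standalone sketch of that arithmetic (the example shape is hypothetical, chosen only because it reproduces the reported 7168 / 8 byte counts):

def packed_bytes(num_elements, bits=4):
    # Bytes occupied by num_elements values packed at `bits` bits each (int4 -> 2 values per byte).
    return num_elements * bits // 8

def is_aligned(rows, cols, bits=4, align=32):
    # The preprocessor requires both packed byte counts to be multiples of 32.
    return packed_bytes(rows, bits) % align == 0 and packed_bytes(cols, bits) % align == 0

# Hypothetical weight slice of 14336 x 16 int4 values reproduces the reported numbers:
# num_rows_bytes = 7168, num_col_bytes = 8 -> 8 is not a multiple of 32, so the assertion fires.
print(packed_bytes(14336), packed_bytes(16), is_aligned(14336, 16))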

additional notes

N/A

byshiue commented 2 months ago

Thank you for the report. INT4 AWQ is not supported for MoE models.

gloritygithub11 commented 2 months ago

Thanks @byshiue for the response. Will it be supported at some point in the future?

byshiue commented 2 months ago

We are working on the feature. We will update here if the feature is supported.

gloritygithub11 commented 1 month ago

@byshiue is there an expected date for this support?

nv-guomingz commented 1 month ago

Hi @gloritygithub11, could you please try applying int4_awq to Mixtral with the latest code base, specifically using a modelopt 0.11+ version?
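
A quick, illustrative way to confirm which ModelOpt build is installed before retrying (the nvidia-modelopt package name is taken from the pip listing later in this thread):

import importlib.metadata as md

# Illustrative check only; prints the installed nvidia-modelopt version.
print(md.version("nvidia-modelopt"))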

gloritygithub11 commented 1 month ago

Hi @nv-guomingz, I still get a similar error:

set -ex

export MODEL_DIR=/models
export MODEL_NAME=Mixtral-8x7B-Instruct-v0.1
export QUANTIZE=int4_awq
export DTYPE=float16
export TORCH_CUDA_ARCH_LIST="8.0"

python3 ../quantization/quantize.py \
     --model_dir $MODEL_DIR/${MODEL_NAME} \
     --output_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
     --dtype $DTYPE \
     --qformat $QUANTIZE \
     --awq_block_size 128 \
     --calib_size 32

export CUDA_VISIBLE_DEVICES=0

trtllm-build \
    --checkpoint_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
    --output_dir $MODEL_DIR/tmp/trt_engines/${MODEL_NAME}/$QUANTIZE/1-gpu \
    --gemm_plugin $DTYPE \
    --max_batch_size 1 \
    --max_input_len 1024 \
    --max_output_len 2048
[06/03/2024-10:35:24] [TRT-LLM] [W] Found pynvml==11.5.0 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Initializing model from /models/Mixtral-8x7B-Instruct-v0.1
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 19/19 [00:15<00:00,  1.21it/s]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.bfloat16.
Initializing tokenizer from /models/Mixtral-8x7B-Instruct-v0.1

AWQ calibration could take longer than other calibration methods. Please increase the batch size to speed up the calibration process. Batch size can be set by adding the argument --batch_size <batch_size> to the command line.

Loading calibration dataset
Downloading readme: 100%|█████████████████████████████████████████████████████████████████████| 15.6k/15.6k [00:00<00:00, 24.3MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 257M/257M [00:32<00:00, 7.99MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 257M/257M [00:39<00:00, 6.55MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 259M/259M [00:40<00:00, 6.36MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████████| 34.7M/34.7M [00:04<00:00, 7.84MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████████| 30.0M/30.0M [00:03<00:00, 7.67MB/s]
Generating train split: 100%|████████████████████████████████████████████████████| 287113/287113 [00:03<00:00, 82266.08 examples/s]
Generating validation split: 100%|█████████████████████████████████████████████████| 13368/13368 [00:00<00:00, 88769.18 examples/s]
Generating test split: 100%|███████████████████████████████████████████████████████| 11490/11490 [00:00<00:00, 87596.23 examples/s]
Starting quantization...
Inserted 2787 quantizers
Caching activation statistics for awq_lite...
Calibrating batch 0
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Calibrating batch 4
Calibrating batch 5
Calibrating batch 6
Calibrating batch 7
Calibrating batch 8
Calibrating batch 9
Calibrating batch 10
Calibrating batch 11
Calibrating batch 12
Calibrating batch 13
Calibrating batch 14
Calibrating batch 15
Calibrating batch 16
Calibrating batch 17
Calibrating batch 18
Calibrating batch 19
Calibrating batch 20
Calibrating batch 21
Calibrating batch 22
Calibrating batch 23
Calibrating batch 24
Calibrating batch 25
Calibrating batch 26
Calibrating batch 27
Calibrating batch 28
Calibrating batch 29
Calibrating batch 30
Calibrating batch 31
Searching awq_lite parameters...
Calibrating batch 0
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py:163: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension modelopt_cuda_ext...
Loading extension modelopt_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  value = torch.tensor(value, device=self._pre_quant_scale.device)
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py:163: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Calibrating batch 4
Calibrating batch 5
Calibrating batch 6
Calibrating batch 7
Calibrating batch 8
Calibrating batch 9
Calibrating batch 10
Calibrating batch 11
Calibrating batch 12
Calibrating batch 13
Calibrating batch 14
Calibrating batch 15
Calibrating batch 16
Calibrating batch 17
Calibrating batch 18
Calibrating batch 19
Calibrating batch 20
Calibrating batch 21
Calibrating batch 22
Calibrating batch 23
Calibrating batch 24
Calibrating batch 25
Calibrating batch 26
Calibrating batch 27
Calibrating batch 28
Calibrating batch 29
Calibrating batch 30
Calibrating batch 31
Quantization done. Total time used: 711.29 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
current rank: 0, tp rank: 0, pp rank: 0
torch.distributed not initialized, assuming single world_size.
Quantized model exported to /models/tmp/trt_models/Mixtral-8x7B-Instruct-v0.1/int4_awq/1-gpu 
Total time used 221.65 s.
[06/03/2024-10:53:48] [TRT-LLM] [W] Found pynvml==11.5.0 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[06/03/2024-10:53:49] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set lookup_plugin to None.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set lora_plugin to None.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set moe_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set context_fmha to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set remove_input_padding to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set multi_block_mode to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set enable_xqa to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set multiple_profiles to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set paged_state to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set streamingllm to False.
[06/03/2024-10:53:49] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py:1047: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3675.)
  weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 499, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 379, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 338, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 310, in build_model
    model = load_model(rank_config, ckpt_dir, model_cls)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1184, in load_model
    preprocess_weights(weights,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1047, in preprocess_weights
    weights[name] = preprocessor(param.T.contiguous(),
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 7168 and num_col_bytes = 8. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
1       0x7f0f58ca8d43 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2       0x7f1185b15952 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so(+0x7a952) [0x7f1185b15952]
3       0x7f1185b178ed tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 813
4       0x7f1185af7e5e torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 590
5       0x7f1185afe576 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 118
6       0x7f1085754028 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
7       0x7f10854e78d1 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 449
8       0x7f10854e7fb1 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
9       0x7f10853ccb63 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x8d1b63) [0x7f10853ccb63]
10      0x7f1084f78e04 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x47de04) [0x7f1084f78e04]
11      0x561419ffd10e /usr/bin/python3(+0x15a10e) [0x561419ffd10e]
12      0x56141a00c42b PyObject_Call + 187
13      0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
14      0x561419ff2c14 _PyObject_FastCallDictTstate + 196
15      0x56141a00886c _PyObject_Call_Prepend + 92
16      0x56141a123700 /usr/bin/python3(+0x280700) [0x56141a123700]
17      0x561419ff3a7b _PyObject_MakeTpCall + 603
18      0x561419fec096 _PyEval_EvalFrameDefault + 25830
19      0x561419ffd9fc _PyFunction_Vectorcall + 124
20      0x561419fe753c _PyEval_EvalFrameDefault + 6540
21      0x561419ffd9fc _PyFunction_Vectorcall + 124
22      0x561419fe626d _PyEval_EvalFrameDefault + 1725
23      0x561419ffd9fc _PyFunction_Vectorcall + 124
24      0x56141a00c492 PyObject_Call + 290
25      0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
26      0x561419ffd9fc _PyFunction_Vectorcall + 124
27      0x56141a00c492 PyObject_Call + 290
28      0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
29      0x561419ffd9fc _PyFunction_Vectorcall + 124
30      0x56141a00c492 PyObject_Call + 290
31      0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
32      0x561419ffd9fc _PyFunction_Vectorcall + 124
33      0x561419fe626d _PyEval_EvalFrameDefault + 1725
34      0x561419fe29c6 /usr/bin/python3(+0x13f9c6) [0x561419fe29c6]
35      0x56141a0d8256 PyEval_EvalCode + 134
36      0x56141a103108 /usr/bin/python3(+0x260108) [0x56141a103108]
37      0x56141a0fc9cb /usr/bin/python3(+0x2599cb) [0x56141a0fc9cb]
38      0x56141a102e55 /usr/bin/python3(+0x25fe55) [0x56141a102e55]
39      0x56141a102338 _PyRun_SimpleFileObject + 424
40      0x56141a101f83 _PyRun_AnyFileObject + 67
41      0x56141a0f4a5e Py_RunMain + 702
42      0x56141a0cb02d Py_BytesMain + 45
43      0x7f118f6c7d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f118f6c7d90]
44      0x7f118f6c7e40 __libc_start_main + 128
45      0x56141a0caf25 _start + 37

PS: lib versions

tensorrt                 10.0.1
tensorrt-cu12            10.0.1
tensorrt-cu12-bindings   10.0.1
tensorrt-cu12-libs       10.0.1
tensorrt-llm             0.11.0.dev2024052800
nvidia-modelopt          0.11.2

nv-guomingz commented 1 month ago

Hi @gloritygithub11, we'll take a look first and send an update later.

gloritygithub11 commented 1 month ago

Hi @nv-guomingz, is there an update on the issue?

ChristianPala commented 1 month ago

Good morning @nv-guomingz! We are also waiting for the fix. Cheers.

nv-guomingz commented 1 month ago

> Good morning @nv-guomingz! We are also waiting for the fix. Cheers.

@ChristianPala sorry for the late response; we were on holiday from 6/21~6/22. We'll send you an update next week ASAP.

nv-guomingz commented 4 weeks ago

@ChristianPala we can reproduce this issue on our side and are working on solving it by adding zero padding.
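
A rough sketch of the zero-padding idea (illustrative only, not the actual TensorRT-LLM fix; the alignment of 64 int4 values = 32 bytes is assumed from the assertion message):

import torch
import torch.nn.functional as F

def pad_to_alignment(weight, align_elems=64):
    # Zero-pad both dims up to a multiple of align_elems so that the packed int4
    # byte counts (elements / 2) become multiples of 32. Illustrative only.
    rows, cols = weight.shape
    pad_rows = (-rows) % align_elems
    pad_cols = (-cols) % align_elems
    return F.pad(weight, (0, pad_cols, 0, pad_rows))  # pad order: (left, right, top, bottom)

w = torch.zeros(14336, 16)        # hypothetical shape matching the failing assertion
print(pad_to_alignment(w).shape)  # torch.Size([14336, 64])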

nv-guomingz commented 3 weeks ago

@Barry-Delaney would you please update the status of MoE int4_awq support?

Barry-Delaney commented 3 weeks ago

Hi @ChristianPala @gloritygithub11, int4_awq for MoE is not implemented yet. We are working on it and will update the status here once it's ready. Thanks for your patience!

gloritygithub11 commented 1 week ago

Hi @Barry-Delaney , any updates on this?