Open gloritygithub11 opened 2 months ago
Thank you for the report. INT4 AWQ is not supported on MoE models.
Thanks @byshiue for the response. Will it be supported at some point in the future?
We are working on the feature. We will update here once the feature is supported.
@byshiue is there an expected date on this support?
Hi @gloritygithub11, could you please try applying int4_awq to Mixtral with the latest code base? Specifically, please use modelopt version 0.11 or later.
Hi @nv-guomingz, I still get a similar error:
set -ex
export MODEL_DIR=/models
export MODEL_NAME=Mixtral-8x7B-Instruct-v0.1
export QUANTIZE=int4_awq
export DTYPE=float16
export TORCH_CUDA_ARCH_LIST="8.0"
python3 ../quantization/quantize.py \
--model_dir $MODEL_DIR/${MODEL_NAME} \
--output_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
--dtype $DTYPE \
--qformat $QUANTIZE \
--awq_block_size 128 \
--calib_size 32
export CUDA_VISIBLE_DEVICES=0
trtllm-build \
--checkpoint_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
--output_dir $MODEL_DIR/tmp/trt_engines/${MODEL_NAME}/$QUANTIZE/1-gpu \
--gemm_plugin $DTYPE \
--max_batch_size 1 \
--max_input_len 1024 \
--max_output_len 2048
[06/03/2024-10:35:24] [TRT-LLM] [W] Found pynvml==11.5.0 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Initializing model from /models/Mixtral-8x7B-Instruct-v0.1
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 19/19 [00:15<00:00, 1.21it/s]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.bfloat16.
Initializing tokenizer from /models/Mixtral-8x7B-Instruct-v0.1
AWQ calibration could take longer than other calibration methods. Please increase the batch size to speed up the calibration process. Batch size can be set by adding the argument --batch_size <batch_size> to the command line.
Loading calibration dataset
Downloading readme: 100%|█████████████████████████████████████████████████████████████████████| 15.6k/15.6k [00:00<00:00, 24.3MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 257M/257M [00:32<00:00, 7.99MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 257M/257M [00:39<00:00, 6.55MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 259M/259M [00:40<00:00, 6.36MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████████| 34.7M/34.7M [00:04<00:00, 7.84MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████████| 30.0M/30.0M [00:03<00:00, 7.67MB/s]
Generating train split: 100%|████████████████████████████████████████████████████| 287113/287113 [00:03<00:00, 82266.08 examples/s]
Generating validation split: 100%|█████████████████████████████████████████████████| 13368/13368 [00:00<00:00, 88769.18 examples/s]
Generating test split: 100%|███████████████████████████████████████████████████████| 11490/11490 [00:00<00:00, 87596.23 examples/s]
Starting quantization...
Inserted 2787 quantizers
Caching activation statistics for awq_lite...
Calibrating batch 0
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Calibrating batch 4
Calibrating batch 5
Calibrating batch 6
Calibrating batch 7
Calibrating batch 8
Calibrating batch 9
Calibrating batch 10
Calibrating batch 11
Calibrating batch 12
Calibrating batch 13
Calibrating batch 14
Calibrating batch 15
Calibrating batch 16
Calibrating batch 17
Calibrating batch 18
Calibrating batch 19
Calibrating batch 20
Calibrating batch 21
Calibrating batch 22
Calibrating batch 23
Calibrating batch 24
Calibrating batch 25
Calibrating batch 26
Calibrating batch 27
Calibrating batch 28
Calibrating batch 29
Calibrating batch 30
Calibrating batch 31
Searching awq_lite parameters...
Calibrating batch 0
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py:163: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension modelopt_cuda_ext...
Loading extension modelopt_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
value = torch.tensor(value, device=self._pre_quant_scale.device)
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py:163: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.register_buffer("_pre_quant_scale", torch.tensor(value))
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Calibrating batch 4
Calibrating batch 5
Calibrating batch 6
Calibrating batch 7
Calibrating batch 8
Calibrating batch 9
Calibrating batch 10
Calibrating batch 11
Calibrating batch 12
Calibrating batch 13
Calibrating batch 14
Calibrating batch 15
Calibrating batch 16
Calibrating batch 17
Calibrating batch 18
Calibrating batch 19
Calibrating batch 20
Calibrating batch 21
Calibrating batch 22
Calibrating batch 23
Calibrating batch 24
Calibrating batch 25
Calibrating batch 26
Calibrating batch 27
Calibrating batch 28
Calibrating batch 29
Calibrating batch 30
Calibrating batch 31
Quantization done. Total time used: 711.29 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
current rank: 0, tp rank: 0, pp rank: 0
torch.distributed not initialized, assuming single world_size.
Quantized model exported to /models/tmp/trt_models/Mixtral-8x7B-Instruct-v0.1/int4_awq/1-gpu
Total time used 221.65 s.
[06/03/2024-10:53:48] [TRT-LLM] [W] Found pynvml==11.5.0 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[06/03/2024-10:53:49] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set lookup_plugin to None.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set lora_plugin to None.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set moe_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set context_fmha to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set remove_input_padding to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set multi_block_mode to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set enable_xqa to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set multiple_profiles to False.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set paged_state to True.
[06/03/2024-10:53:49] [TRT-LLM] [I] Set streamingllm to False.
[06/03/2024-10:53:49] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py:1047: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3675.)
weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 499, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 379, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 338, in build_and_save
engine = build_model(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 310, in build_model
model = load_model(rank_config, ckpt_dir, model_cls)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1184, in load_model
preprocess_weights(weights,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1047, in preprocess_weights
weights[name] = preprocessor(param.T.contiguous(),
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
return self_._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 7168 and num_col_bytes = 8. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
1 0x7f0f58ca8d43 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7f1185b15952 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so(+0x7a952) [0x7f1185b15952]
3 0x7f1185b178ed tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 813
4 0x7f1185af7e5e torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 590
5 0x7f1185afe576 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 118
6 0x7f1085754028 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
7 0x7f10854e78d1 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 449
8 0x7f10854e7fb1 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
9 0x7f10853ccb63 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x8d1b63) [0x7f10853ccb63]
10 0x7f1084f78e04 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x47de04) [0x7f1084f78e04]
11 0x561419ffd10e /usr/bin/python3(+0x15a10e) [0x561419ffd10e]
12 0x56141a00c42b PyObject_Call + 187
13 0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
14 0x561419ff2c14 _PyObject_FastCallDictTstate + 196
15 0x56141a00886c _PyObject_Call_Prepend + 92
16 0x56141a123700 /usr/bin/python3(+0x280700) [0x56141a123700]
17 0x561419ff3a7b _PyObject_MakeTpCall + 603
18 0x561419fec096 _PyEval_EvalFrameDefault + 25830
19 0x561419ffd9fc _PyFunction_Vectorcall + 124
20 0x561419fe753c _PyEval_EvalFrameDefault + 6540
21 0x561419ffd9fc _PyFunction_Vectorcall + 124
22 0x561419fe626d _PyEval_EvalFrameDefault + 1725
23 0x561419ffd9fc _PyFunction_Vectorcall + 124
24 0x56141a00c492 PyObject_Call + 290
25 0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
26 0x561419ffd9fc _PyFunction_Vectorcall + 124
27 0x56141a00c492 PyObject_Call + 290
28 0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
29 0x561419ffd9fc _PyFunction_Vectorcall + 124
30 0x56141a00c492 PyObject_Call + 290
31 0x561419fe85d7 _PyEval_EvalFrameDefault + 10791
32 0x561419ffd9fc _PyFunction_Vectorcall + 124
33 0x561419fe626d _PyEval_EvalFrameDefault + 1725
34 0x561419fe29c6 /usr/bin/python3(+0x13f9c6) [0x561419fe29c6]
35 0x56141a0d8256 PyEval_EvalCode + 134
36 0x56141a103108 /usr/bin/python3(+0x260108) [0x56141a103108]
37 0x56141a0fc9cb /usr/bin/python3(+0x2599cb) [0x56141a0fc9cb]
38 0x56141a102e55 /usr/bin/python3(+0x25fe55) [0x56141a102e55]
39 0x56141a102338 _PyRun_SimpleFileObject + 424
40 0x56141a101f83 _PyRun_AnyFileObject + 67
41 0x56141a0f4a5e Py_RunMain + 702
42 0x56141a0cb02d Py_BytesMain + 45
43 0x7f118f6c7d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f118f6c7d90]
44 0x7f118f6c7e40 __libc_start_main + 128
45 0x56141a0caf25 _start + 37
PS: lib versions
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052800
nvidia-modelopt 0.11.2
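For readers who want to check the assertion by hand: the error message reports two byte counts and requires each to be a multiple of 32. Below is a minimal sketch of that condition in plain Python, using only the numbers printed in the error above. The helper name `check_alignment` is hypothetical; this is not the TensorRT-LLM implementation, only the arithmetic the message describes.

```python
# Hypothetical helper that mirrors the condition stated in the error message.
# It is NOT the TensorRT-LLM code, only the arithmetic the assertion describes.
ALIGNMENT = 32  # "must be a multiple of 32" in the assertion text


def check_alignment(num_rows_bytes: int, num_col_bytes: int,
                    alignment: int = ALIGNMENT) -> dict:
    """Return each byte count's remainder modulo the required alignment."""
    rows_remainder = num_rows_bytes % alignment
    cols_remainder = num_col_bytes % alignment
    return {
        "rows_remainder": rows_remainder,
        "cols_remainder": cols_remainder,
        "ok": rows_remainder == 0 and cols_remainder == 0,
    }


# Values copied from the RuntimeError in the log above.
result = check_alignment(num_rows_bytes=7168, num_col_bytes=8)
print(result)
```

The row byte count passes (7168 = 224 × 32), while the column byte count of 8 leaves a remainder of 8, so the column dimension alone is what trips the assertion.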
Hi @gloritygithub11, we'll take a look first and post an update later.
Hi @nv-guomingz, is there an update on the issue?
Good morning @nv-guomingz! We are also waiting for the fix. Cheers.
@ChristianPala sorry for the late response; we were on holiday from 6/21 to 6/22. We'll send you an update next week as soon as possible.
@ChristianPala we can reproduce this issue on our side, and we're working on solving it by adding zero padding.
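To illustrate what zero padding to a 32-byte boundary could look like, here is a small sketch in plain Python. It is an assumption-laden toy, not the fix the maintainers are working on: the helper names `padded_size` and `pad_rows_with_zeros` are hypothetical, the matrix is a list of packed byte rows, and only the row width is padded.

```python
# Toy illustration of zero padding to an alignment boundary.
# All names here are hypothetical; this is not TensorRT-LLM code.

def padded_size(num_bytes: int, alignment: int = 32) -> int:
    """Round num_bytes up to the next multiple of alignment."""
    return -(-num_bytes // alignment) * alignment


def pad_rows_with_zeros(rows: list[bytes], alignment: int = 32) -> list[bytes]:
    """Right-pad every row of packed bytes with zeros to an aligned width."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same width")
    target = padded_size(width, alignment)
    return [row + bytes(target - width) for row in rows]


# Toy matrix whose rows are 8 bytes wide, matching num_col_bytes = 8
# from the error message.
matrix = [bytes(range(8)) for _ in range(4)]
padded = pad_rows_with_zeros(matrix)
print(len(matrix[0]), "->", len(padded[0]))
```

Padding changes the stored shape, so whatever consumes the padded weights would also need to know the original width. How the eventual fix handles that is not stated in this thread.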
@Barry-Delaney would you please update the status of MoE int4_awq support?
Hi @ChristianPala @gloritygithub11, int4_awq for MoE is not implemented yet. We are working on the development and will update the status here once it's ready. Thanks for your patience!
Hi @Barry-Delaney, any updates on this?
System Info
ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.10.0.dev2024050700
Who can help?
@Tracin
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
trtllm-build should complete successfully.
actual behavior
trtllm-build failed with the following error:
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/12/2024-03:05:39] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lookup_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lora_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set moe_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set remove_input_padding to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multi_block_mode to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set enable_xqa to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multiple_profiles to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_state to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set streamingllm to False.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py:964: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
File "/app/venv_dev/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 486, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 370, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 329, in build_and_save
engine = build_model(build_config,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 305, in build_model
model = load_model(rank_config, ckpt_dir, model_cls)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 1100, in load_model
preprocess_weights(weights, model_config)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 964, in preprocess_weights
weights[name] = preprocessor(param.T.contiguous(),
File "/app/venv_dev/lib/python3.10/site-packages/torch/_ops.py", line 755, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 7168 and num_col_bytes = 8. (/app/tensorrt-llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
1 0x7f597e9b665a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7f5ba6d945dd void tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose_impl<(tensorrt_llm::kernels::cutlass_kernels::QuantType)1>(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&) + 1085
3 0x7f5ba6d93735 tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType) + 101
4 0x7f5ba6d93a4a tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 714
5 0x7f5ba6d6d7f4 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 596
6 0x7f5ba6d7940a c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 138
7 0x7f5b08bcb818 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
8 0x7f5b0895c4f3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 451
9 0x7f5b0895cd41 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
10 0x7f5b08840833 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x848833) [0x7f5b08840833]
11 0x7f5b0840bea4 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x413ea4) [0x7f5b0840bea4]
12 0x53bd79 /app/venv_dev/bin/python3() [0x53bd79]
13 0x628a7b PyObject_Call + 491
14 0x5afa8e _PyEval_EvalFrameDefault + 24958
15 0x628d60 _PyFunction_Vectorcall + 592
16 0x62b899 _PyObject_FastCallDictTstate + 89
17 0x62b9ca _PyObject_Call_Prepend + 90
18 0x6e8da7 /app/venv_dev/bin/python3() [0x6e8da7]
19 0x629d24 _PyObject_MakeTpCall + 356
20 0x5ae9e9 _PyEval_EvalFrameDefault + 20697
21 0x628d60 _PyFunction_Vectorcall + 592
22 0x5a9c1b _PyEval_EvalFrameDefault + 779
23 0x628d60 _PyFunction_Vectorcall + 592
24 0x5a9c1b _PyEval_EvalFrameDefault + 779
25 0x628d60 _PyFunction_Vectorcall + 592
26 0x62893c PyObject_Call + 172
27 0x5ac51b _PyEval_EvalFrameDefault + 11275
28 0x628d60 _PyFunction_Vectorcall + 592
29 0x62893c PyObject_Call + 172
30 0x5ac51b _PyEval_EvalFrameDefault + 11275
31 0x628d60 _PyFunction_Vectorcall + 592
32 0x62893c PyObject_Call + 172
33 0x5ac51b _PyEval_EvalFrameDefault + 11275
34 0x628d60 _PyFunction_Vectorcall + 592
35 0x5a9c1b _PyEval_EvalFrameDefault + 779
36 0x5a8bf1 /app/venv_dev/bin/python3() [0x5a8bf1]
37 0x6d77cf PyEval_EvalCode + 127
38 0x6bb91b /app/venv_dev/bin/python3() [0x6bb91b]
39 0x6bb9a4 /app/venv_dev/bin/python3() [0x6bb9a4]
40 0x6bbde6 /app/venv_dev/bin/python3() [0x6bbde6]
41 0x6c0c84 _PyRun_SimpleFileObject + 404
42 0x6c0d57 _PyRun_AnyFileObject + 71
43 0x7042dd Py_RunMain + 877
44 0x7044bd Py_BytesMain + 45
45 0x7f5bab4e4083 __libc_start_main + 243
46 0x62ff4e _start + 46
additional notes
N/A