NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Question] Weight-only quantization doesn't seem to work: "Missing scale and zero-point", "Have not found any valid GEMM config" #1416

Closed siahuat0727 closed 3 months ago

siahuat0727 commented 3 months ago

System Info

Google Colab with a T4 GPU and CUDA 12.2. TensorRT-LLM version: 0.9.0.dev2024040200. Here is the minimum reproducible notebook on Google Colab.

Who can help?

@Tracin @byshiue

Reproduction

This follows the official example for bloom, but with the additional flags --use_weight_only --weight_only_precision int4:

python TensorRT-LLM/examples/bloom/convert_checkpoint.py --model_dir ./bloom/560M/ --dtype float16 --output_dir ./bloom/560M/trt_ckpt/int4/1-gpu/ --use_weight_only --weight_only_precision int4
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/int4/1-gpu/ --gemm_plugin float16 --output_dir ./bloom/560M/trt_engines/int4/1-gpu/
python TensorRT-LLM/examples/summarize.py --test_trt_llm --hf_model_dir ./bloom/560M/ --data_type fp16 --engine_dir ./bloom/560M/trt_engines/int4/1-gpu/

It works when convert_checkpoint.py is run without the --use_weight_only --weight_only_precision int4 flags. I added these two flags because I wanted to measure the latency of weight-only quantization on this task.
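For comparison, this is the float16 baseline that completes successfully: the same three commands with the two weight-only flags removed (the fp16 directory names below are just an illustrative choice, any output path works).

# float16 baseline, no weight-only quantization
python TensorRT-LLM/examples/bloom/convert_checkpoint.py --model_dir ./bloom/560M/ --dtype float16 --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/ --gemm_plugin float16 --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
python TensorRT-LLM/examples/summarize.py --test_trt_llm --hf_model_dir ./bloom/560M/ --data_type fp16 --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/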

Expected behavior

I expected examples/summarize.py to pass with the quantized TRT engine.

Actual behavior

2024-04-08 05:59:44.849993: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-08 05:59:44.850107: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-08 05:59:44.852199: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-08 05:59:46.356775: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
0.9.0.dev2024040200
[04/08/2024-05:59:50] [TRT-LLM] [I] Convert by using model
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Weights loaded. Total time: 00:00:07
Total time of converting checkpoints: 00:00:23
2024-04-08 06:00:26.886681: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-08 06:00:26.886741: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-08 06:00:26.888280: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-08 06:00:28.025336: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
[04/08/2024-06:00:31] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set lookup_plugin to None.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set lora_plugin to None.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set moe_plugin to float16.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set context_fmha to True.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set remove_input_padding to True.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set multi_block_mode to False.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set enable_xqa to True.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set multiple_profiles to False.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set paged_state to True.
[04/08/2024-06:00:31] [TRT-LLM] [I] Set streamingllm to False.
[04/08/2024-06:00:31] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/08/2024-06:00:31] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[04/08/2024-06:00:31] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/08/2024-06:00:31] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 267, GPU 103 (MiB)
[04/08/2024-06:00:37] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +978, GPU +180, now: CPU 1381, GPU 283 (MiB)
[04/08/2024-06:00:37] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
[04/08/2024-06:00:37] [TRT-LLM] [I] Set nccl_plugin to None.
[04/08/2024-06:00:37] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[TensorRT-LLM][WARNING] Fall back to unfused MHA because of unsupported head size 64 in sm_75.
[04/08/2024-06:00:38] [TRT] [W] IElementWiseLayer with inputs BloomForCausalLM/transformer/layers/0/mlp/SHUFFLE_0_output_0 and BloomForCausalLM/transformer/layers/0/mlp/fc/ELEMENTWISE_SUM_0_output_0: first input has type Float but second input has type Half.
[04/08/2024-06:00:38] [TRT] [W] IElementWiseLayer with inputs BloomForCausalLM/transformer/layers/0/mlp/fc/ELEMENTWISE_SUM_0_output_0 and BloomForCausalLM/transformer/layers/0/mlp/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/08/2024-06:00:38] [TRT] [W] IElementWiseLayer with inputs BloomForCausalLM/transformer/layers/0/mlp/SHUFFLE_2_output_0 and BloomForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_POW_0_output_0: first input has type Float but second input has type Half.
[04/08/2024-06:00:38] [TRT] [W] IElementWiseLayer with inputs BloomForCausalLM/transformer/layers/0/mlp/fc/ELEMENTWISE_SUM_0_output_0 and BloomForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_PROD_1_output_0: first input has type Half but second input has type Float.
[04/08/2024-06:00:38] [TRT] [W] IElementWiseLayer with inputs BloomForCausalLM/transformer/layers/0/mlp/SHUFFLE_3_output_0 and BloomForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_SUM_0_output_0: first input has type Float but second input has type Half
....
[04/08/2024-06:00:38] [TRT] [W] IElementWiseLayer with inputs BloomForCausalLM/transformer/layers/23/mlp/fc/ELEMENTWISE_SUM_0_output_0 and BloomForCausalLM/transformer/layers/23/mlp/ELEMENTWISE_PROD_1_output_0: first input has type Half but second input has type Float.
[04/08/2024-06:00:38] [TRT] [W] IElementWiseLayer with inputs BloomForCausalLM/transformer/layers/23/mlp/SHUFFLE_3_output_0 and BloomForCausalLM/transformer/layers/23/mlp/ELEMENTWISE_SUM_0_output_0: first input has type Float but second input has type Half.
[04/08/2024-06:00:38] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/08/2024-06:00:38] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[04/08/2024-06:00:38] [TRT] [W] Unused Input: position_ids
[04/08/2024-06:00:38] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor input_ids, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor position_ids, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
...
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/mlp/ELEMENTWISE_SUM_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/mlp/ELEMENTWISE_PROD_3_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/mlp/proj/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/mlp/proj/PLUGIN_V2_WeightOnlyQuantMatmul_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/mlp/proj/CONSTANT_2_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/mlp/proj/SHUFFLE_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/mlp/proj/ELEMENTWISE_SUM_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/layers/23/ELEMENTWISE_SUM_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/ln_f/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/ln_f/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/ln_f/SHUFFLE_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/ln_f/SHUFFLE_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/transformer/ln_f/NORMALIZATION_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/ELEMENTWISE_SUB_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/GATHER_2_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/SHUFFLE_4_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/SHUFFLE_9_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor BloomForCausalLM/lm_head/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] Missing scale and zero-point for tensor logits, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[04/08/2024-06:00:38] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/08/2024-06:00:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1398, GPU 301 (MiB)
[04/08/2024-06:00:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 1400, GPU 311 (MiB)
[04/08/2024-06:00:38] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/08/2024-06:00:38] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[04/08/2024-06:00:38] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/08/2024-06:01:00] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/08/2024-06:01:00] [TRT] [I] Detected 13 inputs and 1 output network tensors.
[04/08/2024-06:01:05] [TRT] [I] Total Host Persistent Memory: 57808
[04/08/2024-06:01:05] [TRT] [I] Total Device Persistent Memory: 0
[04/08/2024-06:01:05] [TRT] [I] Total Scratch Memory: 144707840
[04/08/2024-06:01:05] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 397 steps to complete.
[04/08/2024-06:01:05] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 50.0109ms to assign 14 blocks to 397 nodes requiring 157296128 bytes.
[04/08/2024-06:01:05] [TRT] [I] Total Activation Memory: 157295616
[04/08/2024-06:01:05] [TRT] [I] Total Weights Memory: 1179724288
[04/08/2024-06:01:05] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1524, GPU 1449 (MiB)
[04/08/2024-06:01:05] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1524, GPU 1459 (MiB)
[04/08/2024-06:01:05] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
void cutlass::gemm::kernel::GemmFpAIntB<Mma_, Epilogue_, ThreadblockSwizzle_, KernelArch, SplitKSerial>::run_kernel(const cutlass::gemm::kernel::GemmFpAIntB<Mma_, Epilogue_, ThreadblockSwizzle_, KernelArch, SplitKSerial>::Params &, cutlass::gemm::kernel::GemmFpAIntB<Mma_, Epilogue_, ThreadblockSwizzle_, KernelArch, SplitKSerial>::SharedStorage &) [with CompilationArch = cutlass::arch::Sm70; Mma_ = cutlass::gemm::threadblock::DqMmaPipelined<cutlass::gemm::GemmShape<16, 128, 64>, cutlass::transform::threadblock::PredicatedTileIterator<cutlass::MatrixShape<16, 64>, cutlass::half_t, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<64, 16>, 128, cutlass::PitchLinearShape<8, 4>, 8>, 8, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileIterator<cutlass::MatrixShape<16, 64>, cutlass::half_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 64>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<64, 16>, 128, cutlass::PitchLinearShape<8, 4>, 8>, 16>, 
...
cutlass::epilogue::threadblock::SharedLoadIteratorMixed<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 1, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 2, 1, 1, 2>, 128, 8, 16>::CompactedThreadMap, float, 32, 16, 8, 8, false>, cutlass::epilogue::thread::LinearCombination<cutlass::half_t, 8, float, float, cutlass::epilogue::thread::ScaleType::NoBetaScaling, cutlass::FloatRoundStyle::round_to_nearest, cutlass::half_t>, cutlass::MatrixShape<0, 8>, 2, 1>; ThreadblockSwizzle_ = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>; KernelArch = cutlass::arch::Sm75; __nv_bool SplitKSerial = true] not implemented
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=2, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=4, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=8, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=16, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=32, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=64, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=128, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=256, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=512, n=1536, k=1024). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1024, n=1536, k=1024). Will try to use default or fail at runtime
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Can't free tmp workspace for GEMM tactics profiling. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/gemmPluginProfiler.cpp:179)
1       0x7c49506fa305 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x51305) [0x7c49506fa305]
2       0x7c495082fa96 tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface>, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::freeTmpData() + 70
3       0x7c4950838fdb tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface>, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::profileTactics(std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface> const&, nvinfer1::DataType const&, tensorrt_llm::plugins::GemmDims const&, tensorrt_llm::plugins::GemmIdCore const&) + 987
4       0x7c4950815b39 tensorrt_llm::plugins::WeightOnlyQuantMatmulPlugin::initialize() + 9
5       0x7c4aed3c43c5 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0x10d63c5) [0x7c4aed3c43c5]
6       0x7c4aed370da2 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0x1082da2) [0x7c4aed370da2]
7       0x7c4aed15b42f /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xe6d42f) [0x7c4aed15b42f]
8       0x7c4aed15d31c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xe6f31c) [0x7c4aed15d31c]
9       0x7c4aed15f4cc /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xe714cc) [0x7c4aed15f4cc]
10      0x7c4aecd50dd7 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xa62dd7) [0x7c4aecd50dd7]
11      0x7c4aecd56331 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xa68331) [0x7c4aecd56331]
12      0x7c4aecd56eae /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xa68eae) [0x7c4aecd56eae]
13      0x7c4a988a5238 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa5238) [0x7c4a988a5238]
14      0x7c4a98843443 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x43443) [0x7c4a98843443]
15      0x5a1711c8410e /usr/bin/python3(+0x15a10e) [0x5a1711c8410e]
16      0x5a1711c7aa7b _PyObject_MakeTpCall + 603
17      0x5a1711c92acb /usr/bin/python3(+0x168acb) [0x5a1711c92acb]
18      0x5a1711c72cfa _PyEval_EvalFrameDefault + 24906
19      0x5a1711c849fc _PyFunction_Vectorcall + 124
20      0x5a1711c6f5d7 _PyEval_EvalFrameDefault + 10791
21      0x5a1711c849fc _PyFunction_Vectorcall + 124
22      0x5a1711c6d45c _PyEval_EvalFrameDefault + 2220
23      0x5a1711c849fc _PyFunction_Vectorcall + 124
24      0x5a1711c6d26d _PyEval_EvalFrameDefault + 1725
25      0x5a1711c849fc _PyFunction_Vectorcall + 124
26      0x5a1711c93492 PyObject_Call + 290
27      0x5a1711c6f5d7 _PyEval_EvalFrameDefault + 10791
28      0x5a1711c849fc _PyFunction_Vectorcall + 124
29      0x5a1711c93492 PyObject_Call + 290
30      0x5a1711c6f5d7 _PyEval_EvalFrameDefault + 10791
31      0x5a1711c849fc _PyFunction_Vectorcall + 124
32      0x5a1711c93492 PyObject_Call + 290
33      0x5a1711c6f5d7 _PyEval_EvalFrameDefault + 10791
34      0x5a1711c849fc _PyFunction_Vectorcall + 124
35      0x5a1711c6d26d _PyEval_EvalFrameDefault + 1725
36      0x5a1711c699c6 /usr/bin/python3(+0x13f9c6) [0x5a1711c699c6]
37      0x5a1711d5f256 PyEval_EvalCode + 134
38      0x5a1711d8a108 /usr/bin/python3(+0x260108) [0x5a1711d8a108]
39      0x5a1711d839cb /usr/bin/python3(+0x2599cb) [0x5a1711d839cb]
40      0x5a1711d89e55 /usr/bin/python3(+0x25fe55) [0x5a1711d89e55]
41      0x5a1711d89338 _PyRun_SimpleFileObject + 424
42      0x5a1711d88f83 _PyRun_AnyFileObject + 67
43      0x5a1711d7ba5e Py_RunMain + 702
44      0x5a1711d5202d Py_BytesMain + 45
45      0x7c4c40a74d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7c4c40a74d90]
46      0x7c4c40a74e40 __libc_start_main + 128
47      0x5a1711d51f25 _start + 37
[e8289975085f:07831] *** Process received signal ***
[e8289975085f:07831] Signal: Aborted (6)
[e8289975085f:07831] Signal code:  (-6)
[e8289975085f:07831] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7c4c40a8d520]
...
[e8289975085f:07831] [29] /usr/bin/python3(_PyEval_EvalFrameDefault+0x8ac)[0x5a1711c6d45c]
[e8289975085f:07831] *** End of error message ***
2024-04-08 06:01:10.369094: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-08 06:01:10.369156: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-08 06:01:10.370657: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

Additional notes

Here is the minimum reproducible notebook on Google Colab. I'm wondering whether it simply doesn't make sense to add these weight-only quantization flags when testing summarize.py, whether the T4 GPU doesn't support the functionality I'm testing, or whether this is some other environment issue. I'd really appreciate your feedback and guidance, thank you.
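A quick way to confirm the runtime's GPU architecture (a sketch that assumes PyTorch is available, as it is in this Colab environment):

# print the GPU name and compute capability
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"

On a T4 this prints compute capability (7, 5), i.e. sm_75, which is the architecture named in the CUTLASS GemmFpAIntB error above.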

byshiue commented 3 months ago

TensorRT-LLM does not support weight-only quantization on T4; T4 is not in TensorRT-LLM's support list. Could you try Ampere or Hopper GPUs?

luxiushu2023 commented 3 months ago

I met the same problem. Could you give some advice on how to handle it, other than changing the GPU architecture?

byshiue commented 3 months ago

Since weight-only quantization is not supported on T4, float16 is the only option if you want to run on T4.