NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[ERROR] Assertion failed: Can't free tmp workspace for GEMM tactics profiling. #1739

Open grvsh02 opened 3 months ago

grvsh02 commented 3 months ago

System Info

Who can help?

No response

Information

Reproduction

```shell
!python /kaggle/working/TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir mistralai/Mistral-7B-v0.3 \
    --output_dir ./tllm_checkpoint_mixtral \
    --dtype float16 \
    --tp_size 1 \
    --use_weight_only \
    --weight_only_precision int8

!trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral \
    --output_dir ./trt_engines/mixtral/tp1 \
    --max_input_len 2048 \
    --max_output_len 256
```

Expected behavior

A TensorRT engine for the Mistral 7B v0.3 model quantized to INT8 (weight-only)

Actual behavior

```
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
[06/05/2024-13:40:05] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 149, GPU 103 (MiB)
[06/05/2024-13:40:06] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +945, GPU +180, now: CPU 1230, GPU 283 (MiB)
[06/05/2024-13:40:06] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[TensorRT-LLM][WARNING] Fall back to unfused MHA because of unsupported head size 128 in sm_75.
(previous warning repeated 32 times in total)
[06/05/2024-13:40:06] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[06/05/2024-13:40:06] [TRT] [W] Unused Input: position_ids
[06/05/2024-13:40:06] [TRT] [W] Detected layernorm nodes in FP16.
[06/05/2024-13:40:06] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[06/05/2024-13:40:07] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[06/05/2024-13:40:07] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[06/05/2024-13:40:07] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/05/2024-13:41:01] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[06/05/2024-13:41:01] [TRT] [I] Detected 14 inputs and 1 output network tensors.
[06/05/2024-13:41:06] [TRT] [I] Total Host Persistent Memory: 107552
[06/05/2024-13:41:06] [TRT] [I] Total Device Persistent Memory: 0
[06/05/2024-13:41:06] [TRT] [I] Total Scratch Memory: 257954304
[06/05/2024-13:41:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 556 steps to complete.
[06/05/2024-13:41:06] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 44.0467ms to assign 16 blocks to 556 nodes requiring 333457408 bytes.
[06/05/2024-13:41:06] [TRT] [I] Total Activation Memory: 333456384
[06/05/2024-13:41:06] [TRT] [I] Total Weights Memory: 7536452864

[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1, n=6144, k=4096). Will try to use default or fail at runtime
(same warning repeated for m=2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024)
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Can't free tmp workspace for GEMM tactics profiling. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/gemmPluginProfiler.cpp:204)
1  0x7eca3eea05af /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x7e5af) [0x7eca3eea05af]
2  0x7eca3ef69446 tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::freeTmpData() + 70
3  0x7eca3ef74c28 tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::profileTactics(std::shared_ptr const&, nvinfer1::DataType const&, tensorrt_llm::plugins::GemmDims const&, tensorrt_llm::plugins::GemmIdCore const&) + 1272
4  0x7eca3ef47789 tensorrt_llm::plugins::WeightOnlyQuantMatmulPlugin::initialize() + 9
5  0x7ecbd6834a25 /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0x1065a25) [0x7ecbd6834a25]
6  0x7ecbd67c10aa /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xff20aa) [0x7ecbd67c10aa]
7  0x7ecbd65adfcf /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xddefcf) [0x7ecbd65adfcf]
8  0x7ecbd65b007c /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde107c) [0x7ecbd65b007c]
9  0x7ecbd65b2071 /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde3071) [0x7ecbd65b2071]
10 0x7ecbd61f761c /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2861c) [0x7ecbd61f761c]
11 0x7ecbd61fc837 /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2d837) [0x7ecbd61fc837]
12 0x7ecbd61fd1af /opt/conda/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2e1af) [0x7ecbd61fd1af]
13 0x7ecbd115e478 /opt/conda/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0xa6478) [0x7ecbd115e478]
14 0x7ecbd10fd7a3 /opt/conda/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0x457a3) [0x7ecbd10fd7a3]
(frames 15-46: CPython interpreter call stack — _PyObject_MakeTpCall, _PyEval_EvalFrameDefault, _PyFunction_Vectorcall, PyObject_Call, PyEval_EvalCode, _PyRun_SimpleFileObject, _PyRun_AnyFileObject, Py_RunMain, Py_BytesMain, libc_start_main)
[57166b281e39:03015] Process received signal
[57166b281e39:03015] Signal: Aborted (6)
[57166b281e39:03015] Signal code: (-6)
(signal-handler backtrace frames [0]-[29] repeat the abort path above: GemmPluginProfiler::profileTactics, WeightOnlyQuantMatmulPlugin::initialize, libnvinfer.so.10, tensorrt.so, and the CPython interpreter frames)
[57166b281e39:03015] End of error message
```

Additional notes

  • I'm using Kaggle to compile the model
hijkzzz commented 3 months ago

Duplicate of: https://github.com/NVIDIA/TensorRT-LLM/issues/1732

nv-guomingz commented 3 months ago

I can reproduce this issue in my environment; we've filed a bug to track it. @hijkzzz

1ytic commented 2 months ago

I found this comment in #1416: weight-only quantization is not supported on T4.
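That would explain the build log above: the `sm_75` fallback warnings and the crash inside `WeightOnlyQuantMatmulPlugin::initialize()` both point at the GPU generation. A quick sanity check before running `trtllm-build` is to read the GPU's compute capability and compare it against what the weight-only kernels need. The sketch below is a heuristic, not official TensorRT-LLM behavior: the `>= 8.0` (Ampere) threshold is an assumption based on the comment above, and the `nvidia-smi compute_cap` query requires a reasonably recent driver — consult the TensorRT-LLM support matrix for the authoritative list.

```python
import subprocess


def supports_weight_only(major: int, minor: int) -> bool:
    """Heuristic check (assumption, not the official support matrix):
    treat weight-only INT8 quantization as needing compute capability
    >= 8.0 (Ampere). Kaggle's T4 is sm_75, so it would fail here."""
    return (major, minor) >= (8, 0)


def gpu_compute_capability() -> tuple[int, int]:
    """Query the first visible GPU's compute capability via nvidia-smi
    (the compute_cap query field needs a recent NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        text=True,
    )
    major, minor = out.strip().splitlines()[0].split(".")
    return int(major), int(minor)


# Example with a hardcoded T4 capability (sm_75):
print(supports_weight_only(7, 5))  # → False: drop --use_weight_only on T4
```

On a T4 this check fails, which matches the crash here: the build would need to drop `--use_weight_only`/`--weight_only_precision int8` or run on a newer GPU.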

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.