NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mixtral 8x7b to TRT fails with Assertion failed: Can't allocate tmp workspace for GEMM tactics profiling. #1908

Closed OrZipori closed 2 weeks ago

OrZipori commented 2 weeks ago

System Info

Followed https://nvidia.github.io/TensorRT-LLM/installation/linux.html to create the Docker container and install all prerequisites.

Who can help?

No response

Information

Tasks

Reproduction

I followed the https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral page to compile and deploy Mixtral 8x7B.

```
python3 convert_checkpoint.py --model_dir /model \
    --output_dir /tmp/tllm_checkpoint_mixtral_2gpu --dtype float16 \
    --tp_size 2
```

The above command works and produces the checkpoint artifacts.

```
trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_mixtral_2gpu \
    --output_dir /tmp/tllm_checkpoint_mixtral_2gpu_trt \
    --gemm_plugin float16
```

This fails with:

```
[TensorRT-LLM][ERROR] Assertion failed: Can't allocate tmp workspace for GEMM tactics profiling. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/gemmPluginProfiler.cpp:197)
```

Expected behavior

TensorRT engine artifacts in /tmp/tllm_checkpoint_mixtral_2gpu_trt.

Actual behavior

```
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070200
[07/07/2024-09:21:10] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set lookup_plugin to None.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set lora_plugin to None.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set moe_plugin to auto.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set context_fmha to True.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set remove_input_padding to True.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set reduce_fusion to False.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set multi_block_mode to False.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set enable_xqa to True.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set multiple_profiles to False.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set paged_state to True.
[07/07/2024-09:21:10] [TRT-LLM] [I] Set streamingllm to False.
[07/07/2024-09:21:10] [TRT-LLM] [I] max_seq_len is not specified, using value 32768
[07/07/2024-09:21:10] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[07/07/2024-09:21:10] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/07/2024-09:21:11] [TRT-LLM] [I] Set dtype to float16.
[07/07/2024-09:21:11] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 143, GPU 432 (MiB)
[07/07/2024-09:21:13] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1644, GPU +292, now: CPU 1935, GPU 724 (MiB)
[07/07/2024-09:21:13] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/07/2024-09:21:13] [TRT-LLM] [I] Set nccl_plugin to float16.
[07/07/2024-09:21:13] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/07/2024-09:21:14] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[07/07/2024-09:21:14] [TRT] [W] Unused Input: position_ids
[07/07/2024-09:21:14] [TRT] [W] Detected layernorm nodes in FP16.
[07/07/2024-09:21:14] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[07/07/2024-09:21:14] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[07/07/2024-09:21:14] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[07/07/2024-09:21:17] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[07/07/2024-09:21:17] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[07/07/2024-09:21:47] [TRT] [I] Total Host Persistent Memory: 107008
[07/07/2024-09:21:47] [TRT] [I] Total Device Persistent Memory: 0
[07/07/2024-09:21:47] [TRT] [I] Total Scratch Memory: 705237632
[07/07/2024-09:21:47] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 562 steps to complete.
[07/07/2024-09:21:47] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 58.1094ms to assign 17 blocks to 562 nodes requiring 973679616 bytes.
[07/07/2024-09:21:47] [TRT] [I] Total Activation Memory: 973678592
[07/07/2024-09:21:47] [TRT] [I] Total Weights Memory: 46854053888
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Can't allocate tmp workspace for GEMM tactics profiling. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/gemmPluginProfiler.cpp:197)
1   0x7f20b009766f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x8066f) [0x7f20b009766f]
2   0x7f20b0171772 tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, tensorrt_llm::plugins::MixtureOfExpertsPlugin, tensorrt_llm::plugins::GemmIDMoe, tensorrt_llm::plugins::GemmIDMoeHash>::allocateTmpData() + 82
3   0x7f20b017bf38 tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, tensorrt_llm::plugins::MixtureOfExpertsPlugin, tensorrt_llm::plugins::GemmIDMoe, tensorrt_llm::plugins::GemmIDMoeHash>::profileTactics(tensorrt_llm::plugins::MixtureOfExpertsPlugin* const&, nvinfer1::DataType const&, tensorrt_llm::plugins::GemmDims const&, tensorrt_llm::plugins::GemmIDMoe const&) + 1416
4   0x7f20b015c1a9 tensorrt_llm::plugins::MixtureOfExpertsPlugin::initialize() + 41
5   0x7f227b1b46e5 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x108c6e5) [0x7f227b1b46e5]
6   0x7f227b141de2 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1019de2) [0x7f227b141de2]
7   0x7f227af2c56c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe0456c) [0x7f227af2c56c]
8   0x7f227af2e21c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe0621c) [0x7f227af2e21c]
9   0x7f227af30328 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe08328) [0x7f227af30328]
10  0x7f227ab7f2ac /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa572ac) [0x7f227ab7f2ac]
11  0x7f227ab84501 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5c501) [0x7f227ab84501]
12  0x7f227ab84f0b /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5cf0b) [0x7f227ab84f0b]
13  0x7f2224aa7458 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa7458) [0x7f2224aa7458]
14  0x7f2224a458f3 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3) [0x7f2224a458f3]
15  0x55c079b4410e /usr/bin/python3(+0x15a10e) [0x55c079b4410e]
16  0x55c079b3aa7b _PyObject_MakeTpCall + 603
17  0x55c079b52acb /usr/bin/python3(+0x168acb) [0x55c079b52acb]
18  0x55c079b32cfa _PyEval_EvalFrameDefault + 24906
19  0x55c079b449fc _PyFunction_Vectorcall + 124
20  0x55c079b2f5d7 _PyEval_EvalFrameDefault + 10791
21  0x55c079b449fc _PyFunction_Vectorcall + 124
22  0x55c079b2d45c _PyEval_EvalFrameDefault + 2220
23  0x55c079b449fc _PyFunction_Vectorcall + 124
24  0x55c079b2d26d _PyEval_EvalFrameDefault + 1725
25  0x55c079b449fc _PyFunction_Vectorcall + 124
26  0x55c079b53492 PyObject_Call + 290
27  0x55c079b2f5d7 _PyEval_EvalFrameDefault + 10791
28  0x55c079b449fc _PyFunction_Vectorcall + 124
29  0x55c079b53492 PyObject_Call + 290
30  0x55c079b2f5d7 _PyEval_EvalFrameDefault + 10791
31  0x55c079b449fc _PyFunction_Vectorcall + 124
32  0x55c079b53492 PyObject_Call + 290
33  0x55c079b2f5d7 _PyEval_EvalFrameDefault + 10791
34  0x55c079b449fc _PyFunction_Vectorcall + 124
35  0x55c079b2d26d _PyEval_EvalFrameDefault + 1725
36  0x55c079b299c6 /usr/bin/python3(+0x13f9c6) [0x55c079b299c6]
37  0x55c079c1f256 PyEval_EvalCode + 134
38  0x55c079c4a108 /usr/bin/python3(+0x260108) [0x55c079c4a108]
39  0x55c079c439cb /usr/bin/python3(+0x2599cb) [0x55c079c439cb]
40  0x55c079c49e55 /usr/bin/python3(+0x25fe55) [0x55c079c49e55]
41  0x55c079c49338 _PyRun_SimpleFileObject + 424
42  0x55c079c48f83 _PyRun_AnyFileObject + 67
43  0x55c079c3ba5e Py_RunMain + 702
44  0x55c079c1202d Py_BytesMain + 45
45  0x7f232f68bd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f232f68bd90]
46  0x7f232f68be40 __libc_start_main + 128
47  0x55c079c11f25 _start + 37
[14339a567a22:09001] *** Process received signal ***
[14339a567a22:09001] Signal: Aborted (6)
[14339a567a22:09001] Signal code:  (-6)
[14339a567a22:09001] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f232f6a4520]
[14339a567a22:09001] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f232f6f89fc]
[14339a567a22:09001] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f232f6a4476]
[14339a567a22:09001] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f232f68a7f3]
[14339a567a22:09001] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f228ce76b9e]
[14339a567a22:09001] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f228ce8220c]
[14339a567a22:09001] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f228ce811e9]
[14339a567a22:09001] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f228ce81959]
[14339a567a22:09001] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f232f394884]
[14339a567a22:09001] [ 9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f232f3952dd]
[14339a567a22:09001] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x165015)[0x7f20b017c015]
[14339a567a22:09001] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins22MixtureOfExpertsPlugin10initializeEv+0x29)[0x7f20b015c1a9]
[14339a567a22:09001] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x108c6e5)[0x7f227b1b46e5]
[14339a567a22:09001] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1019de2)[0x7f227b141de2]
[14339a567a22:09001] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe0456c)[0x7f227af2c56c]
[14339a567a22:09001] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe0621c)[0x7f227af2e21c]
[14339a567a22:09001] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe08328)[0x7f227af30328]
[14339a567a22:09001] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa572ac)[0x7f227ab7f2ac]
[14339a567a22:09001] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5c501)[0x7f227ab84501]
[14339a567a22:09001] [19] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5cf0b)[0x7f227ab84f0b]
[14339a567a22:09001] [20] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa7458)[0x7f2224aa7458]
[14339a567a22:09001] [21] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3)[0x7f2224a458f3]
[14339a567a22:09001] [22] /usr/bin/python3(+0x15a10e)[0x55c079b4410e]
[14339a567a22:09001] [23] /usr/bin/python3(_PyObject_MakeTpCall+0x25b)[0x55c079b3aa7b]
[14339a567a22:09001] [24] /usr/bin/python3(+0x168acb)[0x55c079b52acb]
[14339a567a22:09001] [25] /usr/bin/python3(_PyEval_EvalFrameDefault+0x614a)[0x55c079b32cfa]
[14339a567a22:09001] [26] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55c079b449fc]
[14339a567a22:09001] [27] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2a27)[0x55c079b2f5d7]
[14339a567a22:09001] [28] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55c079b449fc]
[14339a567a22:09001] [29] /usr/bin/python3(_PyEval_EvalFrameDefault+0x8ac)[0x55c079b2d45c]
[14339a567a22:09001] *** End of error message ***
Aborted (core dumped)
```

Additional notes

-

OrZipori commented 2 weeks ago

Mixtral 8x7B requires more than 2 GPUs (L40S). Using --tp_size 4 solved it.
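The memory figures in the build log make this resolution plausible: at --tp_size 2, the logged per-rank fp16 weights alone come to 46,854,053,888 bytes (roughly 43.6 GiB), which nearly fills a 48 GB L40S before activations and the GEMM profiler's temporary workspace are allocated. A back-of-envelope sketch (the 48 GB L40S capacity is the published spec, assumed here to match the reporter's hardware):

```python
# Rough per-GPU memory budget for the Mixtral 8x7B fp16 build in this issue.
# The byte counts are copied from the trtllm-build log above; the exact usable
# capacity of an L40S (48 GB card, somewhat less available in practice) is an
# assumption, so this is an estimate rather than an exact accounting.

GIB = 1024**3
WEIGHTS_PER_RANK_TP2 = 46_854_053_888   # "Total Weights Memory" logged at tp_size=2
ACTIVATIONS = 973_678_592               # "Total Activation Memory" from the log

# Full fp16 model size, reconstructed from the two tp_size=2 shards.
total_weights = WEIGHTS_PER_RANK_TP2 * 2

for tp in (2, 4):
    weights = total_weights // tp
    print(f"tp_size={tp}: ~{weights / GIB:.1f} GiB weights per GPU "
          f"+ ~{ACTIVATIONS / GIB:.1f} GiB activations")
```

At tp_size=2 the weight shard leaves almost no headroom on a 48 GB card, so the profiler's workspace allocation fails; at tp_size=4 each shard drops to roughly 21.8 GiB, which is why spreading the model over 4 GPUs resolved the build.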