NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
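
For orientation, the Python API mentioned above looks roughly like this in recent releases. This is a minimal sketch, not taken from this issue: the import path and keyword names have moved between versions, and the model name is just an example, so check the docs for the version you run.

```python
# Minimal sketch of the high-level TensorRT-LLM Python API from recent
# releases; the import path, kwargs, and model name are assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct",
          tensor_parallel_size=8)            # tp_size=8, as in this issue
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```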

Failed to build Llama 3 70B when workers > 1 and tp_size=8 #1696

Open · WDONG66 opened this issue 1 month ago

WDONG66 commented 1 month ago

System Info

TensorRT-LLM: 0.11.0.dev2024052800
TensorRT: 10.0.1
Device: A800
Code for TensorRT-LLM: latest version in main branch

Who can help?

@byshiue

Reproduction

trtllm-build --checkpoint_dir {checkpoint_dir} --output_dir {output_dir} --gemm_plugin float16 --gpt_attention_plugin float16 --paged_kv_cache enable --max_batch_size 16 --max_input_len 2048 --max_output_len 2048 --tp_size 8 --pp_size 1 --workers 8
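
For scale (my own back-of-envelope, not from the thread): tp_size 8 splits the model into eight rank engines, and --workers 8 builds all eight in separate processes at once, so peak host memory is roughly eight single-rank builds plus TensorRT builder workspace.

```python
# Rough arithmetic for Llama 3 70B in float16 with tp_size=8; the 70e9
# parameter count and 2 bytes/param are assumptions, and the TensorRT
# builder needs workspace on top of the raw weights.
params = 70e9
bytes_per_param = 2            # float16
tp_size = 8
workers = 8

per_rank_gib = params * bytes_per_param / tp_size / 2**30
print(f"~{per_rank_gib:.0f} GiB of weights per rank")          # ~16 GiB
print(f"~{per_rank_gib * workers:.0f} GiB if all {workers} ranks build concurrently")
```

Dropping --workers (e.g. to 1 or 2) is a common way to trade build time for memory in similar reports, though this thread does not confirm memory as the cause.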

Expected behavior

The engine builds successfully with workers > 1.

Actual behavior

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 394, in parallel_build
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  [the three frames above repeat seven more times in the original paste]
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
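
For what this exception means on its own: BrokenProcessPool is raised by Python's concurrent.futures whenever a pool worker dies without delivering a result, so it hides the real per-rank failure (in similar reports the worker is often killed by the host OOM killer, but this thread does not confirm the cause). A self-contained sketch of the mechanism:

```python
# Standalone illustration of how future.result() in parallel_build can
# surface BrokenProcessPool: a worker that dies abruptly (os._exit
# simulates e.g. an OOM kill) breaks the whole pool.
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash():
    os._exit(1)   # terminate without returning a result

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        future = pool.submit(crash)
        try:
            future.result()   # the same call that fails in build.py
        except BrokenProcessPool as exc:
            print(f"caught: {exc}")
```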

Additional notes

None

nv-guomingz commented 1 month ago

Hi @WDONG66, would you please share the convert command you used to generate the checkpoint files?

WDONG66 commented 1 month ago

OK, the convert command is as follows:

python3 convert_checkpoint.py --model_dir={model_dir} --output_dir={output_dir} --dtype float16 --tp_size=8 --pp_size=1
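
One thing that may be worth verifying before the parallel build (my suggestion, not something raised in the thread) is that the converted checkpoint really contains one shard per rank and that its mapping matches the tp_size/pp_size passed to trtllm-build. A hypothetical sanity check, assuming the unified checkpoint layout (config.json plus rankN.safetensors) that convert_checkpoint.py writes:

```python
# Hypothetical pre-build check; the directory name and the config.json
# schema ("mapping" with tp_size/pp_size) are assumptions based on the
# unified checkpoint layout, not taken from this thread.
import json
from pathlib import Path

checkpoint_dir = Path("./tllm_checkpoint_8gpu")   # assumed path

config = json.loads((checkpoint_dir / "config.json").read_text())
tp = config["mapping"]["tp_size"]
pp = config["mapping"]["pp_size"]

for rank in range(tp * pp):
    shard = checkpoint_dir / f"rank{rank}.safetensors"
    assert shard.exists(), f"missing shard for rank {rank}: {shard}"
print(f"found all {tp * pp} rank shards (tp={tp}, pp={pp})")
```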

nv-guomingz commented 1 month ago

Hi @WDONG66, I can't reproduce your issue on 8x L40S. Could you please paste the full log from building the engine?

Here is the log output from building the engine in my local environment.

python3 -m tensorrt_llm.commands.build --checkpoint_dir ./tllm_checkpoint_8gpu_sq --output_dir ./engine_outputs --gemm_plugin float16 --gpt_attention_plugin float16 --paged_kv_cache enable --max_batch_size 16 --max_input_len 2048 --max_output_len 2048 --tp_size 8 --pp_size 1 --workers 8
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[05/30/2024-07:33:35] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set gemm_plugin to float16.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set nccl_plugin to auto.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set lookup_plugin to None.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set lora_plugin to None.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set moe_plugin to auto.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set context_fmha to True.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set remove_input_padding to True.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set multi_block_mode to False.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set enable_xqa to True.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set tokens_per_block to 64.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set multiple_profiles to False.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set paged_state to True.
[05/30/2024-07:33:35] [TRT-LLM] [I] Set streamingllm to False.
[05/30/2024-07:33:35] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/30/2024-07:33:35] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 32768
[05/30/2024-07:33:40] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[05/30/2024-07:33:40] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[... the version banner and the "Logger level already set" warnings repeat once per worker process ...]
[05/30/2024-07:33:48] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[05/30/2024-07:33:48] [TRT] [W] Unused Input: position_ids
[05/30/2024-07:33:49] [TRT] [W] Detected layernorm nodes in FP16.
[05/30/2024-07:33:49] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[05/30/2024-07:33:49] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[... the same five [TRT] warnings repeat for each of the eight build ranks, timestamped 07:33:48 through 07:34:03 ...]
[05/30/2024-07:46:20] [TRT-LLM] [I] Total time of building all engines: 00:12:44
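
Since the BrokenProcessPool traceback buries the per-rank failure, scanning the full build log (as requested above) for the first error-level line is usually the quickest way to localize it; a small hypothetical helper along those lines:

```python
# Hypothetical helper, not part of TensorRT-LLM: print error-looking
# lines from a captured build log. "build.log" is an assumed name;
# capture one with e.g.  trtllm-build ... 2>&1 | tee build.log
import re
import sys

ERROR_PAT = re.compile(r"\[E\]|\bError\b|Killed|terminated abruptly")

log_path = sys.argv[1] if len(sys.argv) > 1 else "build.log"
with open(log_path) as f:
    for lineno, line in enumerate(f, 1):
        if ERROR_PAT.search(line):
            print(f"{lineno}: {line.rstrip()}")
```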