NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

tensorrt_llm llama Engine build error #1164

Open limes22 opened 6 months ago

limes22 commented 6 months ago

System Info

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
    --output_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16

Running this command produces the following error.

It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 514, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 441, in parallel_build
    with ProcessPoolExecutor(mp_context=get_context('spawn'),
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 611, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
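For context, the final ValueError is raised by Python's standard-library ProcessPoolExecutor, which rejects max_workers <= 0. Assuming the build script derives its worker count from the number of CUDA devices visible to the process (an assumption about build.py, not verified here), a machine where PyTorch sees zero GPUs reproduces the same failure, as in this minimal sketch:

```python
# Minimal sketch of the failure mode. Assumption (not taken from build.py):
# the worker count for parallel_build is derived from the number of visible
# CUDA devices, so zero visible GPUs leads to max_workers == 0.
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import get_context

import torch

workers = torch.cuda.device_count()  # 0 when no GPU is visible to the process
print(f"Visible CUDA devices: {workers}")

# With workers == 0, ProcessPoolExecutor raises exactly the error in the
# traceback: ValueError: max_workers must be greater than 0
with ProcessPoolExecutor(max_workers=workers,
                         mp_context=get_context('spawn')) as pool:
    pass
```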

Who can help?

No response

Information

Tasks

Reproduction

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
    --output_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16

Expected behavior

The engine build completes successfully.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:626: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
[02/26/2024-09:18:30] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set lookup_plugin to None.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set lora_plugin to None.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set moe_plugin to float16.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set context_fmha to True.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set paged_kv_cache to True.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set remove_input_padding to True.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set multi_block_mode to False.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set enable_xqa to True.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set tokens_per_block to 128.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[02/26/2024-09:18:30] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[02/26/2024-09:18:30] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 514, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 441, in parallel_build
    with ProcessPoolExecutor(mp_context=get_context('spawn'),
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 611, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0

Additional notes

If anyone has solved the problem, please comment.

Vladimir-125 commented 6 months ago

I seem to have found the reason! It is because no GPU was found and the number of workers was set to 0 here. Check whether your container or host sees the GPU by running nvidia-smi. In my case, the container was not connected to CUDA, which resulted in the error. I restarted the container and the problem was resolved.
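A quick way to confirm this diagnosis from inside the container before re-running trtllm-build is to check what PyTorch can see (a hypothetical sanity-check script, not part of TensorRT-LLM):

```python
# Hypothetical sanity check, not part of TensorRT-LLM: verify that CUDA
# devices are visible inside the container before running trtllm-build.
import torch

count = torch.cuda.device_count()
if not torch.cuda.is_available() or count == 0:
    raise SystemExit("No CUDA device visible; check nvidia-smi and the "
                     "container's GPU passthrough (e.g. --gpus all).")

for i in range(count):
    print(i, torch.cuda.get_device_name(i))
```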

vladimir1257 commented 6 months ago

GitHub blocked my original account for some reason, so I am reposting it here. Hope it helps!

I seem to have found the reason! It is because no GPU was found and the number of workers was set to 0 here. Check whether your container or host sees the GPU by running nvidia-smi. In my case, the container was not connected to CUDA, which resulted in the error. I restarted the container and the problem was resolved.