Expected behavior
A rank0.engine binary should be created inside the output_dir, just as it is when I build for other precisions such as float16, int8, and int4.
Actual behavior
Here are the full logs for both commands:
Engine building command 1:
root@d0453720d865:/app/tensorrt_llm/examples/llama# python3 convert_checkpoint.py --model_dir /mnt/models/llama-2-7b-chat-hf --output_dir /mnt/models/llama-2-7b-chat-float32 --dtype float32
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
0.10.0.dev2024041600
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.13s/it]
Weights loaded. Total time: 00:00:20
Total time of converting checkpoints: 00:01:11
root@d0453720d865:/app/tensorrt_llm/examples/llama# trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/22/2024-12:16:54] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set gemm_plugin to float32.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set moe_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set multiple_profiles to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set paged_state to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set streamingllm to False.
[04/22/2024-12:16:54] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-12:16:54] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[04/22/2024-12:16:55] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 261, GPU 3130 (MiB)
[04/22/2024-12:16:57] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2370, GPU 3480 (MiB)
[04/22/2024-12:16:57] [TRT-LLM] [I] Set nccl_plugin to None.
[04/22/2024-12:16:57] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-12:16:57] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/22/2024-12:16:57] [TRT] [W] Unused Input: position_ids
[04/22/2024-12:16:57] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/22/2024-12:16:57] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2417, GPU 3506 (MiB)
[04/22/2024-12:16:57] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2418, GPU 3516 (MiB)
[04/22/2024-12:16:57] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/22/2024-12:16:57] [TRT] [E] 9: LLaMAForCausalLM/transformer/layers/0/attention/PLUGIN_V2_GPTAttention_0: could not find any supported formats consistent with input/output data types
[04/22/2024-12:16:57] [TRT] [E] 9: [pluginV2Builder.cpp::reportPluginError::24] Error Code 9: Internal Error (LLaMAForCausalLM/transformer/layers/0/attention/PLUGIN_V2_GPTAttention_0: could not find any supported formats consistent with input/output data types)
[04/22/2024-12:16:57] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/22/2024-12:16:57] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/22/2024-12:16:57] [TRT] [I] Serialized 0 timing cache entries
[04/22/2024-12:16:57] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/22/2024-12:16:57] [TRT-LLM] [I] Serializing engine to /mnt/models/llama-2-7b-chat-float32/rank0.engine...
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 454, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 342, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
engine.save(output_dir)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 569, in save
serialize_engine(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
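For what it's worth, the TypeError at the end looks like a downstream symptom rather than the root cause: the build already failed at the GPTAttention plugin, so no engine bytes were produced, and serializing None then blows up. A minimal sketch of that failure chain (stand-in functions, not TensorRT-LLM's actual implementation):

```python
def build_engine():
    # Stand-in for the TensorRT build step: when the build fails
    # (as in the plugin error above), no serialized engine is produced.
    return None

def serialize_engine(engine, path):
    # open() in "wb" mode succeeds, but f.write(None) raises TypeError
    # because write() requires a bytes-like object.
    with open(path, "wb") as f:
        f.write(engine)

engine = build_engine()
try:
    serialize_engine(engine, "/tmp/rank0.engine")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # a bytes-like object is required, not 'NoneType'
```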
Engine building command 2:
root@d0453720d865:/app/tensorrt_llm/examples/llama# trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32 --strongly_typed
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/22/2024-12:18:18] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set gemm_plugin to float32.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set moe_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set multiple_profiles to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set paged_state to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set streamingllm to False.
[04/22/2024-12:18:18] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-12:18:18] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 454, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 324, in parallel_build
model_config = PretrainedConfig.from_json_file(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 233, in from_json_file
return PretrainedConfig.from_dict(config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 169, in from_dict
architecture = config.pop('architecture')
KeyError: 'architecture'
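As for the second failure, from_dict pops the 'architecture' key with no default, so any checkpoint config.json missing that key fails immediately with this KeyError. A minimal reproduction of that pattern (load_architecture is a made-up name for illustration, not the library's code):

```python
def load_architecture(config: dict) -> str:
    # dict.pop() without a default raises KeyError when the key is
    # absent -- the same pattern as PretrainedConfig.from_dict above.
    return config.pop('architecture')

broken_config = {"dtype": "float32"}  # no 'architecture' entry
try:
    load_architecture(broken_config)
except KeyError as exc:
    missing_key = exc.args[0]
    print(missing_key)  # architecture
```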
root@d0453720d865:/app/tensorrt_llm/examples/llama#
Additional notes
I also do not see any float32 examples inside TensorRT-LLM. Is float32 support not implemented?
System Info
Who can help?
@byshiue @kaiyux
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Here are the steps to reproduce. Assume the Hugging Face model weights folder is already downloaded.
Step 1: Convert the checkpoint to safetensors (working).
Step 2: Build the engine file (not working). This fails with the error shown in the "Engine building command 1" log above. I even used the --strongly_typed flag, and it then gave a different error, the KeyError: 'architecture' shown in the "Engine building command 2" log.
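For convenience, the exact commands from the logs above, in order (paths as in my environment):

```shell
# Step 1: convert the HF checkpoint to a float32 TensorRT-LLM checkpoint (works)
python3 convert_checkpoint.py --model_dir /mnt/models/llama-2-7b-chat-hf \
    --output_dir /mnt/models/llama-2-7b-chat-float32 --dtype float32

# Step 2: build the engine (fails with the plugin format error)
trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ \
    --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32

# Step 2, retried with --strongly_typed (fails with KeyError: 'architecture')
trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ \
    --output_dir /mnt/models/llama-2-7b-chat-float32/ \
    --gemm_plugin float32 --strongly_typed
```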