NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

TypeError: a bytes-like object is required, not 'NoneType' when trying to build engine in `float32` precision #1485

Closed. Anindyadeep closed this issue 4 months ago.

Anindyadeep commented 4 months ago

System Info

Who can help?

@byshiue @kaiyux

Reproduction

Here are the steps to reproduce the issue. Let's say you already have the Hugging Face model weights folder downloaded.

Step 1: Conversion to safetensors (working)

python3 convert_checkpoint.py --model_dir /mnt/models/llama-2-7b-chat-hf --output_dir /mnt/models/llama-2-7b-chat-float32 --dtype float32 

Step 2: Conversion to engine file (not working)

trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32

This gives me the following error:

File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in
TypeError: a bytes-like object is required, not 'NoneType'

I even tried the `--strongly_typed` flag, and it then gave a different error:

trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32 --strongly_typed

This gives the following error:

  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 233, in from_json_file
    return PretrainedConfig.from_dict(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 169, in from_dict
    architecture = config.pop('architecture')
KeyError: 'architecture'

Expected behavior

A rank0.engine binary should be created inside the output_dir, just as it is for other precisions such as float16, int8, and int4.

Actual behavior

Here are the full logs for both commands:

Engine building command 1:

root@d0453720d865:/app/tensorrt_llm/examples/llama# python3 convert_checkpoint.py --model_dir /mnt/models/llama-2-7b-chat-hf --output_dir /mnt/models/llama-2-7b-chat-float32 --dtype float32 
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
0.10.0.dev2024041600
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.13s/it]
Weights loaded. Total time: 00:00:20
Total time of converting checkpoints: 00:01:11
root@d0453720d865:/app/tensorrt_llm/examples/llama# trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/22/2024-12:16:54] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set gemm_plugin to float32.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set moe_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set multiple_profiles to False.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set paged_state to True.
[04/22/2024-12:16:54] [TRT-LLM] [I] Set streamingllm to False.
[04/22/2024-12:16:54] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-12:16:54] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[04/22/2024-12:16:55] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 261, GPU 3130 (MiB)
[04/22/2024-12:16:57] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2370, GPU 3480 (MiB)
[04/22/2024-12:16:57] [TRT-LLM] [I] Set nccl_plugin to None.
[04/22/2024-12:16:57] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-12:16:57] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/22/2024-12:16:57] [TRT] [W] Unused Input: position_ids
[04/22/2024-12:16:57] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/22/2024-12:16:57] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2417, GPU 3506 (MiB)
[04/22/2024-12:16:57] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2418, GPU 3516 (MiB)
[04/22/2024-12:16:57] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/22/2024-12:16:57] [TRT] [E] 9: LLaMAForCausalLM/transformer/layers/0/attention/PLUGIN_V2_GPTAttention_0: could not find any supported formats consistent with input/output data types
[04/22/2024-12:16:57] [TRT] [E] 9: [pluginV2Builder.cpp::reportPluginError::24] Error Code 9: Internal Error (LLaMAForCausalLM/transformer/layers/0/attention/PLUGIN_V2_GPTAttention_0: could not find any supported formats consistent with input/output data types)
[04/22/2024-12:16:57] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/22/2024-12:16:57] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/22/2024-12:16:57] [TRT] [I] Serialized 0 timing cache entries
[04/22/2024-12:16:57] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/22/2024-12:16:57] [TRT-LLM] [I] Serializing engine to /mnt/models/llama-2-7b-chat-float32/rank0.engine...
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 454, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 342, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 569, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'

Engine building command 2:

root@d0453720d865:/app/tensorrt_llm/examples/llama# trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32 --strongly_typed
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/22/2024-12:18:18] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set gemm_plugin to float32.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set moe_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set multiple_profiles to False.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set paged_state to True.
[04/22/2024-12:18:18] [TRT-LLM] [I] Set streamingllm to False.
[04/22/2024-12:18:18] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-12:18:18] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 454, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 324, in parallel_build
    model_config = PretrainedConfig.from_json_file(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 233, in from_json_file
    return PretrainedConfig.from_dict(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 169, in from_dict
    architecture = config.pop('architecture')
KeyError: 'architecture'
root@d0453720d865:/app/tensorrt_llm/examples/llama# 

Additional notes

I also do not see any float32-related examples in TensorRT-LLM. Is float32 simply not implemented?

byshiue commented 4 months ago

You also need to set gpt_attention_plugin to float32 in the first build command.
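
For example, the first build command would then look something like this (same paths as before; --gpt_attention_plugin is the corresponding trtllm-build flag):

trtllm-build --checkpoint_dir /mnt/models/llama-2-7b-chat-float32/ --output_dir /mnt/models/llama-2-7b-chat-float32/ --gemm_plugin float32 --gpt_attention_plugin float32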

Anindyadeep commented 4 months ago

Ah, I see. Okay, let me try that. Thanks :)

Anindyadeep commented 4 months ago

Hey @byshiue, this works, thanks a lot!
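
For anyone else landing here: once --gpt_attention_plugin float32 is added to the trtllm-build command, the output_dir should end up containing the serialized engine, which is easy to sanity-check, e.g.:

ls -lh /mnt/models/llama-2-7b-chat-float32/rank0.engine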