NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
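For context, this is the Python API the project description refers to. A minimal sketch, assuming a recent TensorRT-LLM release that exposes the high-level `LLM`/`SamplingParams` entry points at the package root (older releases placed them elsewhere); `gpt2` is only an illustrative model name:

```python
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the model, then runs generation.
llm = LLM(model="gpt2")
params = SamplingParams(max_tokens=32, temperature=0.8)
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```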

TypeError: a bytes-like object is required, not 'NoneType' #1479

Closed: wu1143100799 closed this issue 4 months ago

wu1143100799 commented 4 months ago

command:

trtllm-build --checkpoint_dir gpt2/trt_ckpt/fp8/1-gpu \
             --gpt_attention_plugin float16 \
             --remove_input_padding enable \
             --strongly_typed \
             --output_dir gpt2/trt_engines/fp8/1-gpu

running log:

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/22/2024-04:03:27] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set gemm_plugin to None.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set moe_plugin to float16.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set multiple_profiles to False.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set paged_state to True.
[04/22/2024-04:03:27] [TRT-LLM] [I] Set streamingllm to False.
[04/22/2024-04:03:27] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-04:03:27] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
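(As an aside, the warning above is easy to address by passing `max_num_tokens` explicitly; `--max_num_tokens` is the trtllm-build option the warning names, while the value 4096 below is only an illustrative placeholder to be tuned to the actual workload:)

```shell
trtllm-build --checkpoint_dir gpt2/trt_ckpt/fp8/1-gpu \
             --gpt_attention_plugin float16 \
             --remove_input_padding enable \
             --strongly_typed \
             --max_num_tokens 4096 \
             --output_dir gpt2/trt_engines/fp8/1-gpu
```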

[04/22/2024-04:03:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +9, GPU +0, now: CPU 129, GPU 237 (MiB)
[04/22/2024-04:03:31] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1974, GPU +350, now: CPU 2239, GPU 587 (MiB)
[04/22/2024-04:03:31] [TRT-LLM] [I] Set nccl_plugin to None.
[04/22/2024-04:03:31] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-04:03:32] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/22/2024-04:03:32] [TRT] [E] 9: [standardEngineBuilder.cpp::buildEngine::2266] Error Code 9: Internal Error (Networks with FP8 precision require hardware with FP8 support.)
[04/22/2024-04:03:32] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/22/2024-04:03:32] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/22/2024-04:03:32] [TRT] [I] Serialized 0 timing cache entries
[04/22/2024-04:03:32] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/22/2024-04:03:32] [TRT-LLM] [I] Serializing engine to gpt2/trt_engines/fp8/1-gpu/rank0.engine...
Traceback (most recent call last):
  File "/opt/python/3.10.12/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/opt/python/3.10.12/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 454, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/opt/python/3.10.12/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 342, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/opt/python/3.10.12/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
    engine.save(output_dir)
  File "/opt/python/3.10.12/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 569, in save
    serialize_engine(
  File "/opt/python/3.10.12/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
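(The TypeError here is a downstream symptom: TensorRT's builder returns None instead of raising when the build fails, and the serializer then tries to write that None to disk. A minimal sketch of the same failure mode against the raw TensorRT Python API; the toy identity network and output file name are illustrative, not the project's actual build path:)

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
x = network.add_input("x", trt.float32, (1, 4))
network.mark_output(network.add_identity(x).get_output(0))
config = builder.create_builder_config()

# build_serialized_network returns None on failure (e.g., FP8 layers on a
# GPU without FP8 support) rather than raising an exception.
serialized = builder.build_serialized_network(network, config)
if serialized is None:
    raise RuntimeError("engine build failed; see the TensorRT error log")
with open("rank0.engine", "wb") as f:
    f.write(serialized)  # writing None here is exactly the TypeError above
```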

How can I solve this issue? Thanks. My GPU is an A30.

byshiue commented 4 months ago

To build an engine with FP8, you must use hardware that supports FP8, such as Ada or Hopper GPUs. The A30 (Ampere) does not support FP8.
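(For anyone hitting the same error: FP8 requires compute capability 8.9 (Ada) or 9.0 (Hopper), and the A30 is Ampere at 8.0. A quick hedged check, assuming PyTorch is available as it is in the TensorRT-LLM containers:)

```python
import torch

# FP8 kernels require compute capability 8.9 (Ada) or 9.0+ (Hopper).
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
if (major, minor) < (8, 9):
    print("This GPU cannot build FP8 engines; build FP16/INT8 instead.")
```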