NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

can't save the engine when running triton-build #1433

Open · YunChen1227 opened this issue 3 months ago

YunChen1227 commented 3 months ago

System Info

3090 server

Who can help?

No response

Information

Tasks

Reproduction

```shell
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
    --gemm_plugin float16
```

Expected behavior

The engine directory should be built successfully.

Actual behavior

```
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 440, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 298, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 566, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
```
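For context on the error itself: `f.write(engine)` fails because the serialized engine is `None`, i.e. the TensorRT build step produced no engine bytes and the failure only surfaces at save time. A minimal sketch (hypothetical, not TensorRT-LLM's actual code) reproduces the same `TypeError`:

```python
import os

def serialize_engine(engine_bytes, path):
    """Write serialized engine bytes to disk.

    file.write() in binary mode requires a bytes-like object, so passing
    None (a failed build) raises exactly the TypeError seen above.
    """
    with open(path, "wb") as f:
        f.write(engine_bytes)

try:
    # Simulate a build that silently failed and returned None.
    serialize_engine(None, os.devnull)
except TypeError as e:
    print(e)  # a bytes-like object is required, not 'NoneType'
```

So the save failure is a symptom; the real question is why the engine build returned nothing, which is why the full build log is needed.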

Additional notes

The model we want to convert is not the original Llama 2; we have already done SFT training on it.

jli943 commented 3 months ago

Did you solve it? I have the same problem.

byshiue commented 3 months ago

Please share the full building log.

QJ1234 commented 3 months ago

I have the same problem too. Here's the full log:

```
(tensorrt) onatter@Onatter:~/TensorRT-LLM/examples/chatglm$ trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b_32k/ --gemm_plugin float16 \
    --output_dir trt_engines/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/17/2024-13:22:01] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set lookup_plugin to None.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set lora_plugin to None.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set moe_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set context_fmha to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set remove_input_padding to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set multi_block_mode to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set enable_xqa to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set multiple_profiles to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set paged_state to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set streamingllm to False.
[04/17/2024-13:22:01] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/17/2024-13:22:01] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[04/17/2024-13:22:01] [TRT-LLM] [I] Compute capability: (8, 6)
[04/17/2024-13:22:01] [TRT-LLM] [I] SM count: 30
[04/17/2024-13:22:01] [TRT-LLM] [I] SM clock: 2100 MHz
[04/17/2024-13:22:01] [TRT-LLM] [I] int4 TFLOPS: 258
[04/17/2024-13:22:01] [TRT-LLM] [I] int8 TFLOPS: 129
[04/17/2024-13:22:01] [TRT-LLM] [I] fp8 TFLOPS: 0
[04/17/2024-13:22:01] [TRT-LLM] [I] float16 TFLOPS: 64
[04/17/2024-13:22:01] [TRT-LLM] [I] bfloat16 TFLOPS: 64
[04/17/2024-13:22:01] [TRT-LLM] [I] float32 TFLOPS: 32
[04/17/2024-13:22:01] [TRT-LLM] [I] Total Memory: 12 GiB
[04/17/2024-13:22:01] [TRT-LLM] [I] Memory clock: 7001 MHz
[04/17/2024-13:22:01] [TRT-LLM] [I] Memory bus width: 192
[04/17/2024-13:22:01] [TRT-LLM] [I] Memory bandwidth: 336 GB/s
[04/17/2024-13:22:01] [TRT-LLM] [I] PCIe speed: 8000 Mbps
[04/17/2024-13:22:01] [TRT-LLM] [I] PCIe link width: 8
[04/17/2024-13:22:01] [TRT-LLM] [I] PCIe bandwidth: 8 GB/s
[04/17/2024-13:22:01] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 588, GPU 1046 (MiB)
[04/17/2024-13:22:09] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1812, GPU +312, now: CPU 2536, GPU 1358 (MiB)
[04/17/2024-13:22:09] [TRT-LLM] [I] Set nccl_plugin to None.
[04/17/2024-13:22:09] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and ChatGLMForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[... the same pair of Half/Float IElementWiseLayer warnings repeats for the input_layernorm, post_layernorm, and ELEMENTWISE_SUM tensors of layers 0 through 27 ...]
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/27/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
```
[04/17/2024-13:22:09] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/17/2024-13:22:09] [TRT] [W] Unused Input: position_ids
[04/17/2024-13:22:09] [TRT] [W] Detected layernorm nodes in FP16.
[04/17/2024-13:22:09] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/17/2024-13:22:09] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/17/2024-13:22:09] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2584, GPU 1386 (MiB)
[04/17/2024-13:22:09] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2585, GPU 1394 (MiB)
[04/17/2024-13:22:09] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/17/2024-13:22:09] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/17/2024-13:22:43] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/17/2024-13:22:43] [TRT] [I] Detected 14 inputs and 1 output network tensors.
[04/17/2024-13:23:06] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[04/17/2024-13:23:06] [TRT] [E] 1: [virtualMemoryBuffer.cpp::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[04/17/2024-13:23:06] [TRT] [W] Requested amount of GPU memory (11430526976 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[04/17/2024-13:23:06] [TRT] [E] 2:
[04/17/2024-13:23:06] [TRT] [E] 2: [globWriter.cpp::makeResizableGpuMemory::423] Error Code 2: OutOfMemory (no further information)
[04/17/2024-13:23:06] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/17/2024-13:23:06] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/17/2024-13:23:06] [TRT] [I] Serialized 165095 bytes of compilation cache.
[04/17/2024-13:23:06] [TRT] [I] Serialized 26 timing cache entries
[04/17/2024-13:23:06] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/17/2024-13:23:06] [TRT-LLM] [I] Serializing engine to trt_engines/chatglm3_6b/fp16/1-gpu/rank0.engine...
Traceback (most recent call last):
  File "/home/onatter/miniconda3/envs/tensorrt/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/onatter/miniconda3/envs/tensorrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 454, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/home/onatter/miniconda3/envs/tensorrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 342, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/home/onatter/miniconda3/envs/tensorrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
    engine.save(output_dir)
  File "/home/onatter/miniconda3/envs/tensorrt/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 569, in save
    serialize_engine(
  File "/home/onatter/miniconda3/envs/tensorrt/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
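In every variant of this report, the final TypeError is a downstream symptom: the build itself failed (out of GPU memory in the log just above; unsupported FP8 later in the thread), so the in-memory engine is None, and the save path writes it to disk unchecked. A minimal sketch of the failure mode, using a simplified stand-in for serialize_engine rather than the actual TensorRT-LLM source:

```python
import os
import tempfile

# Simplified stand-in for tensorrt_llm._common.serialize_engine:
# when the TensorRT build fails, the serialized engine is None,
# and writing None to a binary file raises exactly this TypeError.
def serialize_engine(engine, path):
    with open(path, "wb") as f:
        f.write(engine)  # fails if engine is None

path = os.path.join(tempfile.gettempdir(), "rank0.engine")
try:
    serialize_engine(None, path)  # engine is None because the build failed
except TypeError as e:
    print(e)  # a bytes-like object is required, not 'NoneType'
```

So the message about the save step is misleading; the real error is whatever made the build fail a few lines earlier in the log.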

Romzzeess commented 3 months ago

Set the flag --gpt_attention_plugin bfloat16; it worked for me.
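Applied to the reproduction commands from the issue description, the suggestion would look like the sketch below. The paths are the placeholders from the original report, and whether a bfloat16 attention plugin is compatible with a float16 checkpoint depends on the model, so treat this as a sketch rather than a verified fix:

```shell
# Sketch only: the trtllm-build invocation from the issue description,
# with the attention plugin flag suggested in this comment added.
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
             --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
             --gemm_plugin float16 \
             --gpt_attention_plugin bfloat16
```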

jli943 commented 3 months ago

My command:

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp8 \
    --output_dir ./engine_outputs \
    --gemm_plugin float16 \
    --strongly_typed \
    --gpt_attention_plugin bfloat16 \
    --workers 1

this is my log, oot@696da90ac847:/TensorRT-LLM/examples/llama/run/quantization/fp8# ./build.sh [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040900 [04/17/2024-16:36:51] [TRT-LLM] [I] Set bert_attention_plugin to float16. [04/17/2024-16:36:51] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16. [04/17/2024-16:36:51] [TRT-LLM] [I] Set gemm_plugin to float16. [04/17/2024-16:36:51] [TRT-LLM] [I] Set nccl_plugin to float16. [04/17/2024-16:36:51] [TRT-LLM] [I] Set lookup_plugin to None. [04/17/2024-16:36:51] [TRT-LLM] [I] Set lora_plugin to None. [04/17/2024-16:36:51] [TRT-LLM] [I] Set moe_plugin to float16. [04/17/2024-16:36:51] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16. [04/17/2024-16:36:51] [TRT-LLM] [I] Set context_fmha to True. [04/17/2024-16:36:51] [TRT-LLM] [I] Set context_fmha_fp32_acc to False. [04/17/2024-16:36:51] [TRT-LLM] [I] Set paged_kv_cache to True. [04/17/2024-16:36:51] [TRT-LLM] [I] Set remove_input_padding to True. [04/17/2024-16:36:51] [TRT-LLM] [I] Set use_custom_all_reduce to True. [04/17/2024-16:36:51] [TRT-LLM] [I] Set multi_block_mode to False. [04/17/2024-16:36:51] [TRT-LLM] [I] Set enable_xqa to True. [04/17/2024-16:36:51] [TRT-LLM] [I] Set attention_qk_half_accumulation to False. [04/17/2024-16:36:51] [TRT-LLM] [I] Set tokens_per_block to 128. [04/17/2024-16:36:51] [TRT-LLM] [I] Set use_paged_context_fmha to False. [04/17/2024-16:36:51] [TRT-LLM] [I] Set use_fp8_context_fmha to False. [04/17/2024-16:36:51] [TRT-LLM] [I] Set use_context_fmha_for_generation to False. [04/17/2024-16:36:51] [TRT-LLM] [I] Set multiple_profiles to False. [04/17/2024-16:36:51] [TRT-LLM] [I] Set paged_state to True. [04/17/2024-16:36:51] [TRT-LLM] [I] Set streamingllm to False. [04/17/2024-16:36:51] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_sizemax_input_len. 
It may not be optimal to set max_num_tokens=max_batch_sizemax_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads. [04/17/2024-16:36:51] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[04/17/2024-16:36:55] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2145, GPU +0, now: CPU 2991, GPU 9747 (MiB)
[04/17/2024-16:37:00] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1789, GPU +316, now: CPU 4915, GPU 10065 (MiB)
[04/17/2024-16:37:00] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[04/17/2024-16:37:00] [TRT-LLM] [I] Set nccl_plugin to None.
[04/17/2024-16:37:00] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/17/2024-16:37:08] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/17/2024-16:37:08] [TRT] [W] Unused Input: position_ids
[04/17/2024-16:37:08] [TRT] [E] 9: [standardEngineBuilder.cpp::buildEngine::2266] Error Code 9: Internal Error (Networks with FP8 precision require hardware with FP8 support.)
[04/17/2024-16:37:08] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/17/2024-16:37:08] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/17/2024-16:37:08] [TRT] [I] Serialized 0 timing cache entries
[04/17/2024-16:37:08] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/17/2024-16:37:08] [TRT-LLM] [I] Serializing engine to ./engine_outputs/rank0.engine...
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 441, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 298, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 566, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'

QJ1234 commented 3 months ago

I guess it might just be that there is not enough CUDA memory. Int8 weight-only quantization worked for me; you can give it a try.
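For the llama example used in this issue, the int8 weight-only path mentioned here roughly corresponds to the --use_weight_only flags of convert_checkpoint.py. A sketch based on the TensorRT-LLM llama example, with hypothetical output paths:

```shell
# Int8 weight-only quantization roughly halves weight memory versus fp16,
# which can avoid the OutOfMemory failure seen during engine building.
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_int8_wo \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int8_wo \
    --output_dir ./tmp/llama/7B/trt_engines/int8_wo/1-gpu \
    --gemm_plugin float16
```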

byshiue commented 2 months ago

> my command trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp8 --output_dir ./engine_outputs --gemm_plugin float16 --strongly_typed --gpt_attention_plugin bfloat16 --workers 1
>
> [full log quoted from the comment above, including:]
>
> [04/17/2024-16:37:08] [TRT] [E] 9: [standardEngineBuilder.cpp::buildEngine::2266] Error Code 9: Internal Error (Networks with FP8 precision require hardware with FP8 support.)
>
> TypeError: a bytes-like object is required, not 'NoneType'

You cannot build an FP8 engine on hardware that does not support FP8.
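For reference, TensorRT's FP8 path needs a GPU with FP8 tensor cores: Ada (compute capability 8.9, e.g. RTX 4090 or L40) or Hopper (9.0, e.g. H100). The RTX 3090 from the original report is Ampere (8.6). A small pre-flight check along these lines (a hypothetical helper, not part of TensorRT-LLM; the capability tuple would come from e.g. torch.cuda.get_device_capability()) can catch this before a long build:

```python
def supports_fp8(major: int, minor: int) -> bool:
    """Return True if a GPU with this compute capability has FP8 support.

    FP8 (E4M3/E5M2) kernels require Ada (8.9) or Hopper (9.0) and newer.
    """
    return (major, minor) >= (8, 9)

# Capability tuples as reported by torch.cuda.get_device_capability():
print(supports_fp8(8, 6))  # RTX 3090, Ampere -> False
print(supports_fp8(8, 9))  # RTX 4090, Ada    -> True
print(supports_fp8(9, 0))  # H100, Hopper     -> True
```

On unsupported hardware, int8 weight-only or plain fp16 builds (as suggested earlier in the thread) are the available alternatives.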