YunChen1227 opened this issue 3 months ago
Did you solve it? I have the same problem.
Please share the full building log.
I have the same problem too, here's the full log:

(tensorrt) onatter@Onatter:~/TensorRT-LLM/examples/chatglm$ trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b_32k/ --gemm_plugin float16 \
    --output_dir trt_engines/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/17/2024-13:22:01] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set lookup_plugin to None.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set lora_plugin to None.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set moe_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set context_fmha to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set remove_input_padding to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set multi_block_mode to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set enable_xqa to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set multiple_profiles to False.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set paged_state to True.
[04/17/2024-13:22:01] [TRT-LLM] [I] Set streamingllm to False.
[04/17/2024-13:22:01] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/17/2024-13:22:01] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[04/17/2024-13:22:01] [TRT-LLM] [I] Compute capability: (8, 6)
[04/17/2024-13:22:01] [TRT-LLM] [I] SM count: 30
[04/17/2024-13:22:01] [TRT-LLM] [I] SM clock: 2100 MHz
[04/17/2024-13:22:01] [TRT-LLM] [I] int4 TFLOPS: 258
[04/17/2024-13:22:01] [TRT-LLM] [I] int8 TFLOPS: 129
[04/17/2024-13:22:01] [TRT-LLM] [I] fp8 TFLOPS: 0
[04/17/2024-13:22:01] [TRT-LLM] [I] float16 TFLOPS: 64
[04/17/2024-13:22:01] [TRT-LLM] [I] bfloat16 TFLOPS: 64
[04/17/2024-13:22:01] [TRT-LLM] [I] float32 TFLOPS: 32
[04/17/2024-13:22:01] [TRT-LLM] [I] Total Memory: 12 GiB
[04/17/2024-13:22:01] [TRT-LLM] [I] Memory clock: 7001 MHz
[04/17/2024-13:22:01] [TRT-LLM] [I] Memory bus width: 192
[04/17/2024-13:22:01] [TRT-LLM] [I] Memory bandwidth: 336 GB/s
[04/17/2024-13:22:01] [TRT-LLM] [I] PCIe speed: 8000 Mbps
[04/17/2024-13:22:01] [TRT-LLM] [I] PCIe link width: 8
[04/17/2024-13:22:01] [TRT-LLM] [I] PCIe bandwidth: 8 GB/s
[04/17/2024-13:22:01] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 588, GPU 1046 (MiB)
[04/17/2024-13:22:09] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1812, GPU +312, now: CPU 2536, GPU 1358 (MiB)
[04/17/2024-13:22:09] [TRT-LLM] [I] Set nccl_plugin to None.
[04/17/2024-13:22:09] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and ChatGLMForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/1/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/1/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/1/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/1/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/1/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/1/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/1/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/1/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/2/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/2/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/2/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/2/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/2/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/2/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/2/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/2/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/3/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/3/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/3/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/3/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/3/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/3/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/3/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/3/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/4/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/4/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/4/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/4/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/4/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/4/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/4/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/4/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/5/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/5/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/5/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/5/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/5/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/5/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/5/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/5/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/6/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/6/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/6/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/6/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/6/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/6/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/6/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/6/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/7/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/7/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/7/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/7/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/7/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/7/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/7/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/7/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/8/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/8/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/8/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/8/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/8/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/8/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/8/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/8/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/9/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/9/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/9/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/9/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/9/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/9/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/9/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/9/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/10/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/10/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/10/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/10/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/10/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/10/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/10/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/10/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/11/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/11/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/11/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/11/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/11/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/11/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/11/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/11/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/12/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/12/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/12/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/12/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/12/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/12/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/12/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/12/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/13/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/13/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/13/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/13/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/13/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/13/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/13/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/13/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/14/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/14/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/14/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/14/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/14/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/14/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/14/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/14/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/15/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/15/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/15/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/15/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/15/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/15/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/15/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/15/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/16/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/16/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/16/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/16/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/16/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/16/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/16/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/16/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/17/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/17/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/17/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/17/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/17/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/17/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/17/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/17/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/18/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/18/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/18/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/18/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/18/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/18/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/18/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/18/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/19/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/19/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/19/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/19/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/19/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/19/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/19/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/19/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/20/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/20/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/20/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/20/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/20/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/20/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/20/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/20/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/21/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/21/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/21/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/21/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/21/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/21/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/21/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/21/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/22/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/22/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/22/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/22/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/22/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/22/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/22/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/22/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/23/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/23/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/23/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/23/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/23/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/23/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/23/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/23/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/24/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/24/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/24/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/24/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/24/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/24/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/24/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/24/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/25/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/25/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/25/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/25/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/25/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/25/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/25/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/25/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/26/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/26/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/26/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/26/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/26/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/26/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/26/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/26/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/layers/27/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/27/input_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/27/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/27/ELEMENTWISE_SUM_0_output_0 and ChatGLMForCausalLM/transformer/layers/27/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/27/post_layernorm/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/layers/27/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/layers/27/ELEMENTWISE_SUM_1_output_0 and ChatGLMForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT] [W] IElementWiseLayer with inputs ChatGLMForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and ChatGLMForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/17/2024-13:22:09] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/17/2024-13:22:09] [TRT] [W] Unused Input: position_ids
[04/17/2024-13:22:09] [TRT] [W] Detected layernorm nodes in FP16.
[04/17/2024-13:22:09] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/17/2024-13:22:09] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/17/2024-13:22:09] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2584, GPU 1386 (MiB)
[04/17/2024-13:22:09] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2585, GPU 1394 (MiB)
[04/17/2024-13:22:09] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/17/2024-13:22:09] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/17/2024-13:22:43] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/17/2024-13:22:43] [TRT] [I] Detected 14 inputs and 1 output network tensors.
[04/17/2024-13:23:06] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[04/17/2024-13:23:06] [TRT] [E] 1: [virtualMemoryBuffer.cpp::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[04/17/2024-13:23:06] [TRT] [W] Requested amount of GPU memory (11430526976 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[04/17/2024-13:23:06] [TRT] [E] 2:
[04/17/2024-13:23:06] [TRT] [E] 2: [globWriter.cpp::makeResizableGpuMemory::423] Error Code 2: OutOfMemory (no further information)
[04/17/2024-13:23:06] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/17/2024-13:23:06] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/17/2024-13:23:06] [TRT] [I] Serialized 165095 bytes of compilation cache.
[04/17/2024-13:23:06] [TRT] [I] Serialized 26 timing cache entries
[04/17/2024-13:23:06] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/17/2024-13:23:06] [TRT-LLM] [I] Serializing engine to trt_engines/chatglm3_6b/fp16/1-gpu/rank0.engine...
Traceback (most recent call last):
  File "/home/onatter/miniconda3/envs/tensorrt/bin/trtllm-build", line 8, in <module>
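For scale, the failed allocation in the OOM warning above (11430526976 bytes) works out to roughly 10.6 GiB of contiguous GPU memory on top of the builder's own overhead; a quick conversion:

```python
# Convert the failed allocation request from the build log into GiB.
requested_bytes = 11_430_526_976  # from the "Requested amount of GPU memory" warning
gib = requested_bytes / 2**30
print(f"{gib:.2f} GiB")  # 10.65 GiB
```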
Setting the flag --gpt_attention_plugin bfloat16 worked for me.
my command:
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp8 \
    --output_dir ./engine_outputs \
    --gemm_plugin float16 \
    --strongly_typed \
    --gpt_attention_plugin bfloat16 \
    --workers 1
this is my log:
root@696da90ac847:/TensorRT-LLM/examples/llama/run/quantization/fp8# ./build.sh
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040900
[04/17/2024-16:36:51] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set lookup_plugin to None.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set lora_plugin to None.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set moe_plugin to float16.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set context_fmha to True.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set remove_input_padding to True.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set multi_block_mode to False.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set enable_xqa to True.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set multiple_profiles to False.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set paged_state to True.
[04/17/2024-16:36:51] [TRT-LLM] [I] Set streamingllm to False.
[04/17/2024-16:36:51] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/17/2024-16:36:51] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[04/17/2024-16:36:55] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2145, GPU +0, now: CPU 2991, GPU 9747 (MiB)
[04/17/2024-16:37:00] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1789, GPU +316, now: CPU 4915, GPU 10065 (MiB)
[04/17/2024-16:37:00] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[04/17/2024-16:37:00] [TRT-LLM] [I] Set nccl_plugin to None.
[04/17/2024-16:37:00] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/17/2024-16:37:08] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/17/2024-16:37:08] [TRT] [W] Unused Input: position_ids
[04/17/2024-16:37:08] [TRT] [E] 9: [standardEngineBuilder.cpp::buildEngine::2266] Error Code 9: Internal Error (Networks with FP8 precision require hardware with FP8 support.)
[04/17/2024-16:37:08] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/17/2024-16:37:08] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/17/2024-16:37:08] [TRT] [I] Serialized 0 timing cache entries
[04/17/2024-16:37:08] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/17/2024-16:37:08] [TRT-LLM] [I] Serializing engine to ./engine_outputs/rank0.engine...
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 441, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 298, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 566, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
I guess it might just be that I don't have enough CUDA memory. int8 weight-only quantization worked for me; you can give it a try.
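For reference, a weight-only INT8 build follows the same flow as the FP16 one, with the weight-only flags added at the checkpoint-conversion step (flags as in the TensorRT-LLM llama example; the model and output paths below are placeholders for your own):

```
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_int8_wo \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int8_wo \
    --output_dir ./trt_engines/int8_wo/1-gpu \
    --gemm_plugin float16
```

This roughly halves the weight memory versus FP16 and avoids the FP8 hardware requirement entirely.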
You cannot build an FP8 engine on hardware that does not support FP8.
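A quick way to check is the GPU's CUDA compute capability: FP8 TensorCore kernels require Ada (SM 8.9) or Hopper (SM 9.0) and newer, while the RTX 3090 is Ampere (SM 8.6). A minimal sketch, assuming you obtain the (major, minor) pair from `torch.cuda.get_device_capability()` on a live system:

```python
def supports_fp8(major: int, minor: int) -> bool:
    """FP8 TensorCore kernels require Ada (SM 8.9) or Hopper (SM 9.0)+."""
    return (major, minor) >= (8, 9)

print(supports_fp8(8, 6))  # RTX 3090 (Ampere, SM 8.6) -> False
print(supports_fp8(9, 0))  # H100 (Hopper, SM 9.0) -> True
```

On unsupported hardware, rebuild with an FP16 or weight-only INT8 checkpoint instead of the FP8 one.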
System Info
3090 server
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
    --gemm_plugin float16
Expected behavior
the engine should be built successfully in the output directory
actual behavior
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 440, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 298, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 566, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
additional notes
the model we want to convert is not the original Llama 2; we have already done SFT training on it
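The TypeError at the bottom of the traceback is a symptom, not the root cause: the engine build failed earlier, so serialize_engine receives None instead of engine bytes, and writing None to the output file raises. A minimal sketch of that failure mode (io.BytesIO stands in for the rank0.engine file):

```python
import io

engine = None  # the build failed, so no serialized engine was produced

buf = io.BytesIO()  # stands in for the rank0.engine output file
try:
    buf.write(engine)  # what serialize_engine effectively does
except TypeError as e:
    print(e)  # a bytes-like object is required, not 'NoneType'
```

So the error to debug is whatever the builder reported before "Engine building failed", not the serialization step.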