NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

What are the advantages of int8_kv_cache? #1397

Closed sirodeneko closed 5 months ago

sirodeneko commented 6 months ago

I ran some tests on int8_kv_cache:

The test model is Mistral-7B. My inference code is based on run.py, with latency measurement added around runner.generate and a warm-up pass before timing. Input length is 256; output length is 256.

  • Comparative inference latency: int8_kv_cache introduces a slight latency increase, and the gap becomes more pronounced as the batch size grows.

| batch size | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fp16 | 3.34 | 3.37 | 3.53 | 3.69 | 3.96 | 4.45 | 5.50 |
| fp16 + int8_kv_cache | 3.45 | 3.49 | 3.67 | 3.83 | 4.18 | 4.75 | 5.98 |
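For context, the timing methodology described above can be sketched as follows (the `runner.generate` call in the usage comment is illustrative, not the exact run.py code):

```python
import time
import statistics

def benchmark(generate, n_warmup=3, n_runs=10):
    """Median latency of generate(), excluding warm-up runs.

    Warm-up iterations absorb one-time costs (CUDA context creation,
    kernel autotuning, allocator growth) so they do not skew the numbers.
    """
    for _ in range(n_warmup):
        generate()                          # warm-up, not timed
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# e.g. benchmark(lambda: runner.generate(batch_input_ids, max_new_tokens=256))
```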

```shell
trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16 \
    --output_dir /opt/tiger/mixtral/trt_engines/fp16 \
    --gemm_plugin float16 \
    --max_batch_size 120
```

  • With fp16 + int8_kv_cache on an A800-40G, an OOM occurs at batch_size = 120, so it actually fares worse than plain fp16.

```shell
python3 /opt/tiger/TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /opt/tiger/mixtral/Mistral-7B-Instruct-v0.2 \
    --output_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
    --dtype float16 \
    --int8_kv_cache
```

```shell
trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
    --output_dir /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache \
    --gemm_plugin float16 \
    --strongly_typed \
    --max_batch_size 120
```
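For reference, a back-of-the-envelope estimate of what int8_kv_cache should save on the KV pool itself (Mistral-7B GQA dimensions assumed from the public config: 32 layers, 8 KV heads, head dim 128; sequence length taken as 256 in + 256 out):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

fp16_tok = kv_bytes_per_token(32, 8, 128, 2)   # 131072 B = 128 KiB per token
int8_tok = kv_bytes_per_token(32, 8, 128, 1)   #  65536 B =  64 KiB per token

tokens = 120 * (256 + 256)                      # batch 120, 256 in + 256 out
print(fp16_tok * tokens / 2**30)                # 7.5  (GiB)
print(int8_tok * tokens / 2**30)                # 3.75 (GiB)
```

By this estimate the KV cache itself should roughly halve, so an OOM that appears only with int8_kv_cache suggests the memory is going somewhere other than the KV pool, which is why the build logs below matter.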

byshiue commented 6 months ago

Could you share the full engine-build logs for both FP16 and FP16 + int8_kv_cache? In theory they should use the same activation memory, and int8_kv_cache should have a smaller memory footprint at runtime.
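Note that a slight latency increase is expected with int8_kv_cache: the KV path gains a quantize step on every write and a dequantize step on every read. A minimal per-tensor symmetric sketch of that round trip (illustrative only; the real scales come from calibration in convert_checkpoint --int8_kv_cache and the ops are fused into the attention kernels, and the scale value below is hypothetical):

```python
def quantize(x, scale):
    # symmetric int8 quantization with a precomputed calibration scale
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    return q * scale

scale = 2.0 / 127                 # calibrated amax / int8 max (made-up amax)
vals = [0.5, -1.25, 2.0]
q = [quantize(v, scale) for v in vals]      # stored as int8 in the KV cache
deq = [dequantize(v, scale) for v in q]     # recovered at attention time
```

The storage halves, but each cached value pays for the extra multiply and clamp, which matches the small, batch-size-dependent slowdown measured above.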

sirodeneko commented 6 months ago

fp16

<Trial 25029339 worker_0> tiger $ trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16 \
--output_dir /opt/tiger/mixtral/trt_engines/fp16 \
--gemm_plugin float16 \
--max_batch_size 32
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[n122-137-008:36712] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024032600
[04/07/2024-15:13:14] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set lookup_plugin to None.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set lora_plugin to None.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set moe_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set context_fmha to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set remove_input_padding to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set multi_block_mode to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set enable_xqa to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set multiple_profiles to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set paged_state to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set streamingllm to False.
[04/07/2024-15:13:14] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/07/2024-15:13:14] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[04/07/2024-15:13:14] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/07/2024-15:13:14] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 667, GPU 423 (MiB)
[04/07/2024-15:13:17] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2776, GPU 773 (MiB)
[04/07/2024-15:13:17] [TRT-LLM] [I] Set nccl_plugin to None.
[04/07/2024-15:13:17] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[... the same pair of Half/Float IElementWiseLayer warnings repeats for layers 1 through 26 ...]
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/26/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/26/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/26/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/26/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/26/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/27/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/27/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/27/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/27/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/28/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/28/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/28/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/28/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/29/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/29/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/29/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/29/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/30/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/30/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/30/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/30/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/31/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/31/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/31/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/31/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/07/2024-15:13:18] [TRT] [W] Unused Input: position_ids
[04/07/2024-15:13:18] [TRT] [W] Detected layernorm nodes in FP16.
[04/07/2024-15:13:18] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/07/2024-15:13:18] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/07/2024-15:13:18] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2816, GPU 799 (MiB)
[04/07/2024-15:13:18] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2818, GPU 809 (MiB)
[04/07/2024-15:13:18] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:13:18] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/07/2024-15:13:28] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/07/2024-15:13:28] [TRT] [I] Detected 13 inputs and 1 output network tensors.
[04/07/2024-15:13:34] [TRT] [I] Total Host Persistent Memory: 63184
[04/07/2024-15:13:34] [TRT] [I] Total Device Persistent Memory: 0
[04/07/2024-15:13:34] [TRT] [I] Total Scratch Memory: 537001984
[04/07/2024-15:13:34] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 557 steps to complete.
[04/07/2024-15:13:34] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 26.1125ms to assign 15 blocks to 557 nodes requiring 3120567296 bytes.
[04/07/2024-15:13:34] [TRT] [I] Total Activation Memory: 3120566272
[04/07/2024-15:13:34] [TRT] [I] Total Weights Memory: 14483464192
[04/07/2024-15:13:34] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2951, GPU 14639 (MiB)
[04/07/2024-15:13:34] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2952, GPU 14649 (MiB)
[04/07/2024-15:13:34] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:13:34] [TRT] [I] Engine generation completed in 16.3055 seconds.
[04/07/2024-15:13:34] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 500 MiB, GPU 13813 MiB
[04/07/2024-15:13:34] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +13813, now: CPU 0, GPU 13813 (MiB)
[04/07/2024-15:13:41] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 31353 MiB
[04/07/2024-15:13:41] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:23
[04/07/2024-15:13:41] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/07/2024-15:13:41] [TRT] [I] Serialized 169938 bytes of compilation cache.
[04/07/2024-15:13:41] [TRT] [I] Serialized 12 timing cache entries
[04/07/2024-15:13:41] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/07/2024-15:13:41] [TRT-LLM] [I] Serializing engine to /opt/tiger/mixtral/trt_engines/fp16/rank0.engine...
[04/07/2024-15:13:55] [TRT-LLM] [I] Engine serialized. Total time: 00:00:13
[04/07/2024-15:13:56] [TRT-LLM] [I] Total time of building all engines: 00:00:41

int8_kv_cache

<Trial 25029339 worker_0> tiger $ trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
--output_dir /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache \
--gemm_plugin float16 \
--strongly_typed \
--max_batch_size 32
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[n122-137-008:36347] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024032600
[04/07/2024-15:10:59] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set lookup_plugin to None.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set lora_plugin to None.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set moe_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set context_fmha to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set remove_input_padding to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set multi_block_mode to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set enable_xqa to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set multiple_profiles to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set paged_state to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set streamingllm to False.
[04/07/2024-15:10:59] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/07/2024-15:10:59] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[04/07/2024-15:10:59] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/07/2024-15:10:59] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 667, GPU 423 (MiB)
[04/07/2024-15:11:02] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2776, GPU 773 (MiB)
[04/07/2024-15:11:02] [TRT-LLM] [I] Set nccl_plugin to None.
[04/07/2024-15:11:02] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:11:02] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/07/2024-15:11:02] [TRT] [W] Unused Input: position_ids
[04/07/2024-15:11:02] [TRT] [W] Detected layernorm nodes in FP16.
[04/07/2024-15:11:02] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/07/2024-15:11:02] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/07/2024-15:11:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2818, GPU 799 (MiB)
[04/07/2024-15:11:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2819, GPU 809 (MiB)
[04/07/2024-15:11:02] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:11:02] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/07/2024-15:11:12] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/07/2024-15:11:12] [TRT] [I] Detected 13 inputs and 1 output network tensors.
[04/07/2024-15:11:17] [TRT] [I] Total Host Persistent Memory: 97520
[04/07/2024-15:11:17] [TRT] [I] Total Device Persistent Memory: 0
[04/07/2024-15:11:17] [TRT] [I] Total Scratch Memory: 537001984
[04/07/2024-15:11:17] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 654 steps to complete.
[04/07/2024-15:11:17] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 40.3158ms to assign 15 blocks to 654 nodes requiring 4026536960 bytes.
[04/07/2024-15:11:17] [TRT] [I] Total Activation Memory: 4026535936
[04/07/2024-15:11:17] [TRT] [I] Total Weights Memory: 14483478016
[04/07/2024-15:11:18] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2841, GPU 14635 (MiB)
[04/07/2024-15:11:18] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2841, GPU 14645 (MiB)
[04/07/2024-15:11:18] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:11:18] [TRT] [I] Engine generation completed in 15.0914 seconds.
[04/07/2024-15:11:18] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 13813 MiB
[04/07/2024-15:11:18] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +13813, now: CPU 0, GPU 13813 (MiB)
[04/07/2024-15:11:25] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 31245 MiB
[04/07/2024-15:11:25] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:22
[04/07/2024-15:11:25] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/07/2024-15:11:25] [TRT] [I] Serialized 126992 bytes of compilation cache.
[04/07/2024-15:11:25] [TRT] [I] Serialized 9 timing cache entries
[04/07/2024-15:11:25] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/07/2024-15:11:25] [TRT-LLM] [I] Serializing engine to /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache/rank0.engine...
[04/07/2024-15:11:36] [TRT-LLM] [I] Engine serialized. Total time: 00:00:10
[04/07/2024-15:11:37] [TRT-LLM] [I] Total time of building all engines: 00:00:37

I have two questions.

byshiue commented 6 months ago

Is the increase in latency expected?

For the same batch size, it might happen because we need to do further dequantization on the int8 KV cache.
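The extra dequantization work mentioned above can be sketched in a few lines. This is an illustrative per-tensor symmetric INT8 scheme, not TensorRT-LLM's actual kernel: storing K/V as int8 halves the memory versus fp16, but every attention step pays an extra multiply per element to dequantize, which is consistent with the small latency rise in the table.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    # Per-tensor symmetric quantization: map [-max, max] onto [-127, 127].
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # The extra work paid on every read of an int8 KV cache entry.
    return q.astype(np.float32) * scale

kv = np.random.randn(4, 128).astype(np.float32)
q, scale = quantize_int8(kv)
restored = dequantize_int8(q, scale)

assert q.nbytes == kv.astype(np.float16).nbytes // 2  # half the fp16 footprint
assert float(np.abs(restored - kv).max()) <= scale    # error bounded by one quant step
```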

Is it expected that the maximum batch size allowed at inference time has not increased?

It depends on the KV cache buffer size vs. the activation buffer size. You could try a test with input length 10 and output length 502 (the total length is still 512). In such a case, the activation buffer is small and the KV cache buffer accounts for a larger share of memory, so int8 KV cache should bring a more noticeable benefit.
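A back-of-envelope estimate shows why the cache/activation ratio matters. The Mistral-7B shapes below (32 layers, 8 KV heads via GQA, head dim 128) are assumed from the public config and may need adjusting for your checkpoint; the point is that the KV cache scales with batch and sequence length, so halving bytes-per-element frees the most memory exactly when those dominate.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int = 32,
                   n_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# fp16 (2 bytes/elem) vs. int8 (1 byte/elem) at batch 120, total length 512
fp16 = kv_cache_bytes(batch=120, seq_len=512, bytes_per_elem=2)
int8 = kv_cache_bytes(batch=120, seq_len=512, bytes_per_elem=1)
print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB, int8: {int8 / 2**30:.2f} GiB")
```

Under these assumed shapes, int8 halves the cache footprint; whether that translates into a larger feasible batch size depends on how much of the remaining GPU memory the activation buffers consume.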

Hukongtao commented 6 months ago

A question unrelated to the topic: when I use int8_kv_cache, I got an error:

TypeError: a bytes-like object is required, not 'NoneType'

Have you ever encountered this? https://github.com/NVIDIA/TensorRT-LLM/issues/1268 I can fix the problem by adding --strongly_typed, but I'm still confused: it looks like you didn't use --strongly_typed.

sirodeneko commented 6 months ago

A question unrelated to the topic: when I use int8_kv_cache, I got an error:

TypeError: a bytes-like object is required, not 'NoneType'

Have you ever encountered this? #1268 I can fix the problem by adding --strongly_typed, but I'm still confused: it looks like you didn't use --strongly_typed.

I also used this flag:

trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
--output_dir /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache \
--gemm_plugin float16 \
--strongly_typed \
--max_batch_size 120
byshiue commented 5 months ago

Using --strongly_typed is currently required when int8_kv_cache is enabled.