NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

What are the advantages of int8_kv_cache? #1397

Closed sirodeneko closed 5 months ago

sirodeneko commented 6 months ago

I ran some tests on int8_kv_cache:

The test model is Mistral-7B. My inference code is based on run.py, with latency measurement added around runner.generate and a warm-up pass before timing. Input length is 256; output length is 256.

  • Comparative inference latency: int8_kv_cache introduces a slight latency increase, and the gap becomes more pronounced as the batch size grows.

| batch size | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fp16 | 3.34 | 3.37 | 3.53 | 3.69 | 3.96 | 4.45 | 5.50 |
| fp16 + int8_kv_cache | 3.45 | 3.49 | 3.67 | 3.83 | 4.18 | 4.75 | 5.98 |
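For context, the timing methodology described above can be sketched as follows (the `runner.generate` call in the usage comment is illustrative, not the exact run.py code):

```python
import time
import statistics

def benchmark(generate, n_warmup=3, n_runs=10):
    """Median latency of generate(), excluding warm-up runs.

    Warm-up iterations absorb one-time costs (CUDA context creation,
    kernel autotuning, allocator growth) so they do not skew the numbers.
    """
    for _ in range(n_warmup):
        generate()                          # warm-up, not timed
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# e.g. benchmark(lambda: runner.generate(batch_input_ids, max_new_tokens=256))
```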

```shell
trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16 \
    --output_dir /opt/tiger/mixtral/trt_engines/fp16 \
    --gemm_plugin float16 \
    --max_batch_size 120
```

  • With fp16 + int8_kv_cache on an A800-40G, an OOM occurs at batch_size = 120, so it actually fares worse than plain fp16.

```shell
python3 /opt/tiger/TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /opt/tiger/mixtral/Mistral-7B-Instruct-v0.2 \
    --output_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
    --dtype float16 \
    --int8_kv_cache
```

```shell
trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
    --output_dir /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache \
    --gemm_plugin float16 \
    --strongly_typed \
    --max_batch_size 120
```
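For reference, a back-of-the-envelope estimate of what int8_kv_cache should save on the KV pool itself (Mistral-7B GQA dimensions assumed from the public config: 32 layers, 8 KV heads, head dim 128; sequence length taken as 256 in + 256 out):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

fp16_tok = kv_bytes_per_token(32, 8, 128, 2)   # 131072 B = 128 KiB per token
int8_tok = kv_bytes_per_token(32, 8, 128, 1)   #  65536 B =  64 KiB per token

tokens = 120 * (256 + 256)                      # batch 120, 256 in + 256 out
print(fp16_tok * tokens / 2**30)                # 7.5  (GiB)
print(int8_tok * tokens / 2**30)                # 3.75 (GiB)
```

By this estimate the KV cache itself should roughly halve, so an OOM that appears only with int8_kv_cache suggests the memory is going somewhere other than the KV pool, which is why the build logs below matter.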

byshiue commented 6 months ago

Could you share the full engine-build logs for both FP16 and FP16 + int8_kv_cache? In theory they should use the same activation memory, and int8_kv_cache should have a smaller memory footprint at runtime.
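Note that a slight latency increase is expected with int8_kv_cache: the KV path gains a quantize step on every write and a dequantize step on every read. A minimal per-tensor symmetric sketch of that round trip (illustrative only; the real scales come from calibration in convert_checkpoint --int8_kv_cache and the ops are fused into the attention kernels, and the scale value below is hypothetical):

```python
def quantize(x, scale):
    # symmetric int8 quantization with a precomputed calibration scale
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    return q * scale

scale = 2.0 / 127                 # calibrated amax / int8 max (made-up amax)
vals = [0.5, -1.25, 2.0]
q = [quantize(v, scale) for v in vals]      # stored as int8 in the KV cache
deq = [dequantize(v, scale) for v in q]     # recovered at attention time
```

The storage halves, but each cached value pays for the extra multiply and clamp, which matches the small, batch-size-dependent slowdown measured above.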

sirodeneko commented 6 months ago

fp16

<Trial 25029339 worker_0> tiger $ trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16 \
--output_dir /opt/tiger/mixtral/trt_engines/fp16 \
--gemm_plugin float16 \
--max_batch_size 32
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[n122-137-008:36712] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024032600
[04/07/2024-15:13:14] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set lookup_plugin to None.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set lora_plugin to None.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set moe_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set context_fmha to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set remove_input_padding to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set multi_block_mode to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set enable_xqa to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set multiple_profiles to False.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set paged_state to True.
[04/07/2024-15:13:14] [TRT-LLM] [I] Set streamingllm to False.
[04/07/2024-15:13:14] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/07/2024-15:13:14] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[04/07/2024-15:13:14] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/07/2024-15:13:14] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 667, GPU 423 (MiB)
[04/07/2024-15:13:17] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2776, GPU 773 (MiB)
[04/07/2024-15:13:17] [TRT-LLM] [I] Set nccl_plugin to None.
[04/07/2024-15:13:17] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[... the same pair of Half/Float IElementWiseLayer warnings repeats for layers 1 through 26 ...]
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/26/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/26/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/26/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/26/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/26/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/27/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/27/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/27/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/27/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/27/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/28/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/28/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/28/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/28/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/28/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/29/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/29/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/29/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/29/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/29/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/30/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/30/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/30/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/30/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/30/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/31/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/31/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/31/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/31/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/07/2024-15:13:17] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/07/2024-15:13:18] [TRT] [W] Unused Input: position_ids
[04/07/2024-15:13:18] [TRT] [W] Detected layernorm nodes in FP16.
[04/07/2024-15:13:18] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/07/2024-15:13:18] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/07/2024-15:13:18] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2816, GPU 799 (MiB)
[04/07/2024-15:13:18] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2818, GPU 809 (MiB)
[04/07/2024-15:13:18] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:13:18] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/07/2024-15:13:28] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/07/2024-15:13:28] [TRT] [I] Detected 13 inputs and 1 output network tensors.
[04/07/2024-15:13:34] [TRT] [I] Total Host Persistent Memory: 63184
[04/07/2024-15:13:34] [TRT] [I] Total Device Persistent Memory: 0
[04/07/2024-15:13:34] [TRT] [I] Total Scratch Memory: 537001984
[04/07/2024-15:13:34] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 557 steps to complete.
[04/07/2024-15:13:34] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 26.1125ms to assign 15 blocks to 557 nodes requiring 3120567296 bytes.
[04/07/2024-15:13:34] [TRT] [I] Total Activation Memory: 3120566272
[04/07/2024-15:13:34] [TRT] [I] Total Weights Memory: 14483464192
[04/07/2024-15:13:34] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2951, GPU 14639 (MiB)
[04/07/2024-15:13:34] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2952, GPU 14649 (MiB)
[04/07/2024-15:13:34] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:13:34] [TRT] [I] Engine generation completed in 16.3055 seconds.
[04/07/2024-15:13:34] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 500 MiB, GPU 13813 MiB
[04/07/2024-15:13:34] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +13813, now: CPU 0, GPU 13813 (MiB)
[04/07/2024-15:13:41] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 31353 MiB
[04/07/2024-15:13:41] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:23
[04/07/2024-15:13:41] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/07/2024-15:13:41] [TRT] [I] Serialized 169938 bytes of compilation cache.
[04/07/2024-15:13:41] [TRT] [I] Serialized 12 timing cache entries
[04/07/2024-15:13:41] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/07/2024-15:13:41] [TRT-LLM] [I] Serializing engine to /opt/tiger/mixtral/trt_engines/fp16/rank0.engine...
[04/07/2024-15:13:55] [TRT-LLM] [I] Engine serialized. Total time: 00:00:13
[04/07/2024-15:13:56] [TRT-LLM] [I] Total time of building all engines: 00:00:41

int8_kv_cache

<Trial 25029339 worker_0> tiger $ trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
--output_dir /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache \
--gemm_plugin float16 \
--strongly_typed \
--max_batch_size 32
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[n122-137-008:36347] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024032600
[04/07/2024-15:10:59] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set lookup_plugin to None.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set lora_plugin to None.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set moe_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set context_fmha to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set remove_input_padding to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set multi_block_mode to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set enable_xqa to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set multiple_profiles to False.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set paged_state to True.
[04/07/2024-15:10:59] [TRT-LLM] [I] Set streamingllm to False.
[04/07/2024-15:10:59] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/07/2024-15:10:59] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[04/07/2024-15:10:59] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/07/2024-15:10:59] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 667, GPU 423 (MiB)
[04/07/2024-15:11:02] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2776, GPU 773 (MiB)
[04/07/2024-15:11:02] [TRT-LLM] [I] Set nccl_plugin to None.
[04/07/2024-15:11:02] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/07/2024-15:11:02] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/07/2024-15:11:02] [TRT] [W] Unused Input: position_ids
[04/07/2024-15:11:02] [TRT] [W] Detected layernorm nodes in FP16.
[04/07/2024-15:11:02] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/07/2024-15:11:02] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/07/2024-15:11:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2818, GPU 799 (MiB)
[04/07/2024-15:11:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2819, GPU 809 (MiB)
[04/07/2024-15:11:02] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:11:02] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/07/2024-15:11:12] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/07/2024-15:11:12] [TRT] [I] Detected 13 inputs and 1 output network tensors.
[04/07/2024-15:11:17] [TRT] [I] Total Host Persistent Memory: 97520
[04/07/2024-15:11:17] [TRT] [I] Total Device Persistent Memory: 0
[04/07/2024-15:11:17] [TRT] [I] Total Scratch Memory: 537001984
[04/07/2024-15:11:17] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 654 steps to complete.
[04/07/2024-15:11:17] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 40.3158ms to assign 15 blocks to 654 nodes requiring 4026536960 bytes.
[04/07/2024-15:11:17] [TRT] [I] Total Activation Memory: 4026535936
[04/07/2024-15:11:17] [TRT] [I] Total Weights Memory: 14483478016
[04/07/2024-15:11:18] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2841, GPU 14635 (MiB)
[04/07/2024-15:11:18] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2841, GPU 14645 (MiB)
[04/07/2024-15:11:18] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/07/2024-15:11:18] [TRT] [I] Engine generation completed in 15.0914 seconds.
[04/07/2024-15:11:18] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 13813 MiB
[04/07/2024-15:11:18] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +13813, now: CPU 0, GPU 13813 (MiB)
[04/07/2024-15:11:25] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 31245 MiB
[04/07/2024-15:11:25] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:22
[04/07/2024-15:11:25] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/07/2024-15:11:25] [TRT] [I] Serialized 126992 bytes of compilation cache.
[04/07/2024-15:11:25] [TRT] [I] Serialized 9 timing cache entries
[04/07/2024-15:11:25] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/07/2024-15:11:25] [TRT-LLM] [I] Serializing engine to /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache/rank0.engine...
[04/07/2024-15:11:36] [TRT-LLM] [I] Engine serialized. Total time: 00:00:10
[04/07/2024-15:11:37] [TRT-LLM] [I] Total time of building all engines: 00:00:37

I have two questions.

byshiue commented 6 months ago

Is the increase in latency expected?

For the same batch size, it might happen because we need to do further dequantization on the int8 KV cache.
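The extra dequantization work mentioned above can be sketched in a few lines. This is an illustrative per-tensor symmetric INT8 scheme, not TensorRT-LLM's actual kernel: storing K/V as int8 halves the memory versus fp16, but every attention step pays an extra multiply per element to dequantize, which is consistent with the small latency rise in the table.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    # Per-tensor symmetric quantization: map [-max, max] onto [-127, 127].
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # The extra work paid on every read of an int8 KV cache entry.
    return q.astype(np.float32) * scale

kv = np.random.randn(4, 128).astype(np.float32)
q, scale = quantize_int8(kv)
restored = dequantize_int8(q, scale)

assert q.nbytes == kv.astype(np.float16).nbytes // 2  # half the fp16 footprint
assert float(np.abs(restored - kv).max()) <= scale    # error bounded by one quant step
```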

Is it expected that the maximum batch size allowed at inference time has not increased?

It depends on the KV cache buffer size vs. the activation buffer size. You could try a test with input length 10 and output length 502 (the total length is still 512). In such a case, the activation buffer is small and the KV cache buffer accounts for a larger share of memory, so int8 KV cache should bring a more noticeable benefit.
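A back-of-envelope estimate shows why the cache/activation ratio matters. The Mistral-7B shapes below (32 layers, 8 KV heads via GQA, head dim 128) are assumed from the public config and may need adjusting for your checkpoint; the point is that the KV cache scales with batch and sequence length, so halving bytes-per-element frees the most memory exactly when those dominate.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int = 32,
                   n_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# fp16 (2 bytes/elem) vs. int8 (1 byte/elem) at batch 120, total length 512
fp16 = kv_cache_bytes(batch=120, seq_len=512, bytes_per_elem=2)
int8 = kv_cache_bytes(batch=120, seq_len=512, bytes_per_elem=1)
print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB, int8: {int8 / 2**30:.2f} GiB")
```

Under these assumed shapes, int8 halves the cache footprint; whether that translates into a larger feasible batch size depends on how much of the remaining GPU memory the activation buffers consume.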

Hukongtao commented 6 months ago

A question unrelated to the topic: when I use int8_kv_cache, I got an error:

TypeError: a bytes-like object is required, not 'NoneType'

Have you ever encountered this? https://github.com/NVIDIA/TensorRT-LLM/issues/1268 I can fix the problem by adding --strongly_typed, but I'm still confused: it looks like you didn't use --strongly_typed.

sirodeneko commented 6 months ago

A question unrelated to the topic: when I use int8_kv_cache, I got an error:

TypeError: a bytes-like object is required, not 'NoneType'

Have you ever encountered this? #1268 I can fix the problem by adding --strongly_typed, but I'm still confused: it looks like you didn't use --strongly_typed.

I also used this flag:

trtllm-build --checkpoint_dir /opt/tiger/mixtral/tllm_checkpoint_mixtral_1gpu/fp16_int8_kv_cache \
--output_dir /opt/tiger/mixtral/trt_engines/fp16_int8_kv_cache \
--gemm_plugin float16 \
--strongly_typed \
--max_batch_size 120
byshiue commented 5 months ago

Using --strongly_typed is currently required when int8_kv_cache is enabled.