NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llama3 SQ (per_token + per_channel) build failed on main branch #1694

Open NaNAGISaSA opened 4 months ago

NaNAGISaSA commented 4 months ago

System Info

Who can help?

@Tracin

Information

Tasks

Reproduction

model_name=llama3_8b
hf_model_dir=/some-path-to-Meta-Llama-3-8B-Instruct
convert_model_dir=/some-path
trt_engine_dir=/some-path
tp_size=2
dtype=sq

python3 examples/llama/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --workers ${tp_size} \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --dtype bfloat16 \
    --output_dir ${convert_model_dir}/${dtype}/${tp_size}-gpu/

trtllm-build --checkpoint_dir ${convert_model_dir}/${dtype}/${tp_size}-gpu/ \
    --output_dir ${trt_engine_dir}/${dtype}/${tp_size}-gpu/ \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --context_fmha_fp32_acc enable \
    --remove_input_padding enable \
    --multi_block_mode enable \
    --paged_kv_cache disable \
    --workers ${tp_size} \
    --max_batch_size 8 \
    --max_input_len 512 \
    --max_output_len 512

Expected behavior

The engine builds successfully.

actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00, 1.07s/it]
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1486: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail You can avoid this message in future by passing the argument trust_remote_code=True. Passing trust_remote_code=True will be mandatory to load this dataset from the next major release of datasets.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
calibrating model: 100%|██████████| 512/512 [01:15<00:00, 6.78it/s]
Weights loaded. Total time: 00:00:04 (printed once per rank)
Total time of converting checkpoints: 00:03:49
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[05/29/2024-07:13:28] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set lookup_plugin to None.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set lora_plugin to None.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set moe_plugin to float16.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set context_fmha to True.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set context_fmha_fp32_acc to True.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set paged_kv_cache to False.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set remove_input_padding to True.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set multi_block_mode to True.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set enable_xqa to True.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set tokens_per_block to 64.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set multiple_profiles to False.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set paged_state to True.
[05/29/2024-07:13:28] [TRT-LLM] [I] Set streamingllm to False.
[05/29/2024-07:13:28] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/29/2024-07:13:28] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[05/29/2024-07:13:34] [TRT-LLM] [W] Parameter was initialized as DataType.BF16 but set to DataType.FLOAT
... (the warning above is repeated many times, once per affected parameter on each rank)
[05/29/2024-07:13:34] [TRT] [I] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 265, GPU 421 (MiB)
[05/29/2024-07:13:37] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1939, GPU +348, now: CPU 2340, GPU 769 (MiB)
[05/29/2024-07:13:37] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[05/29/2024-07:13:37] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[05/29/2024-07:13:37] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[05/29/2024-07:13:37] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[05/29/2024-07:13:37] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[05/29/2024-07:13:37] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[05/29/2024-07:13:37] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
[05/29/2024-07:13:37] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/29/2024-07:13:37] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/attention/dense/CAST_1_output_0: first input has type BFloat16 but second input has type Half.
[05/29/2024-07:13:37] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_PROD_1_output_0 and LLaMAForCausalLM/transformer/layers/0/mlp/CAST_0_output_0: first input has type Half but second input has type BFloat16.
[05/29/2024-07:13:37] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/mlp/proj/CAST_1_output_0: first input has type BFloat16 but second input has type Half.
... (similar IElementWiseLayer mixed-type warnings are repeated for every layer)

... too long to submit issue, omitting some logs

[05/29/2024-07:13:40] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[05/29/2024-07:13:40] [TRT] [W] Unused Input: position_ids
[05/29/2024-07:13:40] [TRT] [E] 4: [network.cpp::validate::3399] Error Code 4: Internal Error (fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder)
[05/29/2024-07:13:40] [TRT-LLM] [E] Engine building failed, please check the error log.
... (the four lines above are printed once per rank)
[05/29/2024-07:13:40] [TRT] [I] Serialized 26 bytes of code generator cache.
[05/29/2024-07:13:40] [TRT] [I] Serialized 0 timing cache entries
[05/29/2024-07:13:40] [TRT-LLM] [I] Timing cache serialized to model.cache
[05/29/2024-07:13:41] [TRT-LLM] [I] Total time of building all engines: 00:00:12

additional notes

Adding --strongly_typed gives a different error message:

[05/29/2024-07:27:45] [TRT] [E] 4: [elementWiseNode.cpp::validateTypes::30] Error Code 4: Internal Error ((Unnamed Layer* 21) [ElementWise]: operation SUM must have same input types BFloat16 and Half)
[05/29/2024-07:27:45] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2276] Error Code 4: Internal Error (LLaMAForCausalLM/transformer/layers/0/post_layernorm/PLUGIN_V2_RmsnormQuantization_0: output shape can not be computed)
[05/29/2024-07:27:45] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2276] Error Code 4: Internal Error (LLaMAForCausalLM/transformer/layers/0/mlp/fc/PLUGIN_V2_SmoothQuantGemm_0: output shape can not be computed)
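For context, strongly-typed mode makes TensorRT enforce that both inputs of an elementwise layer share a dtype instead of inserting implicit casts, which is why the mixed BFloat16/Half graph now fails at validation rather than at builder configuration. A toy Python sketch of that validation rule (purely illustrative, not the TensorRT API):

```python
import numpy as np

def strict_elementwise_sum(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy model of strongly-typed elementwise validation: refuse to add
    tensors of different dtypes instead of implicitly casting one side."""
    if a.dtype != b.dtype:
        raise TypeError(
            f"operation SUM must have same input types {a.dtype} and {b.dtype}"
        )
    return a + b
```

In the failing graph above, one input arrives as BFloat16 and the other as Half, so the SUM node is rejected before any shapes can be propagated, which also explains the follow-on "output shape can not be computed" errors.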
nv-guomingz commented 4 months ago

Hi @NaNAGISaSA, thanks for reporting this issue. I can reproduce it in my local environment.

One quick workaround is to set the dtype to float16 instead of bfloat16.
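Concretely, the workaround is the same two reproduction commands with every bfloat16 flag switched to float16 (the sq_fp16 output directory name here is illustrative, not prescribed):

```shell
python3 examples/llama/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --workers ${tp_size} \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --dtype float16 \
    --output_dir ${convert_model_dir}/sq_fp16/${tp_size}-gpu/

trtllm-build --checkpoint_dir ${convert_model_dir}/sq_fp16/${tp_size}-gpu/ \
    --output_dir ${trt_engine_dir}/sq_fp16/${tp_size}-gpu/ \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha_fp32_acc enable \
    --remove_input_padding enable \
    --multi_block_mode enable \
    --paged_kv_cache disable \
    --workers ${tp_size} \
    --max_batch_size 8 \
    --max_input_len 512 \
    --max_output_len 512
```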

The root cause is that the SmoothQuant GEMM plugin doesn't support the bf16 type at the moment. If you upgrade TensorRT-LLM to f430a4b447ef4cba22698902d43eae0debf08594, there is a clearer error message for this issue: [TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: Support for bf16 is missing (/home/proj/TensorRT-LLM/cpp/tensorrt_llm/plugins/smoothQuantGemmPlugin/smoothQuantGemmPlugin.cpp:125)

DayDayupupupup commented 4 months ago

@NaNAGISaSA Have you calculated PPL? With these settings:

--smoothquant 0.5 --per_token --per_channel --dtype float16 --tp 1 --gemm_plugin float16 --max_input_len 2048 --gather_context_logits

Using the settings above on TRT-LLM 0.9, I got a PPL of 625 for Meta-Llama-3-8B-Instruct-sq, versus 8.29 for the unquantized Meta-Llama-3-8B-Instruct.
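To put those numbers in perspective: perplexity is the exponential of the mean per-token negative log-likelihood, so a jump from 8.29 to 625 means the average token NLL roughly tripled. A minimal sketch of the computation (the `perplexity` helper is hypothetical, not TRT-LLM code):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A perfectly confident model (NLL 0 on every token) has perplexity 1;
# higher values mean the model is more "surprised" by the reference text.
```

Because of the exponential, even a modest rise in average NLL blows up the reported PPL, which is why a PPL of 625 signals a genuine accuracy regression rather than noise.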

nv-guomingz commented 4 months ago

Hi @NaNAGISaSA, we've filed a bug internally to track SmoothQuant GEMM plugin support for bf16. Would you mind if we close this ticket for now?

nv-guomingz commented 4 months ago

Hi @NaNAGISaSA, we've fixed the issue that the SmoothQuant GEMM plugin doesn't support bf16. Please check again with the next TensorRT-LLM update (released on a weekly basis).

NaNAGISaSA commented 3 months ago

Hello @nv-guomingz, thank you for your reply. I have tested on the v0.10.0 branch, using the container built with make -C docker release_build on that branch. The build command is the same, and the same error occurred.

Build command:

model_name=llama3_8b
hf_model_dir=/some-path-to-Meta-Llama-3-8B-Instruct
convert_model_dir=/some-path
trt_engine_dir=/some-path
tp_size=2
dtype=sq

python3 examples/llama/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --workers ${tp_size} \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --dtype bfloat16 \
    --output_dir ${convert_model_dir}/${dtype}/${tp_size}-gpu/

trtllm-build --checkpoint_dir ${convert_model_dir}/${dtype}/${tp_size}-gpu/ \
    --output_dir ${trt_engine_dir}/${dtype}/${tp_size}-gpu/ \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --context_fmha_fp32_acc enable \
    --remove_input_padding enable \
    --multi_block_mode enable \
    --paged_kv_cache disable \
    --workers ${tp_size} \
    --max_batch_size 8 \
    --max_input_len 512 \
    --max_output_len 512

Error message:

[06/21/2024-07:25:39] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/21/2024-07:25:39] [TRT] [W] Unused Input: position_ids
[06/21/2024-07:25:39] [TRT] [E] 4: [network.cpp::validate::3399] Error Code 4: Internal Error (fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder)
[06/21/2024-07:25:39] [TRT-LLM] [E] Engine building failed, please check the error log.
[06/21/2024-07:25:39] [TRT] [I] Serialized 26 bytes of code generator cache.
[06/21/2024-07:25:39] [TRT] [I] Serialized 0 timing cache entries
[06/21/2024-07:25:39] [TRT-LLM] [I] Timing cache serialized to model.cache

Can you please check again, or tell me if my build command is wrong?