NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Bug] llama3.1-8b smoothquant error (use latest version: 5fa9436) #2025

Open fan-niu opened 1 month ago

fan-niu commented 1 month ago

System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
Versions:
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM.git (5fa9436, latest version)
- tensorrtllm_backend: https://github.com/triton-inference-server/tensorrtllm_backend (a6aa8eb)

Who can help?

No response

Reproduction

step1: convert to a SmoothQuant checkpoint

```
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3.1-8B-Instruct \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --tp_size 1
```

step2: build the engine

```
trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 32768 \
    --gpt_attention_plugin float16
```

Expected behavior

The engine is built successfully.

Actual behavior

Error log:

```
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
0.12.0.dev2024072301
Loading checkpoint shards: 100%|██████████| 7/7 [00:04<00:00, 1.49it/s]
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1491: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|██████████| 9.27k/9.27k [00:00<00:00, 44.9MB/s]
Downloading readme: 100%|██████████| 13.9k/13.9k [00:00<00:00, 71.3MB/s]
Downloading data: 100%|██████████| 159M/159M [00:01<00:00, 156MB/s]
Downloading data: 100%|██████████| 376M/376M [00:02<00:00, 163MB/s]
Downloading data: 2.11MB [00:00, 121MB/s]
Downloading data: 46.4MB [00:00, 127MB/s]
Downloading data: 2.43MB [00:00, 125MB/s]
Generating train split: 287113 examples [00:28, 10025.76 examples/s]
Generating validation split: 13368 examples [00:01, 10251.71 examples/s]
Generating test split: 11490 examples [00:01, 9442.46 examples/s]
calibrating model:   0%|          | 0/512 [00:00<?, ?it/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|██████████| 512/512 [00:26<00:00, 19.61it/s]
Weights loaded. Total time: 00:00:04
Total time of converting checkpoints: 00:02:15
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
[07/25/2024-08:48:18] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lookup_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lora_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set moe_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set remove_input_padding to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set reduce_fusion to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set enable_xqa to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set multiple_profiles to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_state to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set streamingllm to False.
[07/25/2024-08:48:18] [TRT-LLM] [W] max_seq_len is scaled to 1048576.0 by rotary scaling 8.0
[07/25/2024-08:48:18] [TRT-LLM] [I] max_seq_len is not specified, using value 1048576.0
[07/25/2024-08:48:18] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[07/25/2024-08:48:18] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
[07/25/2024-08:48:18] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 535, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 371, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 338, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 307, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 428, in from_checkpoint
    model = cls(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 364, in __call__
    obj.__post_init__()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 380, in __post_init__
    quantize(self, self.config.quantization)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 276, in quantize
    model = smooth_quantize(model, quant_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 187, in smooth_quantize
    return smooth_quantize_plugin(model, quant_config.quant_mode)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 174, in smooth_quantize_plugin
    quant_layer = quant_cls(**init_params)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 1456, in __init__
    rotary_embedding_scaling["type"])
KeyError: 'type'
```
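For context on what the traceback is pointing at, here is a minimal reproduction of the failing lookup, assuming the rope_scaling layout of the public Llama 3.1 Hugging Face config (the keys below come from that checkpoint config and are shown for illustration only):

```python
# Llama 3.1 stores its rotary scaling kind under "rope_type"; older
# checkpoints used "type". Indexing with ["type"] therefore raises the
# KeyError seen at the bottom of the log above.
rope_scaling = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}
rope_scaling["type"]  # KeyError: 'type'
```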

Additional notes

No

fan-niu commented 1 month ago

@kaiyux Can you provide some help with this issue? We need to use the SmoothQuant strategy on an A100 machine. Thank you very much.

byshiue commented 1 month ago

Thank you for the report. We can reproduce the issue and will fix it soon.

If you want to fix it locally, you can refer to the `self.rotary_embedding_scale_type` setting in tensorrt_llm/layers/attention.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/layers/attention.py#L402-L408) and apply the same handling in tensorrt_llm/quantization/layers.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/quantization/layers.py#L1454-L1459).
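For a concrete starting point, here is a minimal sketch of that local patch. It assumes the `rotary_embedding_scaling` dict and the `RotaryScalingType.from_string` helper named in the linked files; the exact lines and surrounding code in your checkout may differ:

```python
# tensorrt_llm/quantization/layers.py -- a sketch, not the exact upstream diff.

# Before: assumes the scaling config always carries a "type" key. Llama 3.1
# checkpoints carry "rope_type" instead, hence the KeyError: 'type' above.
#
#   self.rotary_embedding_scale_type = RotaryScalingType.from_string(
#       rotary_embedding_scaling["type"])

# After: mirror the fallback already used in tensorrt_llm/layers/attention.py
# and accept either key before converting it to the enum.
if rotary_embedding_scaling is not None:
    rotary_scaling_type = rotary_embedding_scaling.get(
        "type", rotary_embedding_scaling.get("rope_type"))
    self.rotary_embedding_scale_type = RotaryScalingType.from_string(
        rotary_scaling_type)
```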

Besides, there are two additional notes: 1) You use float16. As far as I know, most Llama 3.1 models are trained in bfloat16, so running inference in float16 carries some accuracy risk. 2) You don't set max_seq_len, which might lead to a very long default max_seq_len and an OOM.
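For illustration, a rebuild sketch that applies both notes. The flags mirror the commands in the reproduction above; `--max_seq_len 8192` is an assumed value to choose per deployment, not a number from this thread:

```
# Hypothetical rerun applying both notes: bfloat16 end to end, and an
# explicit max_seq_len so the rotary-scaled default (1048576) is not used.
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3.1-8B-Instruct \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --dtype bfloat16 \
    --smoothquant 0.5 --per_token --per_channel --tp_size 1

trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_seq_len 8192 \
    --max_batch_size 32
```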

fan-niu commented 1 month ago

@byshiue Thank you very much, I will verify this fix locally.

manu-web commented 1 month ago

Did this fix work? Did not work for me.

fan-niu commented 1 month ago

> Did this fix work? Did not work for me.

Already tested, it's good

github-actions[bot] commented 4 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.