NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
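For orientation, the high-level Python API from the project's quickstart looks roughly like the sketch below; this is a hedged sketch, and the model name and sampling values are examples, not taken from this issue.

    # Minimal sketch of the documented LLM API (model name is an example).
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2-7B-Instruct")
    outputs = llm.generate(["Hello, what's your name?"],
                           SamplingParams(temperature=0.8, top_p=0.95))
    print(outputs[0].outputs[0].text)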

qwen2 72b output empty after quantizing with smoothquant #2147

Open gloritygithub11 opened 3 weeks ago

gloritygithub11 commented 3 weeks ago

System Info

- tensorrt 10.2.0
- tensorrt_llm 0.12.0.dev2024072301
- A100-80G * 4

Who can help?

@Tracin

Reproduction

  1. convert
    python3 ./convert_checkpoint.py --model_dir /share/huggingface/hub/models--Qwen--Qwen2-72B/snapshots/87993795c78576318087f70b43fbf530eb7789e7 --output_dir /models/tmp/qwen2-hf/72b/trt/sq-bs4-il3072/checkpoint --dtype float16 --smoothquant 0.5 --load_model_on_cpu

output:

[08/23/2024-04:12:55] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
0.12.0.dev2024072301
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 37/37 [05:56<00:00,  9.65s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1491: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
calibrating model:   0%|                                                                                              | 0/512 [00:00<?, ?it/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|████████████████████████████████████████████████████████████████████████████████████| 512/512 [03:28<00:00,  2.45it/s]
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py:91: RuntimeWarning: overflow encountered in divide
  scale_w_orig_quant_c = 127. / act_range["w"].cpu().numpy()
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py:134: RuntimeWarning: invalid value encountered in multiply
  "weight.int8.col": to_i8(weights * scale_w_orig_quant_c),
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py:126: RuntimeWarning: invalid value encountered in cast
  to_i8 = lambda x: x.round().clip(-127, 127).astype(np.int8)
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py:91: RuntimeWarning: divide by zero encountered in divide
  scale_w_orig_quant_c = 127. / act_range["w"].cpu().numpy()
Weights loaded. Total time: 00:50:13
Total time of converting checkpoints: 01:02:02
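
The RuntimeWarnings above are worth pausing on before the build step: `scale_w_orig_quant_c = 127. / act_range["w"]` divides by the per-channel weight ranges collected during calibration, so a zero (or fp16-overflowing) range turns the scale into inf, and multiplying the weights by it yields NaNs that cannot be cast to int8. A minimal numpy sketch of that failure mode, simplified from convert.py with invented values:

    import numpy as np

    # Simplified from tensorrt_llm/models/qwen/convert.py; values are invented.
    to_i8 = lambda x: x.round().clip(-127, 127).astype(np.int8)

    # Per-channel weight ranges from calibration: one zero, one tiny.
    act_range_w = np.array([0.02, 0.0, 1e-4], dtype=np.float16)

    scale_w_orig_quant_c = 127. / act_range_w   # 0.0 -> inf (divide by zero);
                                                # 127 / 1e-4 overflows fp16 -> inf
    weights = np.array([[0.5, 0.0, -0.5]], dtype=np.float16)
    q = weights * scale_w_orig_quant_c          # 0.0 * inf -> nan (invalid multiply)
    print(to_i8(q))                             # +/-inf clips to +/-127, but nan
                                                # survives the clip and its int8
                                                # cast is undefined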
  2. build
    trtllm-build --checkpoint_dir /models/tmp/qwen2-hf/72b/trt/sq-bs4-il3072/checkpoint --output_dir /models/tmp/qwen2-hf/72b/trt/sq-bs4-il3072/engine/llm_engines --gemm_plugin float16 --gpt_attention_plugin float16 --max_batch_size 4 --max_input_len 3072 --max_seq_len 5120

output:

[08/23/2024-05:15:04] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
[08/23/2024-05:15:06] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set gemm_plugin to float16.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set lookup_plugin to None.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set lora_plugin to None.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set moe_plugin to auto.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set context_fmha to True.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set paged_kv_cache to True.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set remove_input_padding to True.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set reduce_fusion to False.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set enable_xqa to True.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set multiple_profiles to False.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set paged_state to True.
[08/23/2024-05:15:06] [TRT-LLM] [I] Set streamingllm to False.
[08/23/2024-05:15:06] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[08/23/2024-05:15:06] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[08/23/2024-05:15:06] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[08/23/2024-05:15:06] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[08/23/2024-05:15:06] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[08/23/2024-05:15:21] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/23/2024-05:15:21] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
...
[08/23/2024-05:15:21] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/23/2024-05:15:21] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/23/2024-05:15:21] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/23/2024-05:15:21] [TRT-LLM] [I] Set dtype to float16.
[08/23/2024-05:15:21] [TRT] [I] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 5266, GPU 416 (MiB)
[08/23/2024-05:15:23] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1931, GPU +354, now: CPU 7350, GPU 770 (MiB)
[08/23/2024-05:15:23] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[08/23/2024-05:15:23] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[08/23/2024-05:15:23] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[08/23/2024-05:15:23] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[08/23/2024-05:15:23] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[08/23/2024-05:15:23] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[08/23/2024-05:15:23] [TRT-LLM] [I] Set nccl_plugin to None.
[08/23/2024-05:15:24] [TRT-LLM] [I] Total optimization profiles added: 1
[08/23/2024-05:15:24] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[08/23/2024-05:15:24] [TRT] [W] Unused Input: position_ids
[08/23/2024-05:15:24] [TRT] [W] Detected layernorm nodes in FP16.
[08/23/2024-05:15:24] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[08/23/2024-05:15:27] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[08/23/2024-05:15:27] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[08/23/2024-05:15:29] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[08/23/2024-05:15:29] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[08/23/2024-05:16:07] [TRT] [I] Total Host Persistent Memory: 450912
[08/23/2024-05:16:07] [TRT] [I] Total Device Persistent Memory: 0
[08/23/2024-05:16:07] [TRT] [I] Total Scratch Memory: 268468224
[08/23/2024-05:16:07] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1696 steps to complete.
[08/23/2024-05:16:07] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 98.5468ms to assign 17 blocks to 1696 nodes requiring 1588467712 bytes.
[08/23/2024-05:16:07] [TRT] [I] Total Activation Memory: 1588466688
[08/23/2024-05:16:07] [TRT] [I] Total Weights Memory: 75277049856
[08/23/2024-05:18:55] [TRT] [I] Engine generation completed in 207.638 seconds.
[08/23/2024-05:18:55] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 69 MiB, GPU 71790 MiB
[08/23/2024-05:19:21] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 151874 MiB
[08/23/2024-05:19:21] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:03:57
[08/23/2024-05:19:21] [TRT] [I] Serialized 26 bytes of code generator cache.
[08/23/2024-05:19:21] [TRT] [I] Serialized 102646 bytes of compilation cache.
[08/23/2024-05:19:21] [TRT] [I] Serialized 16 timing cache entries
[08/23/2024-05:19:21] [TRT-LLM] [I] Timing cache serialized to model.cache
[08/23/2024-05:19:21] [TRT-LLM] [I] Serializing engine to /models/tmp/botanistgpt/qwen2-hf/72b/trt/sq-bs4-il3072/engine/llm_engines/rank0.engine...
[08/23/2024-05:19:49] [TRT-LLM] [I] Engine serialized. Total time: 00:00:28
[08/23/2024-05:19:53] [TRT-LLM] [I] Total time of building all engines: 00:04:46
  3. test
    python3 ../run.py --input_text "Hello, what's your name?" \
                  --max_output_len=50 \
                  --tokenizer_dir=/share/huggingface/hub/models--Qwen--Qwen2-72B/snapshots/87993795c78576318087f70b43fbf530eb7789e7 \
                  --engine_dir=/models/tmp/qwen2-hf/72b/trt/sq-bs4-il3072/engine/llm_engines

output:

[08/23/2024-06:26:18] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024072301 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024072301 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024072301 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 5120
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 5120
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 5119 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 71811 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1514.88 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 71789 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.82 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.17 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.15 GiB, available: 7.06 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 326
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 80
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.37 GiB for max tokens in paged KV cache (20864).
[08/23/2024-06:26:57] [TRT-LLM] [I] Load engine takes: 35.95686221122742 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, what's your name?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: ""

Expected behavior

The output is non-empty and makes sense.

actual behavior

The output is empty.
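
Given the cast warnings during conversion, one way to confirm that the converted checkpoint (rather than the engine build) is at fault is to scan it for non-finite values before building. A hedged sketch, assuming the converted checkpoint uses the usual `rank*.safetensors` layout written by convert_checkpoint.py:

    import glob
    import numpy as np
    from safetensors.numpy import load_file

    ckpt_dir = "/models/tmp/qwen2-hf/72b/trt/sq-bs4-il3072/checkpoint"
    for path in sorted(glob.glob(f"{ckpt_dir}/rank*.safetensors")):
        for name, tensor in load_file(path).items():
            # int8 weights cannot hold NaN; the float scaling factors can.
            if np.issubdtype(tensor.dtype, np.floating) and not np.isfinite(tensor).all():
                print(f"{path}: {name} contains NaN/Inf")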

additional notes

I tried the same commands with Qwen2 7B Instruct, which works properly:

python3 ../run.py --input_text "Hello, what's your name?" \
                  --max_output_len=50 \
                  --tokenizer_dir=/share/huggingface/hub/models--Qwen--Qwen2-7B-Instruct/snapshots/41c66b0be1c3081f13defc6bdf946c2ef240d6a6 \
                  --engine_dir=/models/tmp/qwen2/7b/trt/sq-bs4-il3072/engine/llm_engines

output:

[08/23/2024-06:29:41] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024072301 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024072301 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024072301 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 5120
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 5120
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 5119 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8328 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 944.88 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8320 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.82 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.17 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.15 GiB, available: 69.64 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 18337
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 80
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.68 GiB for max tokens in paged KV cache (1173568).
[08/23/2024-06:29:48] [TRT-LLM] [I] Load engine takes: 4.277831792831421 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, what's your name?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "As an AI, I don't have a name in the same sense that a human does. However, you can call me Assistant. I'm here to help you with any information or tasks you need assistance with. Just ask, and I'll do"

jershi425 commented 1 week ago

Hi @gloritygithub11, it looks like an issue during checkpoint conversion. Did you convert Qwen2-7B-Instruct using the same command as the 72B one? If not, could you please share the command you used? It would help me locate the issue. In the meantime, since you have 4 GPUs, could you try `python3 ./convert_checkpoint.py --model_dir /your_model_dir --output_dir /your_output_dir --dtype float16 --smoothquant 0.5 --tp_size 4` instead? This should work for your case.
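
For reference, a 4-rank (tp_size 4) engine is launched under MPI in the repo's examples, along these lines; the engine directory below is a placeholder for wherever the TP4 engine gets built:

    mpirun -n 4 --allow-run-as-root \
        python3 ../run.py --input_text "Hello, what's your name?" \
                          --max_output_len=50 \
                          --tokenizer_dir=/share/huggingface/hub/models--Qwen--Qwen2-72B/snapshots/87993795c78576318087f70b43fbf530eb7789e7 \
                          --engine_dir=<path_to_tp4_engine>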

gloritygithub11 commented 1 week ago

Hi @jershi425,

I used the same command for 7B and 72B. Since I will run the model on a node with a single GPU, I can't build with tp_size 4.

jershi425 commented 1 week ago

Hi @gloritygithub11, sorry, but we currently don't support single-GPU deployment of the 72B model, even with INT8 SmoothQuant. It is hardly feasible: the weights alone require about 72 GB, and the activations, KV cache, and other buffers on top of that will easily OOM a single GPU.
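
A back-of-the-envelope check against the 72B log above supports this; all numbers are approximate:

    # Rough feasibility check for one 80 GB A100, using figures from the log above.
    params = 72e9
    print(params / 2**30)        # ~67 GiB of int8 weights alone (1 byte/param)
    print(71811 / 1024)          # ~70 GiB: engine size the 72B run actually loaded
    print(79.15 - 71811 / 1024)  # ~9 GiB headroom; the log reported only 7.06 GiB
                                 # actually available for the paged KV cache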