NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Problem with Qwen2-7B-Instruct inference after quantization in FP8 #2007

Open VladislavDuma opened 1 month ago

VladislavDuma commented 1 month ago

System Info

OS:
Operating System: Ubuntu 20.04.6 LTS
          Kernel: Linux 5.4.0-146-generic
    Architecture: x86-64

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              1
Core(s) per socket:              8
Socket(s):                       2
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
Stepping:                        6
CPU MHz:                         1995.312
BogoMIPS:                        3990.62
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        64 MiB
L3 cache:                        32 MiB

GPU:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                       On | 00000000:06:10.0 Off |                    0 |
| N/A   28C    P8               16W /  75W|      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L4                       On | 00000000:06:11.0 Off |                    0 |
| N/A   29C    P0               19W /  75W|      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|

Reproduction

  1. Move to TensorRT-LLM/examples/qwen folder
  2. Quantize Qwen2-7B-Instruct model
    python ../quantization/quantize.py --model_dir  {path_to_model}/Qwen2-7B-Instruct --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir {path_to_quantization}/Qwen2-7B-Instruct_1gpu_fp8 --calib_size 512 --tp_size 1
  3. Add these fields to config.json (a Python sketch for this patch follows the list):
    "moe_num_experts": 0,
    "moe_top_k": 0,
    "moe_normalization_mode": 0

    Then build the engine from the quantized checkpoint with trtllm-build:

    trtllm-build --checkpoint_dir {path_to_quantization}/Qwen2-7B-Instruct_1gpu_fp8 --output_dir {path_to_engine}/Qwen2-7B-Instruct_1gpu_fp8_engine --gemm_plugin auto --workers 1
  4. Try to generate output
    
    from tensorrt_llm import LLM, SamplingParams

    model_id = f'{path_to_engine}/Qwen2-7B-Instruct_1gpu_fp8_engine/'
    tokenizer_id = f'{path_to_model}/Qwen2-7B-Instruct'

    llm = LLM(model=model_id, tokenizer=tokenizer_id)

    prompts = [
        f'### INPUT:\nThe GeForce RTX 4090 is an enthusiast-class graphics card by NVIDIA, launched on September 20th, 2022.'
        f' Built on the 5 nm process, and based on the AD102 graphics processor, in its AD102-300-A1 variant, the card '
        f'supports DirectX 12 Ultimate. This ensures that all modern games will run on GeForce RTX 4090. Additionally, the '
        f'DirectX 12 Ultimate capability guarantees support for hardware-raytracing, variable-rate shading and more, in '
        f'upcoming video games. The AD102 graphics processor is a large chip with a die area of 609 mm² and 76,300 million '
        f'transistors. Unlike the fully unlocked TITAN Ada, which uses the same GPU but has all 18432 shaders enabled, '
        f'NVIDIA has disabled some shading units on the GeForce RTX 4090 to reach the product\'s target shader count. '
        f'It features 16384 shading units, 512 texture mapping units, and 176 ROPs. Also included are 512 tensor cores which'
        f' help improve the speed of machine learning applications. The card also has 128 raytracing acceleration cores. '
        f'NVIDIA has paired 24 GB GDDR6X memory with the GeForce RTX 4090, which are connected using a 384-bit memory '
        f'interface. The GPU is operating at a frequency of 2235 MHz, which can be boosted up to 2520 MHz, memory is '
        f'running at 1313 MHz (21 Gbps effective).\n\n### INSTRUCTIONS:\nWhat is a RTX 4090?\n\n### OUTPUT:\n'
    ]

    sampling_params = SamplingParams(
        temperature=0,
        max_new_tokens=128,
        top_k=20,
        top_p=0.5,
        repetition_penalty=1.1,
    )
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f'{prompt}')
        print(f'{generated_text}')
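
For step 3, here is a minimal sketch of the config.json patch (using only Python's standard json module; path_to_quantization is a placeholder, as in the commands above):

    import json

    # Placeholder path, as in the quantization step above.
    config_path = f'{path_to_quantization}/Qwen2-7B-Instruct_1gpu_fp8/config.json'

    with open(config_path) as f:
        config = json.load(f)

    # Fields from step 3: zero values, since Qwen2-7B-Instruct is not an MoE
    # model, but the Qwen build path in this version appears to read them.
    config.update({
        'moe_num_experts': 0,
        'moe_top_k': 0,
        'moe_normalization_mode': 0,
    })

    with open(config_path, 'w') as f:
        json.dump(config, f, indent=2)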



### Expected behavior

Output without quantization (FP16):
> The GeForce RTX 4090 is a high-performance graphics card designed by NVIDIA, released on September 20th, 2022. It utilizes the AD102 graphics processor, built on a 5 nanometer process, and supports DirectX 12 Ultimate, ensuring compatibility with modern games and enabling advanced features like hardware-raytracing and variable-rate shading. With 16,384 shading units, 512 texture mapping units, and 176 raster operation pipelines, it offers powerful processing capabilities. The card includes 512 tensor cores for AI applications and 1

### Actual behavior

Output with FP8 quantization:
> The GeForce RTX 400 is an enthusiast-class graphics card by NVIDIA, launched on Seetemberber 20ndg, 2022. 
> It is built on the 5 nm proccess, and basaed on the AD11000 graphics processor, in in, in its ad102--300-a1 variante,n, thee ccard soppors DirectX 12 Ultimateiate. Thhis ensures that thaatt all moden gaames will will r rung on Geeforce R RTXX X4t 4

### Additional notes

I tried to quantize `Qwen2` to speed up inference, but the generated output contains a lot of errors. I know that `Qwen` and `Qwen2` are slightly different, but I tried to produce a quantized FP8 model with the available scripts.

Are there any plans to add support for `Qwen2` in the near future?

QiJune commented 1 month ago

Hi @VladislavDuma, Qwen2 is not officially supported yet; we are still working on the verification. BTW, is the "Output without quantization (FP16)" the result of TRT-LLM or HF? We need to make sure the sampling configs of FP16 and FP8 are the same (a matched-settings sketch is below).
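
For a like-for-like check, here is a minimal FP16 baseline sketch with matched sampling settings (assuming the HF transformers generate API; temperature=0 in SamplingParams means greedy decoding, so it maps to do_sample=False here, and path_to_model / prompts are as defined in the reproduction above):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder path, as in the reproduction steps above.
    tokenizer = AutoTokenizer.from_pretrained(f'{path_to_model}/Qwen2-7B-Instruct')
    model = AutoModelForCausalLM.from_pretrained(
        f'{path_to_model}/Qwen2-7B-Instruct',
        torch_dtype=torch.float16,
        device_map='auto',
    )

    inputs = tokenizer(prompts[0], return_tensors='pt').to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,         # temperature=0 above means greedy decoding
        repetition_penalty=1.1,  # matches the SamplingParams above
    )
    # Decode only the newly generated tokens.
    print(tokenizer.decode(output_ids[0][inputs['input_ids'].shape[1]:],
                           skip_special_tokens=True))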

github-actions[bot] commented 17 hours ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.