NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Support INT4 Weights, FP8 Activations in TRT-LLM #1312

Open ttim opened 8 months ago

ttim commented 8 months ago

System Info

H100, Llama 70B

Who can help?

No response

Information

Tasks

Reproduction

  1. Use the Llama model API from TRT-LLM
  2. Try to use AMMO's INT4-weight, FP8-activation quantization

Expected behavior

Should be possible

Actual behavior

Impossible

Additional notes

Not a bug, but a feature request

Barry-Delaney commented 8 months ago

@ttim you can use INT4 weights & FP8 activations with the following steps:

  1. Update your local repo to the latest main and build with --cuda_architectures "90-real".
  2. Follow the INT4 AWQ instructions, but replace the --qformat value with w4a8_awq when running quantize.py.
  3. To enable W4A8_AWQ together with the FP8 KV cache, also pass --kv_cache_dtype fp8 to the quantize.py command (see the command sketch below).
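
For reference, a rough command sketch of those three steps for a Llama checkpoint. This is only a sketch: the model and output paths are placeholders, and the flag values mirror the ones already used elsewhere in this thread, so adjust them for your setup.

# Build and install TensorRT-LLM from source for Hopper (SM90),
# so the W4A8 kernels are compiled in.
python3 scripts/build_wheel.py --cuda_architectures "90-real"

# Quantize to INT4-AWQ weights + FP8 activations, with an FP8 KV cache.
python examples/quantization/quantize.py --model_dir /path/to/llama-70b \
                --dtype float16 \
                --qformat w4a8_awq \
                --awq_block_size 128 \
                --kv_cache_dtype fp8 \
                --calib_size 32 \
                --output_dir /path/to/llama-70b-w4a8-awq

# Build the TensorRT engine from the quantized checkpoint.
trtllm-build --checkpoint_dir /path/to/llama-70b-w4a8-awq \
             --output_dir /path/to/llama-70b-w4a8-awq-engine \
             --gemm_plugin float16
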
felixslu commented 8 months ago

@Barry-Delaney, I got an error when building w4a8_awq engines for Llama 7B with the trtllm-build tool. My TRT-LLM version is v0.8.0. (By the way, the quantization stage works fine with the quantize.py script.)

"RuntimeError: Provided tensor names are different from those expected by the engine"

Could you give me some advice? Is this a bug?

Barry-Delaney commented 8 months ago

@felixslu This error appears when the generated checkpoints' names are not converted correctly. This is fixed in the latest main branch.
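
If you want to double-check on your side, here is a small, untested sketch (the checkpoint path is a placeholder) that prints the installed TensorRT-LLM version and lists the tensor names stored in the generated checkpoint, which you can compare against the names the engine builder expects:

# Print the installed TensorRT-LLM version.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# List the tensor names in the quantized checkpoint (TensorRT-LLM checkpoints
# are stored as rank<N>.safetensors files next to config.json).
python3 -c "from safetensors import safe_open; print(list(safe_open('/path/to/checkpoint/rank0.safetensors', framework='pt').keys()))"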

activezhao commented 6 months ago

@felixslu This error appears when the generated checkpoints' names are not converted correctly. This is fixed in the latest main branch.

Hi @Barry-Delaney, is this issue fixed in v0.8.0?

I hit the same error: "RuntimeError: Provided tensor names are different from those expected by the engine."

The command is:

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-6.7b-online-v2.1 \
                --dtype bfloat16 \
                --qformat w4a8_awq \
                --tp_size 2 \
                --awq_block_size 128 \
                --kv_cache_dtype fp8 \
                --output_dir /data/deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
                --calib_size 32

trtllm-build --checkpoint_dir /data/deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
             --output_dir /data/trt-engines-deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
             --workers 2 \
             --paged_kv_cache enable \
             --gpt_attention_plugin bfloat16 \
             --max_batch_size 64  \
             --gemm_plugin bfloat16

The error is:

[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[05/22/2024-12:58:33] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set lookup_plugin to None.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set lora_plugin to None.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set context_fmha to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set remove_input_padding to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set multi_block_mode to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set enable_xqa to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/22/2024-12:58:33] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 392, in build_and_save
    engine = build(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 272, in build
    model.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 338, in load
    raise RuntimeError(err_msg)
RuntimeError: Provided tensor names are different from those expected by the engine.

Barry-Delaney commented 6 months ago

@activezhao This is because Deepseek is not supported yet.

activezhao commented 6 months ago

@activezhao This is because Deepseek is not supported yet.

@Barry-Delaney OK, is there any plan to support it?

We are currently using FP8, and the performance is great. Now we want to try W4A8.

If W4A8 does not support DeepSeek, what about INT4 AWQ (W4A16)?

Thanks.

Barry-Delaney commented 6 months ago

@activezhao Thanks for the feedback! INT4 AWQ is also not supported, as the DeepSeek model is not implemented yet. Please open a feature request for DeepSeek model support in case you need it.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.