NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Can int4_awq and w4a8_awq support deepseek? #1693

Open activezhao opened 5 months ago

activezhao commented 5 months ago

System Info

CPU x86_64

GPU NVIDIA L20

TensorRT-LLM branch: v0.8.0

CUDA: NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.3

Who can help?

@Tracin

Information

Tasks

Reproduction

I built the Docker container from TensorRT-LLM v0.8.0 and tried to convert the deepseek-6.7b-base model using w4a8_awq, but I hit the following error.

"RuntimeError: Provided tensor names are different from those expected by the engine."

The commands are:

# Quantize the checkpoint
python /data/tensorrt_llm/examples/quantization/quantize.py \
    --model_dir /data/deepseek-6.7b-online-v2.1 \
    --dtype bfloat16 \
    --qformat w4a8_awq \
    --tp_size 2 \
    --awq_block_size 128 \
    --kv_cache_dtype fp8 \
    --output_dir /data/deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
    --calib_size 32

# Build the TensorRT engines from the quantized checkpoint
trtllm-build \
    --checkpoint_dir /data/deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
    --output_dir /data/trt-engines-deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
    --workers 2 \
    --paged_kv_cache enable \
    --gpt_attention_plugin bfloat16 \
    --max_batch_size 64 \
    --gemm_plugin bfloat16

Expected behavior

The commands should run successfully.

Actual behavior

The error is:

[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[05/22/2024-12:58:33] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set lookup_plugin to None.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set lora_plugin to None.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set context_fmha to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set remove_input_padding to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set multi_block_mode to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set enable_xqa to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/22/2024-12:58:33] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 392, in build_and_save
    engine = build(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 272, in build
    model.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 338, in load
    raise RuntimeError(err_msg)
RuntimeError: Provided tensor names are different from those expected by the engine.

Additional notes

As asked in the title: are int4_awq and w4a8_awq supported for DeepSeek models?

Thanks.

Barry-Delaney commented 5 months ago

Hi @activezhao, DeepSeek models are not supported yet, so int4_awq and w4a8_awq are not supported for DeepSeek either.
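
The error message in the traceback points to a mismatch between the tensor names in the checkpoint and the names the model definition expects. As a debugging aid, the following is a minimal sketch of such a comparison. The function name `diff_tensor_names` and the example tensor names are hypothetical, and how to obtain the two name lists from TensorRT-LLM is not shown here.

```python
from typing import Dict, Iterable, List


def diff_tensor_names(expected: Iterable[str], provided: Iterable[str]) -> Dict[str, List[str]]:
    """Report which names are missing from, or unexpected in, the provided set."""
    expected_set, provided_set = set(expected), set(provided)
    return {
        # Names the model wants but the checkpoint does not contain.
        "missing": sorted(expected_set - provided_set),
        # Names the checkpoint contains but the model does not know.
        "unexpected": sorted(provided_set - expected_set),
    }


# Made-up names, for illustration only.
expected_names = ["layers.0.attention.qkv.weight", "layers.0.mlp.fc.weight"]
provided_names = ["layers.0.attention.qkv.weight", "layers.0.mlp.fc.extra_scale"]
report = diff_tensor_names(expected_names, provided_names)
print(report)
```

Listing the differing names usually shows quickly whether the quantization format produced tensors that the model definition does not define.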

activezhao commented 5 months ago

@Barry-Delaney OK, is there any plan to support them?

Thanks

activezhao commented 5 months ago

Hi @Barry-Delaney, I use the following commands for FP8 quantization.

python /data/tensorrt_llm/examples/quantization/quantize.py \
    --model_dir /data/deepseek-6.7b-online-v2.1 \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /data/trt-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
    --calib_size 512 \
    --tp_size 2

# Build trtllm engines from the trtllm checkpoint
trtllm-build \
    --checkpoint_dir /data/trt-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
    --output_dir /data/trt_engines-deepseek6.7b-online-v2.1-2gpu-fp8-bz32/2-gpu \
    --max_input_len 8192 \
    --max_output_len 1024 \
    --gemm_plugin float16 \
    --strongly_typed \
    --paged_kv_cache enable \
    --gpt_attention_plugin float16 \
    --max_batch_size 32 \
    --workers 2

These are the params of the request.

    "max_tokens": 256,
    "temperature": 0.2,
    "top_p": 0.95,
    "n": 1,
    "stream": true,
    "stop": ["\n"],
    "repetition_penalty": 1,

With FP8 quantization, latency has dropped and throughput has improved.

However, some Chinese characters in the inference results are garbled.

Is this caused by the reduced precision of FP8? And is there a way to solve it?

Thanks

{
    "id": "",
    "model": "codewise-d1-t",
    "object": "text_completion",
    "created": 0,
    "choices": [
        {
            "index": 0,
            "text": "3: \"颗粒剂\", 4: \"注射剂\", 5: \"口服散剂\", 6: \"滴��剂\", 7: \"灌肠剂\", 8: \"��剂\", 9: \"缓释控释剂型\", 10: \"缓控释颗粒剂\", 11: \"乳膏剂\", 12: \"贴剂\", 13: \"外用冻干制剂\", 14: \"吸入剂\", 15: \"凝胶剂\", 16: \"片剂\", 17: \"局部用散剂\", 18: \"溶液剂\", 19: \"胶囊剂\", 20: \"胶��剂\"}\n",
            "logprobs": {
                "text_offset": [

                ],
                "token_logprobs": [

                ],
                "tokens": [

                ],
                "top_logprobs": [
                    {
                        "3": -0.0000010728841743912199
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "颗": -0.0000009536747711536009
                    },
                    {
                        "粒": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000013113030945532955
                    },
                    {
                        "4": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "注": -0.0000009536747711536009
                    },
                    {
                        "射": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000019073504518019035
                    },
                    {
                        "5": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "口": -0.0000009536747711536009
                    },
                    {
                        "服": -0.0000009536747711536009
                    },
                    {
                        "散": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000019073504518019035
                    },
                    {
                        "6": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "滴": -0.0000009536747711536009
                    },
                    {
                        "�": -0.0000009536747711536009
                    },
                    {
                        "�": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000010728841743912199
                    },
                    {
                        "7": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "灌": -0.0000009536747711536009
                    },
                    {
                        "肠": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000010728841743912199
                    },
                    {
                        "8": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "�": -0.0000009536747711536009
                    },
                    {
                        "�": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000014305124977909145
                    },
                    {
                        "9": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "缓": -0.0000009536747711536009
                    },
                    {
                        "释": -0.0000009536747711536009
                    },
                    {
                        "控": -0.0000009536747711536009
                    },
                    {
                        "释": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "型": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000011920935776288388
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "0": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "缓": -0.0000009536747711536009
                    },
                    {
                        "控": -0.0000009536747711536009
                    },
                    {
                        "释": -0.0000009536747711536009
                    },
                    {
                        "颗": -0.0000009536747711536009
                    },
                    {
                        "粒": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "乳": -0.0000009536747711536009
                    },
                    {
                        "膏": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "2": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "贴": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "3": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "外": -0.0000009536747711536009
                    },
                    {
                        "用": -0.0000009536747711536009
                    },
                    {
                        "冻": -0.0000009536747711536009
                    },
                    {
                        "干": -0.0000009536747711536009
                    },
                    {
                        "制": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "4": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "吸": -0.0000009536747711536009
                    },
                    {
                        "入": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "5": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "凝": -0.0000009536747711536009
                    },
                    {
                        "胶": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "6": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "片": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "7": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "局": -0.0000009536747711536009
                    },
                    {
                        "部": -0.0000009536747711536009
                    },
                    {
                        "用": -0.0000009536747711536009
                    },
                    {
                        "散": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "8": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "溶": -0.0000009536747711536009
                    },
                    {
                        "液": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "1": -0.0000009536747711536009
                    },
                    {
                        "9": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "胶": -0.0000009536747711536009
                    },
                    {
                        "囊": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\",": -0.0000009536747711536009
                    },
                    {
                        "": -0.0000009536747711536009
                    },
                    {
                        "2": -0.0000009536747711536009
                    },
                    {
                        "0": -0.0000009536747711536009
                    },
                    {
                        ":": -0.0000009536747711536009
                    },
                    {
                        "\"": -0.0000009536747711536009
                    },
                    {
                        "胶": -0.0000009536747711536009
                    },
                    {
                        "�": -0.0000009536747711536009
                    },
                    {
                        "�": -0.0000009536747711536009
                    },
                    {
                        "剂": -0.0000009536747711536009
                    },
                    {
                        "\"}": -0.08097536116838455
                    },
                    {
                        "\n": -0.0000009536747711536009
                    }
                ]
            },
            "finish_reason": ""
        }
    ],
    "usage": null
}

Barry-Delaney commented 5 months ago

Is there any plan to support?

There is no ongoing work at the moment. Please feel free to open a separate feature request if you need it.

Is this caused by the decrease in FP8 accuracy? And is there a way to solve it?

I see there are several categories of DeepSeek models. Are your experiments based on a model whose architectures == LlamaForCausalLM?

activezhao commented 5 months ago

@Barry-Delaney The model is based on deepseek-coder-6.7b-base.

And the model's architecture is "LlamaForCausalLM".

activezhao commented 5 months ago

@Barry-Delaney Hi Barry, I also tested the non-quantized model and the same problem occurred.

So is it caused by something else?

It's weird.

Barry-Delaney commented 5 months ago

One possible reason is that the model you mentioned uses bfloat16 precision and your command converts it to float16. Let me try to reproduce it.
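
The dtype mismatch suggested here can be checked before running quantize.py. The following is a minimal sketch, assuming the checkpoint directory contains a Hugging Face style config.json with a `torch_dtype` field. The helper name `check_dtype` is hypothetical and is not part of TensorRT-LLM.

```python
import json
from pathlib import Path


def check_dtype(model_dir: str, requested: str) -> bool:
    """Return True if the dtype stored in config.json matches the requested one.

    Assumption: config.json follows the Hugging Face layout and has a
    "torch_dtype" field such as "bfloat16" or "float16".
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    stored = config.get("torch_dtype")
    if stored is None:
        print("config.json has no torch_dtype field; nothing to compare")
        return False
    if stored != requested:
        print(f"checkpoint is stored as {stored}, but {requested} was requested")
        return False
    return True
```

For example, `check_dtype("/data/deepseek-6.7b-online-v2.1", "float16")` would return False if the checkpoint was saved as bfloat16, which is the situation described in this comment.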

activezhao commented 5 months ago

@Barry-Delaney OK, thanks, I will also try it.

activezhao commented 5 months ago

@Barry-Delaney Here is the request; you can try it.

curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d '{"text_input": "package gtin\n//2\n//外用液体剂\n//2018-08-15 16:12:50\n//3\n//颗粒剂\n//2018-08-15 16:12:50\n//4\n//注射剂\n//2018-08-15 16:12:50\n//5\n//口服散剂\n//2018-08-15 16:12:50\n//6\n//滴丸剂\n//2018-08-15 16:12:50\n//7\n//灌肠剂\n//2018-08-15 16:12:50\n//8\n//栓剂\n//2018-08-15 16:12:50\n//9\n//缓释控释剂型\n//2018-08-15 16:12:50\n//10\n//缓控释颗粒剂\n//2018-08-15 16:12:50\n//11\n//乳膏剂\n//2018-08-15 16:12:50\n//12\n//贴剂\n//2018-08-15 16:12:50\n//13\n//外用冻干制剂\n//2018-08-15 16:12:50\n//14\n//吸入剂\n//2018-08-15 16:12:50\n//15\n//凝胶剂\n//2018-08-15 16:12:50\n//16\n//片剂\n//2018-08-15 16:12:50\n//17\n//局部用散剂\n//2018-08-15 16:12:50\n//18\n//溶液剂\n//2018-08-15 16:12:50\n//19\n//胶囊剂\n//2018-08-22 17:49:54\n//20\n//胶丸剂\n//2018-12-20 15:20:56\n\n// DosageFormMap 剂型\nvar DosageFormMap = map[int]string{1: \"口服常释剂型\", 2: \"外用液体剂\", ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": true, "temperature": 0.6, "return_log_probs": true, "generation_logits": true}'

activezhao commented 5 months ago

@Barry-Delaney I opened an issue in the tensorrtllm_backend repository, and handoku suggested using BLS (Business Logic Scripting) to solve this problem.

https://github.com/triton-inference-server/tensorrtllm_backend/issues/493
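
One explanation consistent with the garbled output earlier in this thread (an assumption, not confirmed here) is that the problem comes from streaming rather than from quantization. With a byte-level tokenizer, a single Chinese character can span several tokens, and decoding each streamed token on its own yields U+FFFD replacement characters. The sketch below reproduces that effect with plain Python. The byte split is made up for illustration and is not the real DeepSeek tokenization.

```python
import codecs

# Hypothetical chunk boundaries: the 3-byte character "丸" is split in two.
text = "滴丸剂"
raw = text.encode("utf-8")  # 9 bytes, 3 per character
chunks = [raw[:5], raw[5:6], raw[6:]]

# Naive streaming: decode every chunk independently, so an incomplete
# multi-byte sequence is replaced with U+FFFD.
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

# Incremental decoding: the decoder buffers incomplete byte sequences and
# emits a character only once all of its bytes have arrived.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
buffered = "".join(decoder.decode(chunk) for chunk in chunks)
buffered += decoder.decode(b"", final=True)

print(naive)
print(buffered)
```

Buffering bytes until a character is complete, which a BLS model or the client could do, avoids the replacement characters without changing the model or its precision.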

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.