NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

trtllm-build qwen2 0.5B failed #1967

Open wenshuai-xiaomi opened 1 month ago

wenshuai-xiaomi commented 1 month ago

[07/17/2024-01:56:09] [TRT] [E] Error Code: 4: Internal error: plugin node QWenForCausalLM/transformer/layers/0/attention/wrapper/gpt_attention/PLUGIN_V2_GPTAttention_0 requires 26927499520 bytes of scratch space, but only 15642329088 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().

[07/17/2024-01:56:09] [TRT] [E] IBuilder::buildSerializedNetwork: Error Code 4: Internal Error (Internal error: plugin node QWenForCausalLM/transformer/layers/0/attention/wrapper/gpt_attention/PLUGIN_V2_GPTAttention_0 requires 26927499520 bytes of scratch space, but only 15642329088 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit(). ) [07/17/2024-01:56:09] [TRT-LLM] [E] Engine building failed, please check the error log.
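For context, the two byte counts in the log can be compared directly; a minimal sketch (plain Python, numbers copied verbatim from the error above):

```python
# Scratch space requested by the GPT attention plugin vs. what the builder
# had available, using the byte counts from the error log above.
required_bytes = 26_927_499_520   # from the error log
available_bytes = 15_642_329_088  # from the error log

required_gib = required_bytes / 2**30
available_gib = available_bytes / 2**30
shortfall_gib = (required_bytes - available_bytes) / 2**30

print(f"required:  {required_gib:.1f} GiB")   # ~25.1 GiB
print(f"available: {available_gib:.1f} GiB")  # ~14.6 GiB
print(f"shortfall: {shortfall_gib:.1f} GiB")  # ~10.5 GiB
```

Since the ~25 GiB scratch request already exceeds the card's total 16 GB of VRAM, raising the workspace limit alone cannot satisfy it on this GPU.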

The issue occurs on a g4t4 with 16 GB of memory; it works on a V100 with 32 GB. How can I fix it on this GPU with 16 GB of memory?
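For reference, the limit the error message mentions is set on the TensorRT builder config; a minimal sketch of the equivalent Python call (network definition and engine build omitted; the 14 GiB value is purely illustrative, not a recommendation):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Cap the builder's scratch ("workspace") memory pool; this is the Python
# counterpart of the IBuilderConfig::setMemoryPoolLimit() call named in
# the error log. 14 GiB here is an assumed example value.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 14 << 30)
```

Note that this limit cannot exceed physical VRAM, so on a 16 GB card the ~25 GiB request in the log above would still fail; reducing the size of the build (for example, smaller max batch size or sequence lengths passed to trtllm-build) is the more likely route.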

wenshuai-xiaomi commented 1 month ago

By the way, on TensorRT-LLM version 0.9, with the patch from https://github.com/Franc-Z/QWen1.5_TensorRT-LLM and some small changes, the engine file can be generated and works well on a g4t4 with 16 GB of memory.

QiJune commented 1 month ago

Feel free to reopen it if you have further questions.

wenshuai-xiaomi commented 1 month ago

Why close the issue?

I just meant that it works on the 0.9 version but fails with the newest code.