NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Build Long Input Length of Llama model Error: lack of scratch space #823

Open Chenyangzh opened 11 months ago

Chenyangzh commented 11 months ago

Environment: 1 × NVIDIA A100 (80 GB), Docker 20.10.24, tensorrt 9.2.0.post12.dev5, tensorrt-llm 0.7.1, torch-tensorrt 0.0.0

Built the Llama model following the long-context-length instructions here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#long-context-length

Script:

    python llama/build.py --model_dir /workspace/weight/ \
        --output_dir ./tmp/llama/trt_engines/fp16/1-gpu/ \
        --remove_input_padding \
        --use_gpt_attention_plugin float16 \
        --use_gemm_plugin float16 \
        --enable_context_fmha \
        --multi_block_mode \
        --world_size 4 \
        --tp_size 4 \
        --pp_size 1 \
        --max_batch_size 1 \
        --max_input_len 32768 \
        --max_output_len 8192 \
        --max_beam_width 1 \
        --rotary_scaling dynamic 8

Warnings:

    [01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/9/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/layers/10/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
    [01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/10/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/layers/10/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
    [01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/10/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/layers/10/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
    [01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/10/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/layers/10/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
    [01/05/2024-10:58:35] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16

Errors:

    [01/05/2024-11:37:46] [TRT] [E] 4: Internal error: plugin node LLaMAForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_0 requires 1684659503360 bytes of scratch space, but only 85024112640 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
    [01/05/2024-11:37:46] [TRT] [E] 4: [pluginV2Builder.cpp::makeRunner::519] Error Code 4: Internal Error (Internal error: plugin node LLaMAForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_0 requires 1684659503360 bytes of scratch space, but only 85024112640 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
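For reference, the workspace the error message refers to is the TensorRT builder's workspace memory pool. Below is a minimal sketch, using only the standard TensorRT Python API, of how that limit is set; it is not a TensorRT-LLM build.py flag, and the surrounding network setup is omitted. Note also that the plugin asks for roughly 1.68 TB of scratch while only about 85 GB is available on an A100-80G, so raising the limit alone cannot make this configuration build.

```python
# Sketch only: the raw TensorRT API the error message points at.
# In TensorRT-LLM 0.7.x the builder config is created inside build.py,
# so this is illustrative rather than a drop-in fix for the script above.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Cap the workspace pool at 80 GiB (roughly the A100-80G ceiling). The failing
# attention plugin requests ~1.68 TB, so the scratch requirement itself has to
# shrink (context FMHA, a smaller max_input_len, etc.) for the build to pass.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 80 * (1 << 30))
```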

Questions:

  1. What is the scratch space used for, and how can it be estimated? How can this error be prevented when building with a long input length? (A rough estimate is sketched after this list.)
  2. Runtime GPU memory usage is explained well on the Llama tutorial page. The 70B model should be built with world_size 8, which means it needs 8 cards for inference, yet it only uses about 18 GB per card.
  3. I found that the OPT model uses a checkpoint-conversion step before building the TRT engine. Will that new pipeline be applied to all other models in the future? Does the 3-step pipeline require less scratch space during the build process?
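On question 1, a rough back-of-envelope sketch is below. It assumes, purely for illustration, that the dominant context-phase scratch is an fp16 attention score tensor of shape [batch, heads, seq, seq]; this is not the exact allocation formula of the GPT attention plugin, but it shows why the requirement grows quadratically with input length.

```python
# Rough estimate only, ASSUMING the unfused context attention materializes a
# [batch, heads, seq, seq] fp16 score tensor. Not the plugin's exact formula.

def rough_attention_scratch_bytes(batch_size: int, num_heads: int,
                                  seq_len: int, bytes_per_elem: int = 2) -> int:
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem

# Head count in the spirit of a 70B Llama (64 heads) and the failing build's
# max_input_len of 32768 -- illustrative numbers, not measured values.
est = rough_attention_scratch_bytes(batch_size=1, num_heads=64, seq_len=32768)
print(f"~{est / 1e9:.0f} GB for one layer's score matrix")  # ~137 GB
```

The builder actually reports about 1.68 TB, so the real allocation is larger than this toy estimate, but the quadratic term in seq_len is the part worth watching when raising max_input_len.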
Chenyangzh commented 10 months ago

Hi, I have done more testing and have the following supplementary notes on this issue.

  1. "For LLaMA v2 70B, there is a restriction on tensor parallelism that the number of KV heads must be divisible by the number of GPUs. For example, since the 70B model has 8 KV heads, you can run it with 2, 4 or 8 GPUs (1 GPU as well for FP8)." from https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#llama-v2-updates Is that mean a model using multi-query attention can not apply tensor-parall? 1 KV-head is not divisable.

  2. I am testing yayi2, which uses multi-query attention: https://huggingface.co/wenge-research/yayi2-30b

  3. I also tried Yi, which uses grouped-query attention and needs less than 1/10 of the scratch space in the build phase: https://huggingface.co/01-ai/Yi-34B-Chat

Any suggestions for multi-query attention models?
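To make the restriction quoted in point 1 concrete, here is a minimal sketch of the divisibility check it describes; the helper and head counts are illustrative and not part of the TensorRT-LLM API.

```python
# Divisibility check implied by the quoted docs: the number of KV heads must be
# divisible by the tensor-parallel size. Illustrative helper, not TensorRT-LLM API.

def tp_sizes_allowed(num_kv_heads: int, candidates=(1, 2, 4, 8)) -> list[int]:
    return [tp for tp in candidates if num_kv_heads % tp == 0]

print(tp_sizes_allowed(8))  # LLaMA v2 70B (GQA, 8 KV heads) -> [1, 2, 4, 8]
print(tp_sizes_allowed(1))  # multi-query attention (1 KV head) -> [1]
```

Read that way, a pure MQA model only passes the check at tp_size 1; whether TensorRT-LLM instead replicates the single KV head across ranks is something to confirm in the docs rather than assume.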

ywx217 commented 9 months ago

I got the same issue when building the Qwen-72B model, and found this:

https://nvidia.github.io/TensorRT-LLM/memory.html#activation-size

According to the page above, adding the build argument --enable_context_fmha worked for me.

nv-guomingz commented 2 weeks ago

Hi, do you still have any further issues or questions? If not, we'll close this issue soon.