Chenyangzh opened this issue 11 months ago
Hi, I tested this more and have the following supplementary description of the issue.
"For LLaMA v2 70B, there is a restriction on tensor parallelism that the number of KV heads must be divisible by the number of GPUs. For example, since the 70B model has 8 KV heads, you can run it with 2, 4 or 8 GPUs (1 GPU as well for FP8)." from https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#llama-v2-updates Is that mean a model using multi-query attention can not apply tensor-parall? 1 KV-head is not divisable.
I am testing yayi2, which uses multi-query attention: https://huggingface.co/wenge-research/yayi2-30b
I also tried Yi, which uses grouped-query attention and needs less than 1/10 of the scratch space in the build phase: https://huggingface.co/01-ai/Yi-34B-Chat
Any suggestions for multi-query-attention models?
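For reference, the restriction quoted above reduces to `num_kv_heads % tp_size == 0`. Here is a minimal sketch of that check (the head counts come from the discussion above; the helper itself is illustrative, not a TensorRT-LLM API):

```python
# Sketch of the tensor-parallel KV-head constraint described in the quoted doc.
# With multi-query attention (num_kv_heads == 1), only tp_size == 1 passes.
def tp_sizes_allowed(num_kv_heads, max_gpus=8):
    return [tp for tp in range(1, max_gpus + 1) if num_kv_heads % tp == 0]

print(tp_sizes_allowed(8))  # LLaMA v2 70B (GQA, 8 KV heads) -> [1, 2, 4, 8]
print(tp_sizes_allowed(1))  # MQA model with a single KV head -> [1]
```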
I got the same issue when building the Qwen 72B model, and found this:
https://nvidia.github.io/TensorRT-LLM/memory.html#activation-size
According to the page above, adding the build argument --enable_context_fmha works for me.
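For intuition about why the fused context attention kernel helps: without it, the context-phase attention has to hold an attention-score buffer that grows with the square of the sequence length. A rough back-of-the-envelope sketch (the formula and the 64-head count are simplifying assumptions, not TensorRT-LLM's exact accounting):

```python
# Rough illustration of why unfused context attention scratch grows
# quadratically with sequence length (simplified; real allocation differs).
def rough_scratch_bytes(batch, num_heads, seq_len, dtype_bytes=2):
    # Assume the unfused path keeps a seq_len x seq_len score matrix per head in fp16.
    return batch * num_heads * seq_len * seq_len * dtype_bytes

# Example with an assumed 64 heads and a 32k-token context:
print(f"{rough_scratch_bytes(1, 64, 32768) / 2**30:.1f} GiB")  # 128.0 GiB
```

With the fused kernel enabled, the full score matrix is never materialized, which is why the flag lowers the scratch requirement.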
Hi, do you still have any further issues or questions? If not, we'll close this soon.
Environment:
- 1 × NVIDIA A100 (80G)
- Docker version 20.10.24
- tensorrt 9.2.0.post12.dev5
- tensorrt-llm 0.7.1
- torch-tensorrt 0.0.0
Built the LLaMA model following the descriptions here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#long-context-length

Script:

```
python llama/build.py --model_dir /workspace/weight/ \
    --output_dir ./tmp/llama/trt_engines/fp16/1-gpu/ \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --enable_context_fmha \
    --multi_block_mode \
    --world_size 4 \
    --tp_size 4 \
    --pp_size 1 \
    --max_batch_size 1 \
    --max_input_len 32768 \
    --max_output_len 8192 \
    --max_beam_width 1 \
    --rotary_scaling dynamic 8
```

Warnings:

```
[01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/9/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/layers/10/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/10/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/layers/10/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/10/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/layers/10/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[01/05/2024-10:58:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/layers/10/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/layers/10/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[01/05/2024-10:58:35] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
```

Errors:

```
[01/05/2024-11:37:46] [TRT] [E] 4: Internal error: plugin node LLaMAForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_0 requires 1684659503360 bytes of scratch space, but only 85024112640 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[01/05/2024-11:37:46] [TRT] [E] 4: [pluginV2Builder.cpp::makeRunner::519] Error Code 4: Internal Error (Internal error: plugin node LLaMAForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_0 requires 1684659503360 bytes of scratch space, but only 85024112640 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
```
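For scale, a quick conversion of the byte counts in that error message into GiB (editorial arithmetic, not additional build output):

```python
# Convert the figures from the GPTAttention plugin error into GiB.
required = 1_684_659_503_360   # bytes of scratch requested by the plugin
available = 85_024_112_640     # bytes available on the 80 GB A100

print(f"required:  {required / 2**30:,.1f} GiB")   # 1,569.0 GiB (~1.5 TiB)
print(f"available: {available / 2**30:,.1f} GiB")  # 79.2 GiB
```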
Questions: