NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

QWenForCausalLM/transformer/vocab_embedding/embedding/GATHER_O_output_0: tensor volume exceeds 2147483647, dimensions are [num tokens,8192] #2204

Open zymy-chen opened 3 weeks ago

zymy-chen commented 3 weeks ago

System Info

GPU Name: 8 * H20 TensorRT-LLM : 0.11.0 NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.4

Who can help?

No response

Information

Tasks

Reproduction

qwen2-72B, batch_size=30, input_len=8192, output_len=512. It works fine when running FP16, but FP8 fails with `tensor volume exceeds 2147483647`.
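The limit in the error message is INT32_MAX (2^31 − 1); older TensorRT versions index tensor volumes with 32-bit integers. A back-of-the-envelope sketch (numbers taken from this report; the actual runtime token count depends on padding removal and scheduling, and the helper below is mine, not part of TensorRT-LLM) shows how close this configuration sits to the limit:

```python
# Estimate the element count ("volume") of the [num_tokens, hidden_size]
# vocab_embedding GATHER output and compare it with TensorRT's INT32 limit.

INT32_MAX = 2**31 - 1  # 2147483647, the value in the error message

def embedding_output_volume(batch_size: int, input_len: int, hidden_size: int) -> int:
    """Worst-case element count of the [num_tokens, hidden_size] output,
    assuming every sequence is full length, so with remove_input_padding
    num_tokens = batch_size * input_len."""
    num_tokens = batch_size * input_len
    return num_tokens * hidden_size

volume = embedding_output_volume(batch_size=30, input_len=8192, hidden_size=8192)
print(f"volume = {volume:,} ({volume / INT32_MAX:.1%} of INT32_MAX)")
```

With these numbers the volume is 2,013,265,920, already about 94% of the limit, so any intermediate tensor even slightly larger than the embedding output can push past it.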

Expected behavior


actual behavior

trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8 --output_dir ./8-gpu/ --gemm_plugin float16 --max_batch_size 64 --max_input_len 256 --max_output_len 256 --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --workers 8 --tp_size 8 --paged_kv_cache enable --use_paged_context_fmha enable --use_fused_mlp

additional notes

None

jershi425 commented 1 week ago

Hi @zymy-chen, this is a known issue for TRT versions older than v10.1. Could you please check which TRT version you are using?
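The installed TRT version can be read from `tensorrt.__version__`. A minimal sketch of the comparison against the v10.1 threshold (the helper name is mine, not part of TensorRT):

```python
def trt_version_at_least(version: str, minimum: tuple[int, ...]) -> bool:
    """Compare a dotted version string (e.g. tensorrt.__version__)
    against a minimum (major, minor) tuple."""
    parts = tuple(int(p) for p in version.split(".")[:len(minimum)])
    return parts >= minimum

# Usage with the real package would be:
#   import tensorrt
#   trt_version_at_least(tensorrt.__version__, (10, 1))
print(trt_version_at_least("10.2.0", (10, 1)))  # True: at or above the fixed version
print(trt_version_at_least("10.0.1", (10, 1)))  # False: inside the known-issue range
```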

zymy-chen commented 1 week ago

> Hi @zymy-chen, this is a known issue for TRT versions older than v10.1. Could you please check which TRT version you are using?

TRT version is 10.2.0

lfr-0531 commented 1 week ago

Can you try to rerun using the latest version of TensorRT-LLM to see if this issue persists?

zymy-chen commented 3 days ago

> Can you try to rerun using the latest version of TensorRT-LLM to see if this issue persists?

I tried TensorRT-LLM v0.12.0, but the problem still exists.