NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Failing to inference multi-GPU Llama engine #802

Open manarshehadeh opened 11 months ago

manarshehadeh commented 11 months ago

Env:

Issue: The ensemble model loads successfully, but inference via HTTP request fails. Command used: curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'

Request fails with the following assertion and call stack:

Assertion failed: input_ids: expected 2 dims, provided 1 dims (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:138)
1 0x7f451f4697fd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x177fd) [0x7f451f4697fd]
2 0x7f451f5797d8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1277d8) [0x7f451f5797d8]
3 0x7f451f4cbeb1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x79eb1) [0x7f451f4cbeb1]
4 0x7f451f4cd319 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7b319) [0x7f451f4cd319]
5 0x7f451f4d0f0d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7ef0d) [0x7f451f4d0f0d]
6 0x7f451f4bba28 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x69a28) [0x7f451f4bba28]
7 0x7f451f4bffb5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6dfb5) [0x7f451f4bffb5]
8 0x7f45b344f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f45b344f253]
9 0x7f45b31dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f45b31dfac3]
10 0x7f45b3271660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f45b3271660]
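The assertion says the engine's input_ids tensor must be rank 2 (a [batch_size, sequence_length] matrix), but the request path handed the runtime a rank-1 array. A minimal NumPy sketch of the shape mismatch and the fix (the token IDs here are illustrative, not what the backend actually produced):

```python
import numpy as np

# Token IDs for a single prompt, as a tokenizer typically emits them: rank 1.
token_ids = np.array([1, 1724, 338, 4933, 6509, 29973], dtype=np.int32)
assert token_ids.ndim == 1  # this is the "provided 1 dims" case in the assertion

# The runtime expects [batch_size, sequence_length], so add a batch axis.
input_ids = token_ids[np.newaxis, :]
assert input_ids.ndim == 2  # shape (1, 6): batch of one prompt
```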

Should inference requests for multi-GPU engines work the same way as for single-GPU engines?
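For reference, the same request can be issued from Python using only the standard library. This is a sketch that mirrors the curl command's JSON body; the actual POST (to the same localhost:8000 generate endpoint, assumed to be running) is left commented out so the payload construction stands alone:

```python
import json

# Same fields as the curl command above.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
}
body = json.dumps(payload)
print(body)

# To actually send it (requires a running Triton server):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v2/models/ensemble/generate",
#       data=body.encode("utf-8"),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode("utf-8"))
```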

wangyubo111 commented 10 months ago

I am facing the same issue with the CodeLlama 34B Instruct model:

nv-guomingz commented 2 weeks ago

Hi @wangyubo111, could you please try the latest release and see whether this issue still exists? Do you have any further issues or questions? If not, we'll close this soon.