NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Failing to inference multi-GPU Llama engine #802

Open manarshehadeh opened 11 months ago

manarshehadeh commented 11 months ago

Env:

Issue: The ensemble model loads successfully, but inference via HTTP request fails. Command used: curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'

Request fails with the following assertion and call stack:

Assertion failed: input_ids: expected 2 dims, provided 1 dims (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:138)
1 0x7f451f4697fd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x177fd) [0x7f451f4697fd]
2 0x7f451f5797d8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1277d8) [0x7f451f5797d8]
3 0x7f451f4cbeb1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x79eb1) [0x7f451f4cbeb1]
4 0x7f451f4cd319 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7b319) [0x7f451f4cd319]
5 0x7f451f4d0f0d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7ef0d) [0x7f451f4d0f0d]
6 0x7f451f4bba28 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x69a28) [0x7f451f4bba28]
7 0x7f451f4bffb5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6dfb5) [0x7f451f4bffb5]
8 0x7f45b344f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f45b344f253]
9 0x7f45b31dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f45b31dfac3]
10 0x7f45b3271660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f45b3271660]
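The assertion says the engine's input_ids tensor must be rank 2 (a [batch_size, sequence_length] matrix), but the request path handed the runtime a rank-1 array. A minimal NumPy sketch of the shape mismatch and the fix (the token IDs here are illustrative, not what the backend actually produced):

```python
import numpy as np

# Token IDs for a single prompt, as a tokenizer typically emits them: rank 1.
token_ids = np.array([1, 1724, 338, 4933, 6509, 29973], dtype=np.int32)
assert token_ids.ndim == 1  # this is the "provided 1 dims" case in the assertion

# The runtime expects [batch_size, sequence_length], so add a batch axis.
input_ids = token_ids[np.newaxis, :]
assert input_ids.ndim == 2  # shape (1, 6): batch of one prompt
```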

Should inference requests for multi-GPU engines work the same way as for single-GPU engines?
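For reference, the same request can be issued from Python using only the standard library. This is a sketch that mirrors the curl command's JSON body; the actual POST (to the same localhost:8000 generate endpoint, assumed to be running) is left commented out so the payload construction stands alone:

```python
import json

# Same fields as the curl command above.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
}
body = json.dumps(payload)
print(body)

# To actually send it (requires a running Triton server):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v2/models/ensemble/generate",
#       data=body.encode("utf-8"),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode("utf-8"))
```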

wangyubo111 commented 10 months ago

I am facing the same issue with the CodeLlama 34B Instruct model:

nv-guomingz commented 2 weeks ago

Hi @wangyubo111, could you please try the latest release and see whether this issue still exists? Do you have any further issues or questions? If not, we'll close this soon.