NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Triton server failed to start: out of memory #260

Open sleepwalker2017 opened 1 year ago

sleepwalker2017 commented 1 year ago

The engine runs fine when I use Python to do offline inference with TRT-LLM.

But when I serve it with Triton, it fails with the error below.

Why is that? Does the Triton server use more memory than TRT-LLM offline inference?

I'm using max_batch_size 24 on 2 V100 GPUs.

[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12569 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12569 (MiB)
E1103 02:59:01.152656 81041 backend_model.cc:553] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:112)
1       0x7f5fe6888d85 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x35d85) [0x7f5fe6888d85]
2       0x7f5fe68e4f08 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x91f08) [0x7f5fe68e4f08]
3       0x7f5fe68d8590 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x85590) [0x7f5fe68d8590]
4       0x7f5fe69265a4 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xd35a4) [0x7f5fe69265a4]
5       0x7f5fe690947e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb647e) [0x7f5fe690947e]
6       0x7f5fe68c80fe /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x750fe) [0x7f5fe68c80fe]
7       0x7f5fe68a9b03 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56b03) [0x7f5fe68a9b03]
8       0x7f5fe68a4335 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51335) [0x7f5fe68a4335]
9       0x7f5fe68a221b /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f21b) [0x7f5fe68a221b]
10      0x7f5fe6885ec2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x32ec2) [0x7f5fe6885ec2]
11      0x7f5fe6885f75 TRITONBACKEND_ModelInstanceInitialize + 101
12      0x7f612fba4116 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a0116) [0x7f612fba4116]
13      0x7f612fba5356 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1356) [0x7f612fba5356]
14      0x7f612fb89bd5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x185bd5) [0x7f612fb89bd5]
15      0x7f612fb8a216 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x186216) [0x7f612fb8a216]
16      0x7f612fb9531d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19131d) [0x7f612fb9531d]
17      0x7f612f207f68 /lib/x86_64-linux-gnu/libc.so.6(+0x99f68) [0x7f612f207f68]
18      0x7f612fb81adb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17dadb) [0x7f612fb81adb]
19      0x7f612fb8f865 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b865) [0x7f612fb8f865]
20      0x7f612fb94682 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190682) [0x7f612fb94682]
21      0x7f612fc77230 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x273230) [0x7f612fc77230]
22      0x7f612fc7a923 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x276923) [0x7f612fc7a923]
23      0x7f612fdc3e52 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3bfe52) [0x7f612fdc3e52]
24      0x7f612f472253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f612f472253]
25      0x7f612f202b43 /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f612f202b43]
26      0x7f612f293bb4 clone + 68
E1103 02:59:01.152736 81041 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:112)
sleepwalker2017 commented 1 year ago

I rebuilt the engine with max_batch_size = 8, and now the Triton server runs fine.

Why is that? I can't find any documentation that mentions this.
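
For reference, the rebuild was along these lines (a sketch only; the paths are placeholders and the exact flags depend on the model's example build.py and the TRT-LLM version):

    # Rebuild the engine with a smaller max batch size for 2-way tensor parallelism.
    python build.py \
        --model_dir ./hf_model \
        --output_dir ./trt_engines/bs8 \
        --world_size 2 \
        --max_batch_size 8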

byshiue commented 1 year ago

The program may really be out of memory when the batch size is 24. The larger max_batch_size is, the more workspace TensorRT allocates.

Lzhang-hub commented 1 year ago

I have the same error with the chatglm2 model when building with max_batch_size=8 and max_input_len=4096 on a single GPU; the defaults in build.py (max_batch_size=8 and max_input_len=1024) work fine.
One of the more confusing points is that run.py works fine with max_batch_size=8 and max_input_len=4096.

sleepwalker2017 commented 1 year ago

> The program may really be out of memory when the batch size is 24. The larger max_batch_size is, the more workspace TensorRT allocates.

Hi, I think there are two issues:

  1. The memory usage is larger than it really needs to be. For example, with FT (FasterTransformer) the model supports batch size 32, but with TensorRT-LLM maybe only 24 is supported.
  2. The memory usage is even larger with the Triton server. Although batch size 24 is supported by TRT-LLM offline, it is not supported by the Triton backend. I think that's a problem: the behavior isn't even consistent between TRT-LLM and its Triton backend. How can we decide on a runnable batch size? For now, only by trying batch sizes one by one (see the sketch below).
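
For what it's worth, a crude way to compare the headroom of the offline run and the Triton server is just to watch GPU memory while each of them loads and runs (a minimal sketch, not a real way to pick the batch size):

    # Poll GPU memory once per second while run.py or tritonserver is running,
    # to compare how much memory each actually uses at a given max_batch_size.
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1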
hubble-bubble commented 1 year ago

I have the same issue with chatglm2.

byshiue commented 11 months ago

In the Triton server we use the paged KV cache, and it pre-allocates a large buffer pool to hold the paged KV cache. You can control the pool size by changing kv_cache_free_gpu_mem_fraction.
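
If it helps, in the tensorrtllm_backend model repository this is set as a parameter in the tensorrt_llm model's config.pbtxt; a sketch (the path and the 0.5 value are only examples to illustrate shrinking the pool):

    # triton_model_repo/tensorrt_llm/config.pbtxt (example path):
    # give the paged KV cache pool a smaller fraction of the free GPU memory.
    parameters: {
      key: "kv_cache_free_gpu_mem_fraction"
      value: {
        string_value: "0.5"
      }
    }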

byshiue commented 11 months ago

Any update?

Burning-XX commented 10 months ago

So what can we do to solve it?

Burning-XX commented 10 months ago

> (quoting the original post and error log above)

Hi, did you find a way to solve this out-of-memory problem? I also hit it with llama-7b. How much GPU memory should I prepare to make it work when max_batch_size=9?

byshiue commented 10 months ago

Please share the scripts you used to build the engine, your configs for the server, and the scripts you used to launch the server.