Open sleepwalker2017 opened 1 year ago
I rebuilt the engine using max_batch_size = 8, and now the Triton server runs fine.
Why is that? I can't find any documentation that mentions it.
The program may really be running out of memory when the batch size is 24: the larger max_batch_size is, the more workspace TensorRT allocates.
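To make that scaling concrete, here is a minimal, illustrative Python sketch. The model dimensions are llama-7b-like placeholders, not values taken from this issue, and it only counts the KV cache; weights, activations and TensorRT's internal workspace come on top.

# Rough estimate of how memory scales with max_batch_size and sequence length.
# All dimensions below are hypothetical placeholders (roughly llama-7b-like).

def kv_cache_bytes(batch_size: int, seq_len: int,
                   num_layers: int = 32, num_kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes needed to hold K and V for every token of every sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K + V
    return batch_size * seq_len * per_token

if __name__ == "__main__":
    gib = 1024 ** 3
    for bs in (8, 24):
        print(f"max_batch_size={bs:2d}, seq_len=2048 -> "
              f"KV cache ~{kv_cache_bytes(bs, 2048) / gib:.1f} GiB")
    # With these placeholder dimensions: bs=8 -> ~8 GiB, bs=24 -> ~24 GiB,
    # which is why an engine that loads fine at 8 can exceed the GPU's
    # free memory at 24. The cost grows the same way with max_input_len.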
I hit the same error with a chatglm2 model built with max_batch_size=8 and max_input_len=4096 on a single GPU; the default max_batch_size=8 and max_input_len=1024 in build.py works fine.
One of the more confusing points is that run.py works fine with max_batch_size=8 and max_input_len=4096.
I have the same question with chatglm2.
Hi, I think there are two issues:
1. The program may really be OOM when the batch size is 24: the larger max_batch_size is, the more workspace TensorRT allocates.
2. In the Triton server we use the paged KV cache, and it pre-allocates a large buffer pool to hold it. You can control the pool size by changing kv_cache_free_gpu_mem_fraction.
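For reference, a minimal sketch of where that knob usually lives, assuming the standard tensorrtllm_backend model repository layout (the exact config.pbtxt varies between releases): in the tensorrt_llm model's config.pbtxt it is passed as a string-valued parameter, for example:

parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.5"
  }
}

A smaller fraction leaves more free memory for weights and TensorRT workspace, at the cost of a smaller paged KV-cache pool (fewer tokens cached concurrently).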
Any update?
So what can we do to solve it?
The engine works fine when I run offline inference with TRT-LLM in Python.
But when I serve it with Triton, it fails as follows.
Why is that? Does the Triton server use more memory than TRT-LLM offline inference?
I'm using max_batch_size 24 on 2 V100 GPUs.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12569 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12569 (MiB)
E1103 02:59:01.152656 81041 backend_model.cc:553] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:112)
1 0x7f5fe6888d85 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x35d85) [0x7f5fe6888d85]
2 0x7f5fe68e4f08 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x91f08) [0x7f5fe68e4f08]
3 0x7f5fe68d8590 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x85590) [0x7f5fe68d8590]
4 0x7f5fe69265a4 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xd35a4) [0x7f5fe69265a4]
5 0x7f5fe690947e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb647e) [0x7f5fe690947e]
6 0x7f5fe68c80fe /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x750fe) [0x7f5fe68c80fe]
7 0x7f5fe68a9b03 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56b03) [0x7f5fe68a9b03]
8 0x7f5fe68a4335 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51335) [0x7f5fe68a4335]
9 0x7f5fe68a221b /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f21b) [0x7f5fe68a221b]
10 0x7f5fe6885ec2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x32ec2) [0x7f5fe6885ec2]
11 0x7f5fe6885f75 TRITONBACKEND_ModelInstanceInitialize + 101
12 0x7f612fba4116 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a0116) [0x7f612fba4116]
13 0x7f612fba5356 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1356) [0x7f612fba5356]
14 0x7f612fb89bd5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x185bd5) [0x7f612fb89bd5]
15 0x7f612fb8a216 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x186216) [0x7f612fb8a216]
16 0x7f612fb9531d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19131d) [0x7f612fb9531d]
17 0x7f612f207f68 /lib/x86_64-linux-gnu/libc.so.6(+0x99f68) [0x7f612f207f68]
18 0x7f612fb81adb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17dadb) [0x7f612fb81adb]
19 0x7f612fb8f865 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b865) [0x7f612fb8f865]
20 0x7f612fb94682 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190682) [0x7f612fb94682]
21 0x7f612fc77230 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x273230) [0x7f612fc77230]
22 0x7f612fc7a923 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x276923) [0x7f612fc7a923]
23 0x7f612fdc3e52 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3bfe52) [0x7f612fdc3e52]
24 0x7f612f472253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f612f472253]
25 0x7f612f202b43 /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f612f202b43]
26 0x7f612f293bb4 clone + 68
E1103 02:59:01.152736 81041 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:112)
Hi, did you find a way to solve this out-of-memory problem? I also hit it with llama-7b. How much GPU memory should I prepare to make it work with max_batch_size=9?
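As a rough back-of-the-envelope figure, under assumptions not confirmed in this thread (fp16 weights, a 2048-token maximum sequence length): llama-7b weights take about 13 GB, and its KV cache costs roughly 0.5 MiB per token (2 * 32 layers * 32 heads * 128 head dim * 2 bytes), so 9 sequences of 2048 tokens add about 9 GB more, before TensorRT workspace and the Triton KV-cache pool. A single 24 GB GPU would be very tight; a 40 GB GPU, or a lower kv_cache_free_gpu_mem_fraction with shorter maximum lengths, leaves more headroom.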
Please share the script you use to build the engine, your configs for the server, and the script you use to launch the server.