Closed: ujjawal-ti closed this issue 2 months ago
Token generation is so slow that I'm getting an InternalServerError due to a 504: Gateway time-out.
A Gateway time-out is not a problem with llama.cpp itself; it comes from your reverse proxy.
You could also try other runtimes such as vLLM to see whether you hit the same problem.
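If the reverse proxy in front of the llama.cpp server is nginx, one common mitigation is to raise the upstream timeouts so that a long generation is not cut off at the default 60 s. This is a sketch, not the poster's actual config: the directive names are standard nginx, but the backend address and the 600 s value are illustrative assumptions.

```nginx
# Illustrative nginx snippet: extend upstream timeouts for a slow
# llama.cpp backend. The port and timeout values are assumptions;
# tune them for your deployment.
location / {
    proxy_pass http://127.0.0.1:8080;   # llama.cpp server (assumed port)
    proxy_read_timeout 600s;            # wait up to 10 min for the response
    proxy_send_timeout 600s;
}
```

With default settings nginx returns exactly this kind of 504 when the backend takes longer than `proxy_read_timeout` to start or continue sending the response body.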
Thanks @ngxson for pointing it out. It is resolved now.
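A related way to relieve timeout pressure, independent of the proxy settings, is to stream the response: llama.cpp's built-in server accepts `"stream": true` on its `/completion` endpoint, so the proxy sees a continuous flow of bytes instead of one long-delayed body. A hedged sketch (host, port, and prompt are placeholder assumptions):

```shell
# Build a request payload for llama.cpp's /completion endpoint with
# streaming enabled, so tokens are delivered as server-sent events
# rather than one response after the full generation finishes.
cat > payload.json <<'EOF'
{"prompt": "Write a long story.", "n_predict": 2048, "stream": true}
EOF
cat payload.json

# Against a running server, -N disables curl's output buffering so
# events print as they arrive (commented out here; requires the server):
# curl -N -H 'Content-Type: application/json' --data @payload.json \
#      http://127.0.0.1:8080/completion
```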
What happened?
I'm trying to generate longer texts using the Llama3-70B 8-bit quantized model hosted on an A100 server (80 GB GPU). Token generation is so slow that I'm getting an InternalServerError due to a 504: Gateway time-out. It works fine for smaller generations (< 1k tokens). I observed the following after checking the server logs. I'm using the default settings with the llama.cpp:full-cuda docker image. Any idea how to resolve the issue?

Name and Version
I'm using the default settings with the llama.cpp:full-cuda docker image.

What operating system are you seeing the problem on?
Linux
Relevant log output