Thanks
I've tried the "fix" mentioned in #375.
It results in a CUDA runtime error: out of memory
after handling only 6 conversations.
(2x4090, 2x24GB VRAM, 13B model in 16-bit)
So far I cannot get LMDeploy to work reliably.
That is not a fix for your case. What you need is to set renew_session
to True: https://github.com/InternLM/lmdeploy/blob/55764e0b33d8b9298f68b77484bab3832696c010/lmdeploy/serve/openai/api_server.py#L97
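For illustration only, a minimal sketch of such a request in Python — the renew_session field name comes from the linked api_server.py, but the port, model name, and exact payload shape are assumptions, not verified against this lmdeploy version:

import requests

# Hypothetical sketch: pass renew_session=True in the request body so the
# server starts a fresh session instead of reusing a stale one.
# Port 23333 and the model name are assumptions for this example.
payload = {
    "model": "llama2-13b",
    "messages": [{"role": "user", "content": "Hello"}],
    "renew_session": True,
}
response = requests.post("http://0.0.0.0:23333/v1/chat/completions", json=payload)
print(response.json())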
If you still want to use the above fix with a random instance_id, you should set instance_num to a smaller value, for example:

python -m lmdeploy.serve.openai.api_server --instance_num 1 --tp=2 --server_name=0.0.0.0 ./workspace
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
Describe the bug
The OpenAI-compatible API server returns empty completions after answering correctly a few times.
In my latest test it responded correctly 15 times before it broke. After that it immediately responds with a 200 OK and an empty completion. The GPU load drops back to zero, so the server is not doing any work. It can only be resolved by restarting the server, which makes it unusable. My expectation would be that the server handles as many conversations in parallel as its batch size allows and holds the remaining HTTP connections until a free slot becomes available.
The server does not print any errors to the console, and there is no verbose or debug flag either.
Reproduction
Deploy any Llama 2 model or a derivative, then start the OpenAI API server.
For example, in my case with two 4090 GPUs:
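The exact launch command is not shown here; based on the command suggested above, a plausible invocation for this setup would be:

python -m lmdeploy.serve.openai.api_server --tp=2 --server_name=0.0.0.0 ./workspace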
Python script to drive the server via the OpenAI async client protocol:
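The original script is not reproduced here; below is a minimal sketch of such a driver, assuming the openai v1 async client, the default port 23333, and a placeholder model name:

import asyncio
from openai import AsyncOpenAI

# Point the official OpenAI client at the local server.
# Base URL, port, and model name are assumptions for this sketch.
client = AsyncOpenAI(base_url="http://localhost:23333/v1", api_key="none")

async def main() -> None:
    # Send sequential chat completions to reproduce the failure:
    # after some number of requests the server starts returning
    # 200 OK with an empty completion.
    for i in range(30):
        response = await client.chat.completions.create(
            model="llama2-13b",
            messages=[{"role": "user", "content": f"Say the number {i} in words."}],
        )
        print(i, repr(response.choices[0].message.content))

asyncio.run(main())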
Error traceback
No response