Closed: josephrocca closed this issue 1 week ago.
BTW, the expected behavior is of course that if a client aborts a request, the server aborts the corresponding inference. That's how OpenAI and TGI work (and also vLLM, I think, but I haven't 100% confirmed that).
I added this:

```python
if await raw_request.is_disconnected():
    # Abort the request if the client disconnects.
    await VariableInterface.async_engine.stop_session(
        request.session_id)
    return
```

within the loop here:
And it now gets to `Batch 24 finished.` (instead of only 11) before stopping.
Here are the logs:
Hi @josephrocca, you may remove this line: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/turbomind/turbomind.py#L526. We do actually try/except such exceptions inside the AsyncEngine, but somehow the bug was introduced by that sleep line. In my testing, all the requests finished after removing it.
Yep, that fixed it! I removed the `await asyncio.sleep(0.002)` from the `async_cancel` function and it seems to work now. Thank you!
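One plausible reason the sleep mattered: `await asyncio.sleep(...)` is a suspension point, so a coroutine that sleeps inside a cancellation path can itself be cancelled at the sleep and never finish its cleanup. A minimal sketch of that effect (hypothetical names, not lmdeploy code):

```python
import asyncio

done = False

async def cleanup_with_sleep():
    global done
    # The sleep is a suspension point: if this task is cancelled,
    # CancelledError is raised here and the line below never runs.
    await asyncio.sleep(0.002)
    done = True

async def main():
    task = asyncio.create_task(cleanup_with_sleep())
    await asyncio.sleep(0)  # let the task start and suspend at its sleep
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())
print("cleanup finished:", done)  # prints: cleanup finished: False
```

Removing the sleep removes that suspension point, so the cleanup runs to completion before cancellation can interrupt it.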
Describe the bug
After a certain number of client-aborted requests (100-300, depending on the prompt and/or number of generated tokens), some sort of leak causes the TurboMind engine to stall.
Some early investigation of this issue was discussed in this comment and the comments below it.
Reproduction
Using an A100 with the latest official Docker image:
Run this command:
Then visit http://0.0.0.0:3000 in your browser, open the browser console with Ctrl+Shift+i, then paste the code below and press enter.

Based on the logs, you'll see that it stalls after about 11 batches. I.e. the above script logs `Batch 11 finished.` and no more batches are finished after that. The Python server actually still responds to requests, but no tokens are generated by TurboMind, so the requests just "hang".

Environment

Latest official Docker image (`openmmlab/lmdeploy:v0.4.2`) on an A100 80G Runpod machine.

Server / TurboMind logs:
See this comment and the ones below it for the `DEBUG` server logs that occur when the server stalls.