Closed: josephrocca closed this issue 1 week ago.
BTW, the expected behavior is of course that if a client aborts a request, the server aborts the corresponding inference. That's how OpenAI and TGI work (and also vLLM, I think, but I haven't 100% confirmed that).
I added this:

```python
if await raw_request.is_disconnected():
    # Abort the request if the client disconnects.
    await VariableInterface.async_engine.stop_session(
        request.session_id)
    return
```

within the loop here:
And it now gets to `Batch 24 finished.` (instead of only 11) before stopping.
Here are the logs:
Hi @josephrocca, you may remove this line: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/turbomind/turbomind.py#L526. We do actually try/except such exceptions inside the AsyncEngine, but somehow the bug was introduced by that sleep line. In my testing, all the requests finished after removing it.
Yep, that fixed it! I removed the `await asyncio.sleep(0.002)` from the `async_cancel` function and it seems to work now. Thank you!
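One plausible reason the sleep mattered: `await asyncio.sleep(...)` is a suspension point, so a coroutine that sleeps inside a cancellation path can itself be cancelled at the sleep and never finish its cleanup. A minimal sketch of that effect (hypothetical names, not lmdeploy code):

```python
import asyncio

done = False

async def cleanup_with_sleep():
    global done
    # The sleep is a suspension point: if this task is cancelled,
    # CancelledError is raised here and the line below never runs.
    await asyncio.sleep(0.002)
    done = True

async def main():
    task = asyncio.create_task(cleanup_with_sleep())
    await asyncio.sleep(0)  # let the task start and suspend at its sleep
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())
print("cleanup finished:", done)  # prints: cleanup finished: False
```

Removing the sleep removes that suspension point, so the cleanup runs to completion before cancellation can interrupt it.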
Describe the bug
After a certain number of client-aborted requests (100-300, depending on the prompt and/or number of generated tokens), some sort of leak causes the TurboMind engine to stall.
Some early investigation of this issue was discussed in this comment and the comments below it.
Reproduction
Using an A100 with the latest official Docker image:
Run this command:
Then visit http://0.0.0.0:3000 in your browser, open the browser console with Ctrl+Shift+i, then paste the code below and press enter.

Based on the logs, you'll see that it stalls after about 11 batches. I.e. the above script logs `Batch 11 finished.` and no more batches are finished after that. The Python server actually still responds to requests, but no tokens are generated by TurboMind, so the requests just "hang".

Environment

Latest official Docker image (`openmmlab/lmdeploy:v0.4.2`) on an A100 80G Runpod machine.

Server / TurboMind logs:
See this comment and the ones below it for the `DEBUG` server logs that occur when the server stalls.