InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Sudden hang after more than 200 consecutive requests #2231

Open lai-serena opened 3 months ago

lai-serena commented 3 months ago

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I am using the official Alibaba qwen2-72b-instruct-awq model and sending one request every 0.3 s, each request about 3570 tokens. The first ~260 requests return quickly, but after roughly 260 requests the server suddenly hangs and takes a very long time to respond. No errors appear during this period, and GPU utilization shows 99%. What could be the cause? (screenshot attached)

Reproduction

My launch command is: lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --model-name [model_name] --cache-max-entry-count 0.8

Here is part of my code:

import json
import os
import time

import pandas as pd
import requests

# headers, dirs, and model_name are defined elsewhere; prompt contents are redacted ('xxx')

def llm_result(query):
    json_data2 = {
        'model': [model_name],
        'messages': [
            # all of the content combined is about 3570 tokens
            {
                'role': 'system',
                'content': 'xxx'
            },
            {
                'role': 'user',
                'content': f'''xxx'''
            }
        ],
    }
    response = requests.post('http://[ip]:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message

def main():
    file="abc.xlsx"
    excel_file = os.path.join(dirs, file)
    df = pd.read_excel(excel_file)
    datas = df.values
    for data in datas:
        content=data[5]
        message=llm_result(content)
        time.sleep(0.3)
        print(message)
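
To help localize where the stall occurs, below is a minimal, hedged variation of the client above that adds a hard request timeout and per-request latency logging. The helper name, timeout value, headers, and MODEL_NAME placeholder are illustrative assumptions, not part of the original script.

import json
import time

import requests

API_URL = 'http://[ip]:6005/v1/chat/completions'  # same redacted endpoint as above
HEADERS = {'Content-Type': 'application/json'}
MODEL_NAME = 'model_name'  # hypothetical placeholder; use the name passed to --model-name

def llm_result_timed(query, timeout_s=120):
    # Same request as llm_result, but a hang surfaces as a Timeout exception
    # instead of blocking forever, and each request's latency is printed.
    payload = {
        'model': MODEL_NAME,
        'messages': [
            {'role': 'system', 'content': 'xxx'},
            {'role': 'user', 'content': query},
        ],
    }
    start = time.time()
    try:
        response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=timeout_s)
        message = json.loads(response.text)['choices'][0]['message']['content']
    except requests.exceptions.Timeout:
        message = None
        print(f'request timed out after {timeout_s}s')
    print(f'latency: {time.time() - start:.2f}s')
    return message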

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0
LMDeploy: 0.5.1+unknown
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology: 
    GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    0-35,72-107 0       N/A
GPU1    NV12     X  0-35,72-107 0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

No response

ChunyiY commented 3 months ago

I ran into the same problem, also with qwen2-72b. My requests use roughly 1/3 of your token count (I capped them), but the server likewise starts hanging after about 3x as many requests (around 900). Have you found a solution?

ChunyiY commented 3 months ago

I also found that the same problem occurs when I run the 7b model.

bltcn commented 3 months ago

Same problem here.

fanghostt commented 2 months ago

Same issue here with 2*A100 and a 26B model.

lvhan028 commented 2 months ago

@zhulinJulia24 could you help try to reproduce this issue?

DefTruth commented 2 months ago

Same problem here.

DefTruth commented 2 months ago

@lvhan028 This feels like a serious bug. With VL models I frequently hit this intermittent hang: there is no error, the request just hangs and never returns. It looks like accelerate and lmdeploy's collective communication are deadlocking each other, because requests are issued asynchronously and the ViT inference and the LLM inference are actually pipelined (overlapped). Trace log:

-- Stack for thread 23201439544896 ---
  File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
    work_item.run()
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/engine.py", line 108, in forward
    outputs = self.model.forward(inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 174, in forward
    return self._forward_func(images)
  File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 155, in _forward_v1_5
    outputs = self.model.extract_feature(outputs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_internvl_chat.py", line 216, in extract_feature
    vit_embeds = self.vision_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 418, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 354, in forward
    layer_outputs = encoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 164, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 363, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 174, in send_to_device
    return honor_type(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 81, in honor_type
    return type(obj)(generator)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 175, in <genexpr>
    tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
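
(For reference: one way to capture per-thread Python stacks like the dump above from a process that appears hung is the standard-library faulthandler module. This is only a sketch of how such a dump could be obtained, assuming the snippet can be added to the server process at startup; the trace above may well have been produced by a different tool.)

import faulthandler
import signal

# Dump the Python stack of every thread to stderr when the process receives
# SIGUSR1, e.g. `kill -USR1 <server pid>` while the server appears stuck.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump once programmatically at a suspected hang point:
# faulthandler.dump_traceback(all_threads=True)
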
lai-serena commented 2 months ago

(quoting DefTruth's comment and stack trace above)

Have you found a solution? This happens quite often for me.

lvhan028 commented 2 months ago

@irexyc may follow up this issue

irexyc commented 2 months ago

@lai-serena @DefTruth

Could you reduce the KV cache usage (--cache-max-entry-count 0.4 or lower) to leave a larger free GPU memory buffer (say 5 GB), and then observe again?

As for the GPU sitting at 99% without returning, it is best to investigate on the server side: start the server with logging enabled (--log-level INFO). We have previously seen cases where individual requests failed to stop, which made the generation phase take a very long time.
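
As an illustration (not taken from this thread), the same KV-cache setting can also be applied through LMDeploy's Python pipeline API; the CLI equivalent with the suggested flags is shown in the comment. The path and tp value follow the original launch command, and the exact API surface may differ between LMDeploy versions.

# CLI equivalent, reusing the flags mentioned in this thread:
#   lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ \
#       --server-port 6005 --tp 2 --cache-max-entry-count 0.4 --log-level INFO
from lmdeploy import pipeline, TurbomindEngineConfig

# Shrink the KV-cache ratio from 0.8 to 0.4 to leave more free GPU memory.
backend_config = TurbomindEngineConfig(tp=2, cache_max_entry_count=0.4)
pipe = pipeline('/workspace/qwen/Qwen2-72B-Instruct-AWQ', backend_config=backend_config)
print(pipe(['你好,你是谁']))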

zhulinJulia24 commented 2 months ago

(quoting the original issue report above)

I cannot reproduce it on A100 80G.

My script is:

import requests
import json
import time
def llm_result(query):
    json_data2 = {
        'model': 'qwen2',
        'messages': [
            {
                'role':'system',
                'content':''
            },
            {
                'role': 'user',
                'content': query
            }
        ],
    }
    headers = {'Content-Type':'application/json'}
    response = requests.post('http://0.0.0.0:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message

datas = ["你好,你是谁"*1000]*600

for data in datas:
    content=data
    #print(content)
    start_time = time.time()
    message=llm_result(content)
    end_time = time.time()
    task_duration_seconds = round(end_time - start_time, 2)
    time.sleep(0.3)
    print(task_duration_seconds)

The input content is about 4000 tokens, and each response takes roughly 1-3 s. Can you try adding --cache-max-entry-count 0.4 when starting the api_server?