Open lai-serena opened 3 months ago
I ran into the same problem, also with qwen2-72b. My requests use roughly 1/3 of your token count and I added rate limiting, but the hang likewise starts after about 3x as many requests (~900). Has this been resolved?
I also found that I hit the same problem when running the 7B model.
Same problem here.
Same issue here: 2×A100, 26B model.
@zhulinJulia24 could you help try to reproduce this issue?
Same problem here.
@lvhan028 This feels like a serious bug. With VL models I regularly hit this intermittent hang: no error is raised, the request simply hangs and never returns. It looks like a deadlock between accelerate and lmdeploy's collective communication, since requests are issued asynchronously and the ViT inference actually overlaps with the LLM inference in a pipeline. Trace log:
--- Stack for thread 23201439544896 ---
File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
work_item.run()
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/engine.py", line 108, in forward
outputs = self.model.forward(inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 174, in forward
return self._forward_func(images)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 155, in _forward_v1_5
outputs = self.model.extract_feature(outputs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_internvl_chat.py", line 216, in extract_feature
vit_embeds = self.vision_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 418, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 354, in forward
layer_outputs = encoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 164, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 363, in pre_forward
return send_to_device(args, self.execution_device), send_to_device(
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 174, in send_to_device
return honor_type(
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 81, in honor_type
return type(obj)(generator)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 175, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
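For anyone who wants to capture the same kind of trace: a per-thread stack dump like the one above can be produced in-process with just the standard library (sys._current_frames plus traceback). A minimal sketch, assuming you can add a signal handler to the server process so the stacks can be dumped from outside once it hangs:

import signal
import sys
import traceback

def dump_all_thread_stacks(signum=None, frame=None):
    # Print the current Python stack of every live thread; useful when the
    # process hangs without raising any error.
    for thread_id, stack in sys._current_frames().items():
        print(f"--- Stack for thread {thread_id} ---")
        traceback.print_stack(stack)

# Dump stacks on `kill -USR1 <pid>` so a hung server can be inspected from outside.
signal.signal(signal.SIGUSR1, dump_all_thread_stacks)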
Has a solution been found? This still happens quite often.
@irexyc may follow up on this issue
@lai-serena @DefTruth
Could you reduce the KV cache usage (--cache-max-entry-count 0.4 or even less) so that a larger GPU memory buffer (e.g. 5 GB) is left free, and then see whether the issue still occurs?
As for the requests that never return while GPU utilization sits at 99%, it is best to investigate on the server side: start the server with logging enabled (--log-level INFO). We have previously seen individual requests that failed to stop generating, which made the generation phase take a very long time.
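For example, a sketch of an adjusted launch command, reusing the reporter's model path and port and only changing the two flags mentioned above:

lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --model-name [model_name] --cache-max-entry-count 0.4 --log-level INFO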
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I am using the official Alibaba Qwen2-72B-Instruct-AWQ model and sending one request every 0.3 s, each about 3570 tokens. The first ~260 requests return quickly, but after roughly 260 requests the server suddenly gets stuck and it takes a very long time before any response comes back. No error appears during this period and GPU utilization stays at 99%. What could be causing this?
Reproduction
My launch command is:
lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --model-name [model_name] --cache-max-entry-count 0.8
Here is part of my code:
import json
import os
import time

import pandas as pd
import requests

def llm_result(query):
    json_data2 = {
        'model': [model_name],
        'messages': [  # all of the content together is about 3570 tokens
            {
                'role': 'system',
                'content': 'xxx'
            },
            {
                'role': 'user',
                'content': f'''xxx'''
            }
        ],
    }
    response = requests.post('http://[ip]:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message

def main():
    file = "abc.xlsx"
    excel_file = os.path.join(dirs, file)
    df = pd.read_excel(excel_file)
    datas = df.values
    for data in datas:
        content = data[5]
        message = llm_result(content)
        time.sleep(0.3)
        print(message)
Environment
sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.0
LMDeploy: 0.5.1+unknown
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    0-35,72-107     0               N/A
GPU1    NV12     X      0-35,72-107     0               N/A
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Error traceback
No response
I cannot reproduce it on A100 80G
My script is:
import requests
import json
import time


def llm_result(query):
    json_data2 = {
        'model': 'qwen2',
        'messages': [
            {
                'role': 'system',
                'content': ''
            },
            {
                'role': 'user',
                'content': query
            }
        ],
    }
    headers = {'Content-Type': 'application/json'}
    response = requests.post('http://0.0.0.0:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message


datas = ["你好,你是谁"*1000]*600
for data in datas:
    content = data
    # print(content)
    start_time = time.time()
    message = llm_result(content)
    end_time = time.time()
    task_duration_seconds = round(end_time - start_time, 2)
    time.sleep(0.3)
    print(task_duration_seconds)
The input content is about 4000 tokens, and each response takes roughly 1-3 s. Can you try adding --cache-max-entry-count 0.4 when starting the api server?
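On the client side, a per-request timeout turns a hang into a visible error instead of blocking forever, which helps pinpoint exactly which request got stuck. A minimal sketch (llm_result_with_timeout and the 120 s limit are illustrative additions, not part of the original script):

import requests

def llm_result_with_timeout(query, timeout_s=120):
    # Same request as in the script above, but fail fast if the server
    # stops responding, so the stuck request can be identified and logged.
    json_data2 = {
        'model': 'qwen2',
        'messages': [{'role': 'user', 'content': query}],
    }
    try:
        response = requests.post('http://0.0.0.0:6005/v1/chat/completions',
                                 headers={'Content-Type': 'application/json'},
                                 json=json_data2,
                                 timeout=timeout_s)
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.Timeout:
        print(f"request timed out after {timeout_s}s")
        return None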