InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Segmentation fault (core dumped) after interrupting a streamed response under multi-GPU tensor parallelism #676

Closed oaksharks closed 1 year ago

oaksharks commented 1 year ago


Describe the bug

Hello developers,

When calling via the OpenAI-compatible API with streaming enabled and tensor parallelism tp=2, interrupting the stream on the client side and then accessing the service again raises Segmentation fault (core dumped).

With tp=1 the problem does not occur.

Expected behavior: a client-side interruption should not make the server exit abnormally.

Looking forward to your reply!

Reproduction

Code to reproduce the issue:

import openai

# Point the openai 0.x client at the lmdeploy api_server.
openai.api_base = "http://myserver/v1"
openai.api_key = "123456"

def openai_chat(prompt):
    response = openai.ChatCompletion.create(
        model="codellama",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        top_p=0.7,
        do_sample=True,
        repetition_penalty=1.0,
        stream=True,
        max_tokens=2048
    )
    # Read only the first few chunks, then break: this abandons the
    # streamed response and resets the connection on the server side.
    i = 0
    for chunk in response:
        i += 1
        if i == 3:
            break

# Call repeatedly; the service dies after the first aborted stream.
for i in range(10):
    openai_chat("hello")
    print(i)

Code output:

0
1

{
    "name": "APIConnectionError",
    "message": "Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))",
    "stack": "---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
File ~/miniconda3/envs/myenv/lib/python3.10/site-packages/urllib3/connectionpool.py:790, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    789 # Make the request on the HTTPConnection object
--> 790 response = self._make_request(
    791     conn,
    792     method,
    793     url,
    794     timeout=timeout_obj,
    795     body=body,
    796     headers=headers,
    797     chunked=chunked,
    798     retries=retries,
    799     response_conn=response_conn,
    800     preload_content=preload_content,
    801     decode_content=decode_content,
    802     **response_kw,
    803 )
    805 # Everything went great!

Launch command:

lmdeploy serve api_server ./workspace 0.0.0.0 --server_port 8080  --instance_num 32 --tp 2
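
The abort also reproduces without the openai package, which suggests the trigger is the connection reset itself rather than the client library. A minimal sketch using plain requests (the endpoint path and payload here are assumptions, inferred from the OpenAI-compatible api_base above):

import requests

api_base = "http://myserver/v1"  # assumed: same server as above

def aborted_stream():
    # Open a streaming chat completion, read a few chunks, then close
    # the connection early, which resets it on the server side.
    resp = requests.post(
        f"{api_base}/chat/completions",
        json={
            "model": "codellama",
            "messages": [{"role": "user", "content": "hello"}],
            "stream": True,
            "max_tokens": 2048,
        },
        stream=True,
    )
    for i, _line in enumerate(resp.iter_lines()):
        if i == 3:
            break
    resp.close()  # abort mid-stream instead of draining the response

for i in range(10):
    aborted_stream()
    print(i)

As a client-side stopgap, draining the rest of the stream before returning (instead of closing early) avoids the reset, at the cost of waiting for the full generation.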

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
LMDeploy: 0.0.14+
transformers: 4.35.0
gradio: 3.50.2
fastapi: 0.104.1
pydantic: 2.4.2

Error traceback

No response

lvhan028 commented 1 year ago

Huh? It can be accessed through the openai package? I have never tried that. @AllentDan please also gather the following information:

  1. Which codellama model is this?
  2. Which lmdeploy version?
  3. GPU model and memory size?
oaksharks commented 1 year ago

@lvhan028 Thanks for the quick reply.

OpenAI API usage: https://github.com/InternLM/lmdeploy/blob/7b20cfdf0ac3819dcf6978dc8ddb49b5d2cda5a9/docs/en/restful_api.md?plain=1#L9-L10

AllentDan commented 1 year ago

Reproduced. It seems that stopping a session on our side triggers the problem.

lvhan028 commented 1 year ago

Reproduced. It seems that stopping a session on our side triggers the problem.

@AllentDan please follow up on this.

AllentDan commented 1 year ago

It is the turbomind multi-GPU program: stopping a session crashes it with some probability. Running turbomind tp directly through app.py and pressing the cancel button can also trigger it with some probability.

AllentDan commented 1 year ago

@grimoire could you help look into the cause? Stopping under multi-GPU seems to crash occasionally.

zhongpei commented 1 year ago

In my tests, with --instance_num 1 multi-threaded access does not core dump:

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port 8000 --instance_num 1 --tp 2

With --instance_num 10 (any value greater than 1), multi-threaded access core dumps every time:

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port 8000 --instance_num 10 --tp 2
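
For reference, a sketch of that multi-threaded access pattern (thread count, endpoint path, and payload are assumptions; the helper mirrors the requests-based reproduction earlier in this thread):

import threading
import requests

api_base = "http://myserver/v1"  # assumed: same server as in the issue

def aborted_stream():
    # Start a streamed completion and abandon it after a few chunks.
    resp = requests.post(
        f"{api_base}/chat/completions",
        json={
            "model": "codellama",
            "messages": [{"role": "user", "content": "hello"}],
            "stream": True,
        },
        stream=True,
    )
    for i, _ in enumerate(resp.iter_lines()):
        if i == 3:
            break
    resp.close()

# With --instance_num > 1, concurrent aborts hit several turbomind
# instances at once, which is where the core dump shows up.
threads = [threading.Thread(target=aborted_stream) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()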

AllentDan commented 1 year ago

Fixed by PR 686.