InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Segmentation fault (core dumped) after interrupting a streamed response under multi-GPU tensor parallelism #676

Closed oaksharks closed 1 year ago

oaksharks commented 1 year ago


Describe the bug

Hello developers,

When calling via the OpenAI-compatible API with streaming enabled and tensor parallelism tp=2, interrupting the stream on the client side and then accessing the service again raises Segmentation fault (core dumped).

With tp=1 the problem does not occur.

Expected behavior: a client-side interruption should not make the server exit abnormally.

Looking forward to your reply!

Reproduction

Code to reproduce the issue:

import openai

# Point the openai 0.x client at the lmdeploy api_server.
openai.api_base = "http://myserver/v1"
openai.api_key = "123456"

def openai_chat(prompt):
    response = openai.ChatCompletion.create(
        model="codellama",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        top_p=0.7,
        do_sample=True,
        repetition_penalty=1.0,
        stream=True,
        max_tokens=2048
    )
    # Read only the first few chunks, then break: this abandons the
    # streamed response and resets the connection on the server side.
    i = 0
    for chunk in response:
        i += 1
        if i == 3:
            break

# Call repeatedly; the service dies after the first aborted stream.
for i in range(10):
    openai_chat("hello")
    print(i)

Code output:

0
1

{
    "name": "APIConnectionError",
    "message": "Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))",
    "stack": "---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
File ~/miniconda3/envs/myenv/lib/python3.10/site-packages/urllib3/connectionpool.py:790, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    789 # Make the request on the HTTPConnection object
--> 790 response = self._make_request(
    791     conn,
    792     method,
    793     url,
    794     timeout=timeout_obj,
    795     body=body,
    796     headers=headers,
    797     chunked=chunked,
    798     retries=retries,
    799     response_conn=response_conn,
    800     preload_content=preload_content,
    801     decode_content=decode_content,
    802     **response_kw,
    803 )
    805 # Everything went great!

Launch command:

lmdeploy serve api_server ./workspace 0.0.0.0 --server_port 8080  --instance_num 32 --tp 2
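
The abort also reproduces without the openai package, which suggests the trigger is the connection reset itself rather than the client library. A minimal sketch using plain requests (the endpoint path and payload here are assumptions, inferred from the OpenAI-compatible api_base above):

import requests

api_base = "http://myserver/v1"  # assumed: same server as above

def aborted_stream():
    # Open a streaming chat completion, read a few chunks, then close
    # the connection early, which resets it on the server side.
    resp = requests.post(
        f"{api_base}/chat/completions",
        json={
            "model": "codellama",
            "messages": [{"role": "user", "content": "hello"}],
            "stream": True,
            "max_tokens": 2048,
        },
        stream=True,
    )
    for i, _line in enumerate(resp.iter_lines()):
        if i == 3:
            break
    resp.close()  # abort mid-stream instead of draining the response

for i in range(10):
    aborted_stream()
    print(i)

As a client-side stopgap, draining the rest of the stream before returning (instead of closing early) avoids the reset, at the cost of waiting for the full generation.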

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
LMDeploy: 0.0.14+
transformers: 4.35.0
gradio: 3.50.2
fastapi: 0.104.1
pydantic: 2.4.2

Error traceback

No response

lvhan028 commented 1 year ago

Huh? It can be accessed through the openai package? I have never tried that. @AllentDan please also gather the following information:

  1. Which codellama model is this?
  2. Which lmdeploy version?
  3. GPU model and memory size?
oaksharks commented 1 year ago

@lvhan028 Thanks for the quick reply.

OpenAI API usage: https://github.com/InternLM/lmdeploy/blob/7b20cfdf0ac3819dcf6978dc8ddb49b5d2cda5a9/docs/en/restful_api.md?plain=1#L9-L10

AllentDan commented 1 year ago

Reproduced. It seems that stopping a session on our side triggers the problem.

lvhan028 commented 1 year ago

Reproduced. It seems that stopping a session on our side triggers the problem.

@AllentDan please follow up on this.

AllentDan commented 1 year ago

It is the turbomind multi-GPU program: stopping a session crashes it with some probability. Running turbomind tp directly through app.py and pressing the cancel button can also trigger it with some probability.

AllentDan commented 1 year ago

@grimoire could you help look into the cause? Stopping under multi-GPU seems to crash occasionally.

zhongpei commented 1 year ago

In my tests, with --instance_num 1 multi-threaded access does not core dump:

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port 8000 --instance_num 1 --tp 2

With --instance_num 10 (any value greater than 1), multi-threaded access core dumps every time:

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port 8000 --instance_num 10 --tp 2
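
For reference, a sketch of that multi-threaded access pattern (thread count, endpoint path, and payload are assumptions; the helper mirrors the requests-based reproduction earlier in this thread):

import threading
import requests

api_base = "http://myserver/v1"  # assumed: same server as in the issue

def aborted_stream():
    # Start a streamed completion and abandon it after a few chunks.
    resp = requests.post(
        f"{api_base}/chat/completions",
        json={
            "model": "codellama",
            "messages": [{"role": "user", "content": "hello"}],
            "stream": True,
        },
        stream=True,
    )
    for i, _ in enumerate(resp.iter_lines()):
        if i == 3:
            break
    resp.close()

# With --instance_num > 1, concurrent aborts hit several turbomind
# instances at once, which is where the core dump shows up.
threads = [threading.Thread(target=aborted_stream) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()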

AllentDan commented 1 year ago

Fixed by PR 686.