chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): a local-knowledge-based RAG and Agent application built on Langchain with LLMs such as ChatGLM, Qwen, and Llama.
Apache License 2.0

Garbled model replies with multi-GPU inference #3217

Closed Wimeremce7 closed 5 months ago

Wimeremce7 commented 7 months ago

Problem Description: when multiple GPUs share the model's VRAM, the replies are garbled.

Environment Information

The chat reply looks like this: image. The console output is: image.

The server_config.py configuration is: image. Only this part was changed: image.

Could you help me figure out what is going wrong and which configuration options need to be changed?

Wimeremce7 commented 7 months ago

A single GPU works fine.

dosubot[bot] commented 7 months ago

Based on the information you've provided and the similar issues I found in the Langchain-Chatchat repository, it seems like the problem you're experiencing might be related to the configuration for multi-GPU support, particularly in the handling of tensor parallelism and memory allocation.

Here are some steps you can take to resolve this issue:

  1. Tensor Parallel Size and Number of GPUs: Ensure that the args.tensor_parallel_size matches the number of GPUs you intend to use (args.num_gpus). This setting ensures that the model parallelism is correctly configured to distribute the workload evenly across the GPUs.

    if args.num_gpus > 1:
        args.tensor_parallel_size = args.num_gpus
  2. Memory Management: Check settings such as args.swap_space, args.gpu_memory_utilization, and args.max_gpu_memory. These are critical for managing how memory is allocated and used across GPUs. Ensure these are set to values that match your system's capabilities and the requirements of your model to prevent memory overflow or inefficient memory usage.

    args.swap_space = 4  # GiB
    args.gpu_memory_utilization = 0.90
    args.max_gpu_memory = "22GiB"
  3. Concurrency and Batch Sizes: The args.limit_worker_concurrency and args.max_num_batched_tokens settings control how many concurrent operations can run and how large batches can be. Adjusting these values can help manage the load on each GPU and prevent garbled responses due to overloading.

    args.limit_worker_concurrency = 5
    args.max_num_batched_tokens = None  # Adjust based on your GPU capabilities

You can find these configurations in the startup.py file in the Langchain-Chatchat repository.
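Putting the three steps together, here is a minimal consolidated sketch. The argument names follow the snippets above; the helper name apply_multi_gpu_args and the concrete values (2 GPUs, 22 GiB per card) are assumptions to adapt to your own hardware.

    def apply_multi_gpu_args(args, num_gpus: int = 2):
        """Hedged sketch: set the multi-GPU related fields on an argparse-style namespace."""
        args.num_gpus = num_gpus
        if args.num_gpus > 1:
            # Keep tensor parallelism in sync with the number of GPUs actually used.
            args.tensor_parallel_size = args.num_gpus

        # Memory management: cap per-GPU usage and reserve CPU swap space.
        args.swap_space = 4                  # GiB of CPU swap per GPU
        args.gpu_memory_utilization = 0.90   # fraction of each GPU's VRAM to use
        args.max_gpu_memory = "22GiB"        # per-GPU cap; assumes ~24 GB cards

        # Concurrency: limit simultaneous requests so a single worker is not overloaded.
        args.limit_worker_concurrency = 5
        args.max_num_batched_tokens = None   # let the backend choose a safe default
        return args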

Regarding the infer_turbo parameter in the FastAPI server configuration: it selects the inference acceleration backend. When set to 'vllm', the vLLM high-throughput inference engine is used for acceleration. The configuration comments note, however, that using vLLM may cause communication issues with HuggingFace, and that vLLM support for some models is not yet mature, which is why it is disabled (False) by default. How well infer_turbo works in a multi-GPU setup depends on its compatibility with the specific models being used and on how effectively it can distribute the computation across the available GPUs.
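For reference, a hedged sketch of how this switch and the multi-GPU fields might look in configs/server_config.py; the keys mirror the ones discussed in this thread, while the surrounding structure and the values shown (GPUs 0 and 1, 22 GiB cap) are assumptions rather than the upstream defaults.

    # Hedged sketch of the relevant block in configs/server_config.py (values are placeholders).
    FSCHAT_MODEL_WORKERS = {
        "default": {
            "device": "cuda",
            # Acceleration backend: False = plain fschat/transformers inference,
            # "vllm" = use the vLLM engine (recommended later in this thread for multi-GPU).
            "infer_turbo": "vllm",
            # Multi-GPU settings: which GPUs to use (CUDA_VISIBLE_DEVICES format) and how many.
            "gpus": "0,1",
            "num_gpus": 2,
            "max_gpu_memory": "22GiB",
        },
    }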

Please try these suggestions and let me know if they help resolve the issue. If the problem persists, it would be helpful to have more detailed error logs or screenshots to better understand and address the issue.


Wimeremce7 commented 6 months ago

Has anyone else run into this? Multi-GPU with vllm works fine, but multi-GPU with fschat produces garbled output.

hyxrpk commented 6 months ago

Same issue here (+1): with a multi-GPU deployment it's either garbled output or backend errors.

Zephyr69 commented 5 months ago

Same issue here, with the official Docker deployment.

However, multi-GPU fschat works fine under WSL2 + cu118.

zRzRzRzRzRzRzR commented 5 months ago

Use vllm as the backend.