Closed · Wimeremce7 closed this issue 5 months ago
A single GPU works fine.
Based on the information you've provided and similar issues in the Langchain-Chatchat repository, the problem you're experiencing is likely related to the multi-GPU configuration, particularly the handling of tensor parallelism and memory allocation.
Here are some steps you can take to resolve this issue:
Tensor Parallel Size and Number of GPUs: ensure that `args.tensor_parallel_size` matches the number of GPUs you intend to use (`args.num_gpus`). This keeps model parallelism configured so the workload is distributed evenly across the GPUs:

```python
if args.num_gpus > 1:
    args.tensor_parallel_size = args.num_gpus
```
Memory Management: check `args.swap_space`, `args.gpu_memory_utilization`, and `args.max_gpu_memory`. These are critical for how memory is allocated and used across GPUs; set them to values that match your system's capabilities and your model's requirements to prevent memory overflow or inefficient memory usage:

```python
args.swap_space = 4  # CPU swap space per GPU, in GiB
args.gpu_memory_utilization = 0.90
args.max_gpu_memory = "22GiB"
```
Concurrency and Batch Sizes: the `args.limit_worker_concurrency` and `args.max_num_batched_tokens` settings control how many operations run concurrently and how large batches may grow. Adjusting these helps manage the load on each GPU and can prevent garbled responses caused by overloading:

```python
args.limit_worker_concurrency = 5
args.max_num_batched_tokens = None  # adjust based on your GPU capabilities
```
You can find these configurations in the startup.py file of the Langchain-Chatchat repository.
Regarding the `infer_turbo` parameter in the FastAPI server configuration: it selects the inference acceleration framework. When set to `'vllm'`, the vLLM engine (a high-throughput LLM inference library) is used to accelerate inference. However, the configuration comments note that vLLM may cause communication issues with HuggingFace, and its support for some models is not yet mature, which is why it is disabled (`False`) by default. How effective `infer_turbo` is in a multi-GPU context depends on its compatibility with the specific model being served and how well it distributes computation across the available GPUs.
Please try these suggestions and let me know if they help resolve the issue. If the problem persists, it would be helpful to have more detailed error logs or screenshots to better understand and address the issue.
Has anyone else run into this? Multi-GPU works with vllm, but multi-GPU with fschat produces garbled output.
Same here, +1: with multi-GPU deployment it's either garbled output or backend errors.
Same situation here, deployed with the official Docker image.
But under WSL2 + cu118, fschat multi-GPU works fine.
Use vllm as the backend.
Problem Description: when multiple GPUs share the VRAM load, the replies are garbled.
Environment Information
The reply interface shows: The console output shows:
The server_config.py configuration is as follows (only this section was changed):
Could you please take a look at what is going wrong and which settings need to be changed?