chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): a local-knowledge-based RAG and Agent application built with Langchain and LLMs such as ChatGLM, Qwen, and Llama.
Apache License 2.0
31.16k stars 5.44k forks

[BUG] Uneven GPU memory usage (40% and 60%) and uneven GPU load during inference (35% and 59%) #3652

Closed lhtpluto closed 5 months ago

lhtpluto commented 5 months ago

Problem Description: GPU memory usage is uneven across the two cards (40% and 60%), and GPU load during inference is also uneven (35% and 59%).

I tried to improve this via device-map placement by setting the device_map parameter (e.g. "auto", "balanced", "balanced_low_0", "sequential"), but I cannot find where in the code to change it.

Two RTX 4090s, Ubuntu 22.04
Driver Version: 550.54.14, CUDA Version: 12.4
transformers 4.39.3
torch 2.1.2, torchaudio 2.1.2, torchvision 0.16.2
autoawq 0.2.3+cu121 (/home/test/Langchain-Chatchat/AutoAWQ)
autoawq-kernels 0.0.6+cu121 (/home/test/Langchain-Chatchat/AutoAWQ_kernels-main)

Environment Information

Current text splitter: ChineseRecursiveTextSplitter
Current LLM model: ['Qwen1.5-32B-Chat-AWQ'] @ cuda, with config {'device': 'cuda', 'host': '0.0.0.0', 'infer_turbo': False, 'model_path': 'Qwen/Qwen1.5-32B-Chat-AWQ', 'model_path_exists': True, 'port': 20002}
Current Embeddings model: bce-embedding-base_v1 @ cuda

(Two screenshots attached: 2024-04-07 16:18:43 and 2024-04-07 16:22:16)

dosubot[bot] commented 5 months ago

Hey @lhtpluto! Great to see you back here, diving into the depths of tech with us again. How's the journey been since our last encounter?

Based on the information you've provided, it seems that the uneven GPU usage might be due to how the workload is distributed across your GPUs. Here are some steps you can take to address this issue:

  1. Ensure Equal Workload Distribution: Langchain-Chatchat lets you specify the number of GPUs and their IDs, but it does not explicitly show how the GPUs are managed during inference. Make sure the application is actually using both GPUs: check that launch_all (or whichever part of the code handles GPU placement) distributes the inference work, for example by splitting tasks and assigning them to each GPU.

  2. Utilize PyTorch or TensorFlow's Built-in Functions: If Langchain-Chatchat is built on top of PyTorch or TensorFlow, use their built-in multi-GPU support: torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel for PyTorch, and tf.distribute.Strategy for TensorFlow (see the DataParallel sketch after this list).

  3. Explicit GPU Memory Management: The --max-gpu-memory argument suggests there is a mechanism to limit GPU memory usage. Ensure it is set correctly so that one GPU is not overloaded, and adjust the per-GPU allocation to the task's requirements (see the device_map/max_memory sketch after this list).

  4. Review and Optimize the Model Splitting (if applicable): If the inference model is split across GPUs, review the splitting logic. Uneven model splitting can lead to one GPU doing more work than the other.

  5. Monitor and Debug GPU Utilization: Use tools like NVIDIA's nvidia-smi to monitor GPU utilization and memory usage in real time. This can help you verify whether the load-balancing adjustments are effective.

  6. Update to Latest Software Versions: Ensure that Langchain-Chatchat, along with its dependencies and the underlying deep learning frameworks (PyTorch, TensorFlow, etc.), are updated to their latest versions. Sometimes, performance improvements and bug fixes in newer versions can resolve such issues.
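As a minimal illustration of point 2, here is a hedged PyTorch sketch of torch.nn.DataParallel. It assumes two visible GPUs and uses a toy module rather than anything from Langchain-Chatchat; note that DataParallel replicates the whole module on each GPU and splits the input batch, so it balances per-batch compute, not model memory.

```python
import torch
import torch.nn as nn

# Toy module standing in for an inference model (not Langchain-Chatchat code).
model = nn.Linear(1024, 1024)

# Replicate the module on GPUs 0 and 1; inputs are scattered along dim 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()

x = torch.randn(8, 1024, device="cuda")
y = model(x)        # the batch of 8 is processed 4/4 on the two GPUs
print(y.shape)      # torch.Size([8, 1024])
```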
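For points 3 and 4 (and the device_map values mentioned in the original report), the usual way to rebalance a model sharded by Hugging Face transformers/accelerate is the device_map and max_memory arguments to from_pretrained. A hedged sketch follows; the 22GiB caps are assumptions for two 24 GB cards, not values taken from this issue, and this is not the code path Langchain-Chatchat itself uses.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-32B-Chat-AWQ"  # model_path from the issue's config

# "balanced" asks accelerate to spread layers evenly across visible GPUs;
# max_memory caps what may be placed on each device (values are assumptions).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="balanced",
    max_memory={0: "22GiB", 1: "22GiB"},
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(model.hf_device_map)  # shows which layers landed on which GPU
```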

Without specific details on how the GPUs are being utilized within the launch_all function or elsewhere in the code related to the inference process, these are general recommendations. For a more precise solution, reviewing the implementation details of GPU utilization within the Langchain-Chatchat inference process is necessary.

For more information on how Langchain-Chatchat handles GPU workload distribution, you can refer to the server/llm_api_stale.py file in the repository. This file contains the code that allows for flexible GPU workload distribution by specifying which GPUs to use, limiting the number of GPUs, managing GPU memory usage, and controlling model concurrency. This is achieved through a combination of command-line arguments and the setting of environment variables.
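For reference, the environment-variable side of that mechanism usually comes down to CUDA_VISIBLE_DEVICES. The snippet below is a generic sketch of how a launcher pins which GPUs a worker process sees, not the exact code in server/llm_api_stale.py.

```python
import os

# Must be set before torch/transformers are imported by the worker process;
# the order of the IDs also decides which physical GPU becomes cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch  # imported after the env var so the restriction takes effect

print(torch.cuda.device_count())  # -> 2 on the dual-4090 setup
```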

I hope this helps! If you have any more questions or need further clarification, feel free to ask.


About Dosu: This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.

zRzRzRzRzRzRzR commented 5 months ago

Because the model is loaded through fschat, this is just how multi-GPU loading works in transformers.

ilovesouthpark commented 5 months ago

Make the change in fschat's model_adapter.py, inside def load_model. Give it another try; my modified file was overwritten when I upgraded fschat and I don't think I made a backup, so I'll have to debug it again, haha.
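For anyone attempting the same edit, below is a hedged sketch of the kind of change being described: make the kwargs that load_model passes to from_pretrained request a balanced split instead of a sequential fill. The function and argument names here are illustrative; fschat's model_adapter.py differs between releases, so treat this as an outline rather than a drop-in patch.

```python
from transformers import AutoModelForCausalLM


def load_model_balanced(model_path: str, num_gpus: int = 2):
    """Sketch of the edit: build from_pretrained kwargs so the checkpoint is
    split evenly across GPUs instead of filled card by card. Names and the
    22GiB cap are assumptions, not fschat's actual code."""
    kwargs = {"torch_dtype": "auto"}
    if num_gpus > 1:
        kwargs["device_map"] = "balanced"  # instead of "auto" / "sequential"
        # Cap per-GPU memory so neither 4090 is favoured:
        kwargs["max_memory"] = {i: "22GiB" for i in range(num_gpus)}
    return AutoModelForCausalLM.from_pretrained(model_path, **kwargs)


model = load_model_balanced("Qwen/Qwen1.5-32B-Chat-AWQ")
print(model.hf_device_map)  # confirm the layers are split evenly across GPUs
```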