chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly langchain-ChatGLM): a local-knowledge-based RAG and Agent app built with Langchain and LLMs such as ChatGLM, Qwen, and Llama
Apache License 2.0

[BUG] The program runs fine, but responses are extremely slow: output comes one character at a time and takes about 2 minutes #3549

Closed Mioooooo closed 3 months ago

Mioooooo commented 3 months ago

Problem Description: The program runs fine, but responses are extremely slow; output comes one character at a time and a full answer takes about 2 minutes. The GPUs are two 1066TI 6G cards, and the config file allocates 4G of GPU memory, yet GPU utilization stays at 0% during inference. Why? My version is 2.10 and CUDA is 12.1.

dosubot[bot] commented 3 months ago

Hey @Mioooooo, good to see you again! Hope everything is going well 😜

Hello Mioooooo,

Based on your description, the Langchain-Chatchat v0.2.10 application is responding very slowly, taking around 2 minutes to stream out an answer character by character, even though the system has two 1066TI 6G GPUs, 4G of GPU memory configured in the config file, and CUDA 12.1, and GPU utilization stays at 0% during computation. This is most likely because the application's configuration and code are not making proper use of the available GPU resources.

Here are several issues that could be causing the GPU to be under-utilized or not used at all:

  1. GPU memory configuration: the application allows GPU memory to be configured (args.max_gpu_memory = "22GiB"), but the system is configured with only 4G of GPU memory. This mismatch is probably not the direct cause of the slowdown, but it points to a possible configuration problem.

  2. GPU utilization: the code snippet does not explicitly show how CUDA or any GPU-accelerated library (such as TensorFlow or PyTorch) is used. Missing explicit GPU-utilization code (for example, setting the device to the GPU via CUDA) and relying on libraries that may default to the CPU can result in 0% GPU usage.

  3. Multiprocessing and AsyncIO: the application uses multiprocessing (import multiprocessing as mp) and AsyncIO (import asyncio), which are primarily CPU-bound. Without explicit code that offloads computation to the GPU, these operations will not touch the GPU, which matches the slowdown you are observing.

  4. GPU environment variables: the application sets an environment variable for NumExpr (os.environ["NUMEXPR_MAX_THREADS"] = str(n_cores)) but shows no comparable GPU configuration, such as selecting which GPUs to use or setting CUDA environment variables. This lack of GPU-specific environment configuration can leave the GPU under-used.

  5. Reliance on deprecated functionality: the @deprecated decorator indicates that the model-startup code will be rewritten in 0.3.x to support more modes and start up faster, which suggests the current version may not be optimized for performance, including GPU utilization.

  6. No explicit GPU assignment: the code does not show computations or models being explicitly placed on the GPU (for example, .to(device) in PyTorch or with tf.device('/GPU:0'): in TensorFlow). Without such explicit placement, computation may default to the CPU; a minimal sketch of what this looks like follows this list.
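As a minimal illustration of point 6 (a sketch only, not the application's actual code; the model path is a placeholder to replace with the one from your model_config):

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_path = "THUDM/chatglm3-6b"  # placeholder; use the path configured in model_config

    # Pick the GPU when one is visible, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().to(device)

    # If `device` ends up as "cpu", generation will be extremely slow,
    # which matches the symptoms reported in this issue.
    print("running on:", device)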

To address these issues and improve GPU utilization and overall performance, work through the points above and verify each one against your configuration.

I hope this information helps. If you have any other questions or need further clarification, please let me know.


Mioooooo commented 3 months ago

I haven't written any code; I simply started the Langchain-Chatchat v0.2.10 application. In theory GPU utilization should be high, but right now it is 0%. Did I get something wrong in the config file, or is my GPU simply too old to be used?

Mioooooo commented 3 months ago

    ==============================Langchain-Chatchat Configuration==============================
    Operating system: Windows-10-10.0.22621-SP0
    Python version: 3.11.8 (tags/v3.11.8:db85d51, Feb 6 2024, 22:03:32) [MSC v.1937 64 bit (AMD64)]
    Project version: v0.2.10
    langchain version: 0.0.354
    fastchat version: 0.2.35

    Current text splitter: ChineseRecursiveTextSplitter
    Currently running LLM models: ['chatglm3-6b'] @ cuda
    {'device': 'cuda', 'gpus': '0,1', 'host': '127.0.0.1', 'infer_turbo': False, 'max_gpu_memory': '4GiB', 'model_path': 'THUDM\chatglm3-6b', 'model_path_exists': True, 'num_gpus': 2, 'port': 20002}
    Current embeddings model: bge-large-zh-v1.5 @ cuda

Mioooooo commented 3 months ago

@dosu-bot

dosubot[bot] commented 3 months ago

The issue with the GPU usage being 0% despite the correct setup in Langchain-Chatchat v0.2.10 could stem from several factors related to the application's configuration and the environment it's running in. Here are some insights and potential solutions:

  1. Correct GPU Configuration: The application configuration indicates both GPUs are intended for use ('gpus': '0,1'). Ensure the GPU IDs match your system's configuration and that the application is correctly recognizing these IDs.

  2. Environment Variables: For GPU utilization, setting environment variables like CUDA_VISIBLE_DEVICES is crucial. Verify these are correctly set on your system so the application can actually see the GPUs (a quick check is sketched after this list).

  3. CUDA/cuDNN Compatibility: The application's ability to utilize GPUs heavily depends on the compatibility between CUDA, cuDNN, and the Python packages (e.g., TensorFlow, PyTorch) used for model inference. Ensure these libraries are compatible with each other to avoid GPU utilization issues.

  4. GPU-Supported Library Versions: Verify that the Python libraries used for model inference are installed with GPU support. PyTorch in particular needs a CUDA-enabled build (the default Windows wheel is CPU-only), and TensorFlow likewise needs a GPU-capable install, for any GPU acceleration to happen.
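As a quick way to verify points 2-4 (a hedged sketch; it assumes PyTorch is the inference backend, which is what fastchat's default model worker uses):

    import os

    # Must be set before torch initializes CUDA; the IDs should match `nvidia-smi`.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

    import torch

    print("CUDA available:", torch.cuda.is_available())  # False => CPU-only torch build
    print("Visible GPUs:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))

If this prints False or 0, the worker process cannot use the cards no matter how the application itself is configured.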

Regarding the infer_turbo configuration, setting it to False means the application uses fastchat's standard model worker rather than the vLLM-based worker; vLLM is a high-throughput inference engine, so this setting can have a noticeable effect on GPU usage and generation speed.

For further details, you can refer to the startup.py file and related issues in the Langchain-Chatchat repository. These resources might provide additional insights into configuring the application for optimal GPU usage.
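For reference, the keys that control GPU usage for the local model live in the FSCHAT_MODEL_WORKERS section of configs/server_config.py. A hedged sketch for the setup reported in this thread (two 6G cards); the values are assumptions to adapt, not verified settings:

    "chatglm3-6b": {
        "device": "cuda",
        "gpus": "0,1",              # GPU IDs visible to the model worker
        "num_gpus": 2,              # split the model across both cards
        "max_gpu_memory": "5GiB",   # per-GPU cap; leave headroom below the 6G physical limit
        "infer_turbo": False,       # True switches to the vLLM worker (GPU + Linux only)
    },

Note that chatglm3-6b in fp16 needs roughly 12-13 GB for its weights alone, so two 6G cards are very tight even when the model is split across both.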

If you have more questions or need further assistance, feel free to ask.


Mioooooo commented 3 months ago

@dosu-bot What does this error mean: WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting unk_token is not supported, use the default one

dosubot[bot] commented 3 months ago

The warning "WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting unk_token is not supported, use the default one" indicates that the tokenizer for the chatglm3-6b model is attempting to set a custom unk_token (unknown token), which is not supported, and it's defaulting to its predefined unk_token. The unk_token is used to represent tokens not found in the tokenizer's vocabulary. This warning typically arises when there's an attempt to customize the tokenizer configuration with an unsupported parameter.

To address this warning, ensure you're not setting the unk_token to a custom value in the tokenizer configuration for the chatglm3-6b model. If you're directly working with the tokenizer configuration or can modify the source code of the libraries you're using, look for any lines attempting to set unk_token and remove or adjust them accordingly.
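For example, loading the tokenizer without any special-token overrides avoids triggering the warning (a sketch assuming the standard transformers AutoTokenizer path; the model path is whatever your configuration points at):

    from transformers import AutoTokenizer

    # ChatGLM3's custom tokenizer does not support overriding unk_token and
    # silently falls back to its built-in default, which is exactly what the
    # warning reports; the warning itself is harmless.
    tokenizer = AutoTokenizer.from_pretrained(
        "THUDM/chatglm3-6b",     # or a local path such as THUDM\chatglm3-6b on Windows
        trust_remote_code=True,  # required for ChatGLM's custom tokenization code
    )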

This warning is likely coming from an external library or framework utilized in your project, as there's no direct reference to the transformers library or explicit tokenizer configuration in the provided context. If the warning is critical, consider reaching out to the maintainers of the relevant library or checking its documentation for guidance on tokenizer configuration.

For further assistance or if the warning persists, providing more information about your setup and the steps leading to the warning could help identify the source of the issue and offer a more accurate solution.


Hang-shao commented 3 months ago

Is this a bot?

oymclihui commented 3 months ago

I'm running into this too. I have 8 RTX 5000 cards with 16G each and see the same problem: answers come out at roughly one character per second and GPU utilization is 0%. In startup.py I modified the code as follows:

    else:
        from fastchat.serve.model_worker import app, GptqConfig, AWQConfig, ModelWorker, worker_id

        args.gpus = "0,1,2,3,4,5,6,7"  # GPU IDs; with multiple GPUs, e.g. "0,1,2,3"
        args.max_gpu_memory = "16GiB"
        args.num_gpus = 8  # the model worker splits the model across the GPUs (model parallelism); set this to the number of cards
oymclihui commented 3 months ago
    # The following are less commonly used model_worker parameters; configure them as needed
    "load_8bit": False, # enable 8-bit quantization
    "cpu_offloading": None,
    "gptq_ckpt": None,
    "gptq_wbits": 16,
    "gptq_groupsize": -1,
    "gptq_act_order": False,
    "awq_ckpt": None,
    "awq_wbits": 16,
    "awq_groupsize": -1,
    "model_names": "chatglm3-6b-32k",
    "conv_template": None,
    "limit_worker_concurrency": 5,
    "stream_interval": 2,
    "no_register": False,
    "embed_in_truncate": False,

    # The following are vllm_worker parameters; note that vLLM requires a GPU and has only been tested on Linux

    # tokenizer = model_path # add this here if the tokenizer differs from model_path
    'tokenizer_mode':'auto',
    'trust_remote_code':True,
    'download_dir':None,
    'load_format':'auto',
    'dtype':'auto',
    'seed':0,
    'worker_use_ray':False,
    'pipeline_parallel_size':1,
    'tensor_parallel_size':1,
    'block_size':16,
    'swap_space':4 , # GiB
    'gpu_memory_utilization':0.90,
    'max_num_batched_tokens':2560,
    'max_num_seqs':256,
    'disable_log_stats':False,
    'conv_template':None,
    'limit_worker_concurrency':5,
    'no_register':False,
    'num_gpus': 8,
    'engine_use_ray': False,
    'disable_log_requests': False

},
"chatglm3-6b-32k": {
    "device": "cuda",
},
"chatglm3-6b": {
    "device": "cuda",
},
"Qwen1.5-0.5B-Chat": {
    "device": "cuda",
},
# The following entries do not need to be changed; the models to launch are set in model_config
"zhipu-api": {
    "port": 21001,
},
"minimax-api": {
    "port": 21002,
},
"xinghuo-api": {
    "port": 21003,
},
"qianfan-api": {
    "port": 21004,
},
"fangzhou-api": {
    "port": 21005,
},
"qwen-api": {
    "port": 21006,
},
"baichuan-api": {
    "port": 21007,
},
"azure-api": {
    "port": 21008,
},
"tiangong-api": {
    "port": 21009,
},
"gemini-api": {
    "port": 21010,
},

}

    FSCHAT_CONTROLLER = {
        "host": DEFAULT_BIND_HOST,
        "port": 20001,
        "dispatch_method": "shortest_queue",
    }

Mioooooo commented 3 months ago


Thanks for the replies. In my case it's probably just plain insufficient GPU memory.