chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly langchain-ChatGLM), local knowledge based LLM (like ChatGLM, Qwen and Llama) RAG and Agent app with langchain
Apache License 2.0
29.69k stars 5.2k forks

[BUG] When loading multiple models at the same time, the API that lists the currently loaded models only shows one local model #2674

Closed strLights closed 5 months ago

strLights commented 5 months ago

Problem Description: After running the startup command, three local models are loaded, but only one local model is actually available.

Steps to Reproduce

  1. Run 'python startup.py -a'
  2. Open the API on port 7861 and check the currently loaded models; only the baichuan2-7b model is listed

Expected Result: All three local models should be listed.

Actual Result: Only the baichuan2-7b model is listed.

Environment Information

Additional Information

==============================Langchain-Chatchat Configuration==============================
OS: Linux-5.4.0-147-generic-x86_64-with-glibc2.31
Python version: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]
Project version: v0.2.9
langchain version: 0.0.352
fastchat version: 0.2.34

Current text splitter: ChineseRecursiveTextSplitter
Currently started LLM models: ['chatglm3-6b', 'Qwen-1_8B-Chat', 'baichuan2-7b', 'xinghuo-api', 'qwen-api'] @ cuda
{'device': 'cuda', 'gpus': '0,1', 'host': '0.0.0.0', 'infer_turbo': False, 'max_gpu_memory': '20GiB', 'model_path': '/data/ChatGLM2-6B/model/chatglm3-6b', 'model_path_exists': True, 'num_gpus': 2, 'port': 20002}
{'device': 'cuda', 'gpus': '0,1', 'host': '0.0.0.0', 'infer_turbo': False, 'max_gpu_memory': '20GiB', 'model_path': '/data/Ly/Langchain-Chatchat/model/Qwen-1_8B-Chat', 'model_path_exists': True, 'num_gpus': 2, 'port': 20002}
{'device': 'cuda', 'gpus': '0,1', 'host': '0.0.0.0', 'infer_turbo': False, 'max_gpu_memory': '20GiB', 'model_path': '/data/Ly/Langchain-Chatchat/model/Baichuan2-7B-Chat', 'model_path_exists': True, 'num_gpus': 2, 'port': 20002}
{'APISecret': 'ZTc2YzM5NDA4ODllMTQwYzEwODAzNWU1', 'APPID': '584105b7', 'api_key': '5f9ae5dfb950aba67e64b469e9c818e5', 'device': 'auto', 'gpus': '0,1', 'host': '0.0.0.0', 'infer_turbo': False, 'max_gpu_memory': '20GiB', 'num_gpus': 2, 'online_api': True, 'port': 21003, 'provider': 'XingHuoWorker', 'version': 'v1.5', 'worker_class': <class 'server.model_workers.xinghuo.XingHuoWorker'>}
{'api_key': 'sk-eb7f2ae611fb4444bb4eea86f0d64442', 'device': 'auto', 'embed_model': 'text-embedding-v1', 'gpus': '0,1', 'host': '0.0.0.0', 'infer_turbo': False, 'max_gpu_memory': '20GiB', 'num_gpus': 2, 'online_api': True, 'port': 21006, 'provider': 'QwenWorker', 'version': 'qwen-turbo', 'worker_class': <class 'server.model_workers.qwen.QwenWorker'>}
Current Embeddings model: text2vec-base @ cuda
==============================Langchain-Chatchat Configuration==============================

2024-01-16 11:23:42,325 - startup.py[line:651] - INFO: Starting services:
2024-01-16 11:23:42,325 - startup.py[line:652] - INFO: To view llm_api logs, go to /data/Ly/chatchat/new-version/v0.2.9/Langchain-Chatchat/logs
2024-01-16 11:23:46 | INFO | model_worker | Register to controller
2024-01-16 11:23:46 | INFO | model_worker | Register to controller
2024-01-16 11:23:46 | ERROR | stderr | INFO: Started server process [3658307]
2024-01-16 11:23:46 | ERROR | stderr | INFO: Waiting for application startup.
2024-01-16 11:23:46 | ERROR | stderr | INFO: Application startup complete.
2024-01-16 11:23:46 | ERROR | stderr | INFO: Uvicorn running on http://0.0.0.0:20000 (Press CTRL+C to quit)
2024-01-16 11:23:47 | INFO | model_worker | Loading the model ['baichuan2-7b'] on worker ebd0164d ...
2024-01-16 11:23:47 | INFO | model_worker | Loading the model ['chatglm3-6b'] on worker 6468aca7 ...
2024-01-16 11:23:47 | INFO | model_worker | Loading the model ['Qwen-1_8B-Chat'] on worker 25a67bb9 ...
2024-01-16 11:23:47 | WARNING | transformers_modules.Qwen-1_8B-Chat.modeling_qwen | Try importing flash-attention for faster inference...
(flash-attention import warnings and checkpoint-shard loading progress bars omitted; all shards finished loading at 100%)
2024-01-16 11:23:49 | INFO | model_worker | Register to controller
2024-01-16 11:23:51 | INFO | model_worker | Register to controller
ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 20002): address already in use
2024-01-16 11:24:05 | INFO | model_worker | Register to controller
ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 20002): address already in use
INFO: Started server process [3659487]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7861 (Press CTRL+C to quit)

From the error, port 20002 used by the local models is already occupied, so the remaining local models fail to load. How do I configure a separate port for each local model?
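The failure mode is visible in the configuration dump above: all three local model workers list the same 'port': 20002, so only the first worker can bind it and the rest hit [Errno 98]. A minimal stand-alone sketch of that bind conflict (using an OS-chosen free port instead of 20002 so it runs anywhere):

```python
import errno
import socket

def try_bind(host: str, port: int):
    """Try to bind and listen on (host, port); return (socket, error)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc

# First "worker" binds a free port, like the first model worker at startup.
first, _ = try_bind("127.0.0.1", 0)
port = first.getsockname()[1]

# Second "worker" is configured with the same port, like the other local
# models in the dump above -- its bind fails with "address already in use".
second, err = try_bind("127.0.0.1", port)
print(second is None)                 # True: the second bind failed
print(err.errno == errno.EADDRINUSE)  # True: same errno as the startup log
first.close()
```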

dosubot[bot] commented 5 months ago

🤖

Hello @strLights, I'm Dosu, a bot that can help you solve problems, answer questions, and even become a contributor. Although I'm a bot, I'm here to help you while you wait for a human maintainer. Glad to be of help!

From my search of the Langchain-Chatchat codebase, each model's port is assigned in the main block at the end of the worker script. In the context found, the port is assigned to the QianFanWorker model when uvicorn.run() is called.

Here is the relevant code:

if __name__ == "__main__":
    import sys  # needed below for the sys.modules patch
    import uvicorn
    from server.utils import MakeFastAPIOffline
    from fastchat.serve.model_worker import app

    worker = QianFanWorker(
        controller_addr="http://127.0.0.1:20001",
        worker_addr="http://127.0.0.1:21004",
    )
    # Replace fastchat's module-level worker so its endpoints use this one
    sys.modules["fastchat.serve.model_worker"].worker = worker
    MakeFastAPIOffline(app)
    uvicorn.run(app, port=21004)

Here, port 21004 is assigned to the QianFanWorker model. The uvicorn.run() call starts a Uvicorn server on the specified port.

However, to configure ports for multiple local models in Langchain-Chatchat, you need to modify the server configuration file (server_config.py.example). The provided context does not include that file, which is needed for a detailed solution.

Therefore, I cannot give a complete answer based on the given context. Please provide the server configuration file or its contents for a more accurate solution.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

zRzRzRzRzRzRzR commented 5 months ago

You need to configure it in server_config.
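The terse reply above points at the server config. A hypothetical sketch of what that change might look like, assuming the FSCHAT_MODEL_WORKERS dict in configs/server_config.py follows the structure of server_config.py.example (the key names here are an assumption; verify against your own copy of the example file):

```python
# Hypothetical excerpt of configs/server_config.py -- the point is that
# every local model worker needs its own 'port' instead of all of them
# inheriting the default 20002, which only one worker can bind.
FSCHAT_MODEL_WORKERS = {
    "default": {
        "host": "0.0.0.0",
        "port": 20002,  # only ONE worker may actually use this port
        "device": "cuda",
    },
    "chatglm3-6b": {},                  # inherits the default port 20002
    "Qwen-1_8B-Chat": {"port": 20003},  # distinct port per local model
    "baichuan2-7b": {"port": 20004},
}

def worker_port(model: str) -> int:
    """Resolve a worker's effective port, falling back to the default."""
    cfg = {**FSCHAT_MODEL_WORKERS["default"],
           **FSCHAT_MODEL_WORKERS.get(model, {})}
    return cfg["port"]

ports = [worker_port(m)
         for m in ("chatglm3-6b", "Qwen-1_8B-Chat", "baichuan2-7b")]
print(ports)                          # [20002, 20003, 20004]
print(len(set(ports)) == len(ports))  # True: no two workers share a port
```

With distinct ports, each local worker's Uvicorn server can bind successfully, so all three models register with the controller and show up in the model-listing API.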