chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications over a local knowledge base, built with Langchain and LLMs such as ChatGLM, Qwen, and Llama
Apache License 2.0

[BUG] Under concurrent calls, requests occasionally fail because the embedding model is not loaded #3899

Closed — sweetautumn closed this issue 5 months ago

sweetautumn commented 5 months ago

Problem Description
After starting the service with vllm acceleration enabled, concurrent calls fail because the embedding model has not been loaded.

Steps to Reproduce

1. Enable vllm acceleration:

```python
FSCHAT_MODEL_WORKERS = {
    "default": {
        "host": DEFAULT_BIND_HOST,
        "port": 30002,
        "device": LLM_DEVICE,
        "infer_turbo": 'vllm',

        "max_parallel_loading_workers": 3,
        "enforce_eager": False,
        "max_context_len_to_capture": 2048,
        "max_model_len": 2048,

        # Parameters required for multi-GPU loading in model_worker
        # "gpus": None,               # GPUs to use, as a string such as "0,1"; if this has no effect, set CUDA_VISIBLE_DEVICES="0,1" instead
        # "num_gpus": 1,              # number of GPUs to use
        # "max_gpu_memory": "20GiB",  # maximum VRAM used per GPU

        # Less commonly used model_worker parameters; configure as needed
        # "load_8bit": False,         # enable 8-bit quantization
        # "cpu_offloading": None,
        # "gptq_ckpt": None,
        # "gptq_wbits": 16,
        # "gptq_groupsize": -1,
        # "gptq_act_order": False,
        # "awq_ckpt": None,
        # "awq_wbits": 16,
        # "awq_groupsize": -1,
        # "model_names": LLM_MODELS,
        # "conv_template": None,
        # "limit_worker_concurrency": 5,
        # "stream_interval": 2,
        # "no_register": False,
        # "embed_in_truncate": False,

        # vllm_worker parameters; note that vllm requires a GPU and has only been tested on Linux
        # tokenizer = model_path      # set here if the tokenizer differs from model_path
        'tokenizer_mode': 'auto',
        'trust_remote_code': True,
        'download_dir': None,
        'load_format': 'auto',
        'dtype': 'auto',
        'seed': 0,
        'worker_use_ray': False,
        'pipeline_parallel_size': 1,
        'tensor_parallel_size': 1,
        'block_size': 16,
        'swap_space': 4,  # GiB
        'gpu_memory_utilization': 0.80,
        'max_num_batched_tokens': 2560,
        'max_num_seqs': 256,
        'disable_log_stats': False,
        'conv_template': None,
        'limit_worker_concurrency': 3,
        'no_register': False,
        'num_gpus': 1,
        'engine_use_ray': False,
        'disable_log_requests': False
    },
}
```

2. Start the service: `python startup.py -a`

3. Issue concurrent requests to the service from Python client code (a sketch of such a client follows).
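
The reporter's client code is not included. Below is a minimal sketch of what step 3 might look like, assuming the OpenAI-compatible endpoint that appears in the error log further down (`http://127.0.0.1:30000/v1/chat/completions`) and a placeholder model name; the reporter's actual calls presumably went through a knowledge-base chat endpoint, since that is the path that loads embeddings.

```python
# Hypothetical reproduction of step 3: fire several requests in parallel.
# The URL matches the one visible in the error log below; the model name
# and payload fields are assumptions for illustration.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

URL = "http://127.0.0.1:30000/v1/chat/completions"

def ask(question: str) -> str:
    payload = {
        "model": "chatglm3-6b",  # placeholder; use a name from LLM_MODELS
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    }
    resp = requests.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

questions = [f"test question {i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(ask, q) for q in questions]
    for fut in as_completed(futures):
        try:
            print(fut.result()[:80])
        except Exception as exc:  # some calls fail intermittently, as reported
            print("request failed:", exc)
```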

Expected Result
The generated answer is returned normally for every request.

Actual Result
Under concurrent requests the service sometimes completes every call normally; at other times some calls succeed while others fail with `AttributeError: 'NoneType' object has no attribute 'acquire'`. Full error output:

```
2024-04-26 07:13:11,178 - _client.py[line:1758] - INFO: HTTP Request: POST http://127.0.0.1:30000/v1/chat/completions "HTTP/1.1 200 OK"
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/sse_starlette/sse.py", line 269, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/sse_starlette/sse.py", line 258, in wrap
    await func()
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/sse_starlette/sse.py", line 215, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f1478689710

During handling of the above exception, another exception occurred:
```

Environment Information
(not provided)

Additional Information
(none provided)

dosubot[bot] commented 5 months ago

The error you're encountering, "AttributeError: 'NoneType' object has no attribute 'acquire'", indicates that the ThreadSafeFaiss object is not properly initialized at the time you're trying to acquire a lock on it. This problem arises when the load_vector_store method returns None instead of a valid ThreadSafeFaiss instance. To address this issue, ensure that the embeddings or vector stores are properly initialized and loaded before they are accessed. Here's a more focused approach to resolve the issue:

  1. Check Initialization and Loading: Confirm that the vector stores are correctly initialized and loaded before any access attempts. This involves ensuring that the load_vector_store method always returns a fully initialized ThreadSafeFaiss instance.

  2. Implement Error Handling and Logging: Add error handling and logging within the initialization and loading processes. This will help identify and troubleshoot issues more effectively.

  3. Ensure Object Readiness with wait_for_loading: Before using the embeddings or vector store, utilize the wait_for_loading method to ensure the object is fully loaded and ready for use.

  4. Review Thread Safety and Concurrency: Given the concurrent nature of your application, it's crucial to ensure that all operations on shared resources are thread-safe to avoid race conditions or deadlocks.

By following these steps, you should be able to mitigate the intermittent failures related to embeddings not being loaded when handling multiple concurrent requests with vllm acceleration; a sketch of the pattern follows.
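
As a concrete illustration of steps 1–4, here is a minimal, hypothetical sketch of the defensive pattern. The names `kb_faiss_pool`, `load_vector_store`, `wait_for_loading`, and `acquire` echo the kb_cache modules linked in the sources below, but the import path, the exact signatures, and the `get_vector_store_safely` helper are assumptions rather than the project's actual API.

```python
# Hypothetical sketch of steps 1-4: never assume load_vector_store succeeded,
# and wait for loading to finish before taking the lock. Names mirror
# server/knowledge_base/kb_cache/, but signatures here are assumptions.
from server.knowledge_base.kb_cache.faiss_cache import kb_faiss_pool  # assumed import path

def get_vector_store_safely(kb_name: str, vector_name: str, embed_model: str):
    vs = kb_faiss_pool.load_vector_store(kb_name=kb_name,
                                         vector_name=vector_name,
                                         embed_model=embed_model)
    if vs is None:  # step 2: fail loudly instead of crashing later on NoneType.acquire
        raise RuntimeError(f"vector store for {kb_name!r} failed to load")
    vs.wait_for_loading()  # step 3: block until the background load completes
    return vs

# step 4: all reads on the shared store happen under its lock
vs = get_vector_store_safely("samples", "vector_store", "bge-large-zh")
with vs.acquire():
    docs = vs.obj.similarity_search("query", k=3)  # vs.obj assumed to hold the FAISS store
```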

#### Details

**Sources**
- [This happens after enabling VLLM acceleration: the knowledge-base query returns results, but the model returns no answer](https://github.com/chatchat-space/Langchain-Chatchat/issues/3511)
- [server/knowledge_base/kb_service/faiss_kb_service.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/knowledge_base/kb_service/faiss_kb_service.py)
- [server/knowledge_base/kb_cache/faiss_cache.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/knowledge_base/kb_cache/faiss_cache.py)
- [server/knowledge_base/kb_cache/base.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/knowledge_base/kb_cache/base.py)
- [startup.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/startup.py)
- [configs/server_config.py.example](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/configs/server_config.py.example)

zRzRzRzRzRzRzR commented 5 months ago

This framework doesn't implement concurrency handling.

wzhty86 commented 4 months ago

> This framework doesn't implement concurrency handling.

Is there any plan to add concurrency handling?