chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications over a local knowledge base, built with Langchain and LLMs such as ChatGLM, Qwen, and Llama
Apache License 2.0

[BUG] Under concurrent calls, requests occasionally fail because the embedding model is not loaded #3899

Closed — sweetautumn closed this issue 5 months ago

sweetautumn commented 5 months ago

Problem Description
After starting the service with vllm acceleration enabled, concurrent calls fail because the embedding model has not been loaded.

Steps to Reproduce

1. Enable vllm acceleration:

```python
FSCHAT_MODEL_WORKERS = {
    "default": {
        "host": DEFAULT_BIND_HOST,
        "port": 30002,
        "device": LLM_DEVICE,
        "infer_turbo": 'vllm',

        "max_parallel_loading_workers": 3,
        "enforce_eager": False,
        "max_context_len_to_capture": 2048,
        "max_model_len": 2048,

        # Parameters required for multi-GPU loading in model_worker
        # "gpus": None,               # GPUs to use, as a string such as "0,1"; if this has no effect, set CUDA_VISIBLE_DEVICES="0,1" instead
        # "num_gpus": 1,              # number of GPUs to use
        # "max_gpu_memory": "20GiB",  # maximum VRAM used per GPU

        # Less commonly used model_worker parameters; configure as needed
        # "load_8bit": False,         # enable 8-bit quantization
        # "cpu_offloading": None,
        # "gptq_ckpt": None,
        # "gptq_wbits": 16,
        # "gptq_groupsize": -1,
        # "gptq_act_order": False,
        # "awq_ckpt": None,
        # "awq_wbits": 16,
        # "awq_groupsize": -1,
        # "model_names": LLM_MODELS,
        # "conv_template": None,
        # "limit_worker_concurrency": 5,
        # "stream_interval": 2,
        # "no_register": False,
        # "embed_in_truncate": False,

        # vllm_worker parameters; note that vllm requires a GPU and has only been tested on Linux
        # tokenizer = model_path      # set here if the tokenizer differs from model_path
        'tokenizer_mode': 'auto',
        'trust_remote_code': True,
        'download_dir': None,
        'load_format': 'auto',
        'dtype': 'auto',
        'seed': 0,
        'worker_use_ray': False,
        'pipeline_parallel_size': 1,
        'tensor_parallel_size': 1,
        'block_size': 16,
        'swap_space': 4,  # GiB
        'gpu_memory_utilization': 0.80,
        'max_num_batched_tokens': 2560,
        'max_num_seqs': 256,
        'disable_log_stats': False,
        'conv_template': None,
        'limit_worker_concurrency': 3,
        'no_register': False,
        'num_gpus': 1,
        'engine_use_ray': False,
        'disable_log_requests': False
    },
}
```

2. Start the service: `python startup.py -a`

3. Issue concurrent requests to the service from Python client code (a sketch of such a client follows).
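
The reporter's client code is not included. Below is a minimal sketch of what step 3 might look like, assuming the OpenAI-compatible endpoint that appears in the error log further down (`http://127.0.0.1:30000/v1/chat/completions`) and a placeholder model name; the reporter's actual calls presumably went through a knowledge-base chat endpoint, since that is the path that loads embeddings.

```python
# Hypothetical reproduction of step 3: fire several requests in parallel.
# The URL matches the one visible in the error log below; the model name
# and payload fields are assumptions for illustration.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

URL = "http://127.0.0.1:30000/v1/chat/completions"

def ask(question: str) -> str:
    payload = {
        "model": "chatglm3-6b",  # placeholder; use a name from LLM_MODELS
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    }
    resp = requests.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

questions = [f"test question {i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(ask, q) for q in questions]
    for fut in as_completed(futures):
        try:
            print(fut.result()[:80])
        except Exception as exc:  # some calls fail intermittently, as reported
            print("request failed:", exc)
```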

Expected Result
The generated answer is returned normally for every request.

Actual Result
Under concurrent requests the service sometimes completes every call normally; at other times some calls succeed while others fail with `AttributeError: 'NoneType' object has no attribute 'acquire'`. Full error output:

```
2024-04-26 07:13:11,178 - _client.py[line:1758] - INFO: HTTP Request: POST http://127.0.0.1:30000/v1/chat/completions "HTTP/1.1 200 OK"
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/sse_starlette/sse.py", line 269, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/sse_starlette/sse.py", line 258, in wrap
    await func()
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/sse_starlette/sse.py", line 215, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/home/algo/anaconda3/envs/chatfaqgpu/lib/python3.11/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f1478689710

During handling of the above exception, another exception occurred:
```

Environment Information
(not provided)

Additional Information
(none provided)

dosubot[bot] commented 5 months ago

The error you're encountering, "AttributeError: 'NoneType' object has no attribute 'acquire'", indicates that the ThreadSafeFaiss object is not properly initialized at the time you're trying to acquire a lock on it. This problem arises when the load_vector_store method returns None instead of a valid ThreadSafeFaiss instance. To address this issue, ensure that the embeddings or vector stores are properly initialized and loaded before they are accessed. Here's a more focused approach to resolve the issue:

  1. Check Initialization and Loading: Confirm that the vector stores are correctly initialized and loaded before any access attempts. This involves ensuring that the load_vector_store method always returns a fully initialized ThreadSafeFaiss instance.

  2. Implement Error Handling and Logging: Add error handling and logging within the initialization and loading processes. This will help identify and troubleshoot issues more effectively.

  3. Ensure Object Readiness with wait_for_loading: Before using the embeddings or vector store, utilize the wait_for_loading method to ensure the object is fully loaded and ready for use.

  4. Review Thread Safety and Concurrency: Given the concurrent nature of your application, it's crucial to ensure that all operations on shared resources are thread-safe to avoid race conditions or deadlocks.

By following these steps, you should be able to mitigate the intermittent failures related to embeddings not being loaded when handling multiple concurrent requests with vllm acceleration; a sketch of the pattern follows.
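
As a concrete illustration of steps 1–4, here is a minimal, hypothetical sketch of the defensive pattern. The names `kb_faiss_pool`, `load_vector_store`, `wait_for_loading`, and `acquire` echo the kb_cache modules linked in the sources below, but the import path, the exact signatures, and the `get_vector_store_safely` helper are assumptions rather than the project's actual API.

```python
# Hypothetical sketch of steps 1-4: never assume load_vector_store succeeded,
# and wait for loading to finish before taking the lock. Names mirror
# server/knowledge_base/kb_cache/, but signatures here are assumptions.
from server.knowledge_base.kb_cache.faiss_cache import kb_faiss_pool  # assumed import path

def get_vector_store_safely(kb_name: str, vector_name: str, embed_model: str):
    vs = kb_faiss_pool.load_vector_store(kb_name=kb_name,
                                         vector_name=vector_name,
                                         embed_model=embed_model)
    if vs is None:  # step 2: fail loudly instead of crashing later on NoneType.acquire
        raise RuntimeError(f"vector store for {kb_name!r} failed to load")
    vs.wait_for_loading()  # step 3: block until the background load completes
    return vs

# step 4: all reads on the shared store happen under its lock
vs = get_vector_store_safely("samples", "vector_store", "bge-large-zh")
with vs.acquire():
    docs = vs.obj.similarity_search("query", k=3)  # vs.obj assumed to hold the FAISS store
```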

#### Details

**Sources**
- [This happens after enabling VLLM acceleration: the knowledge-base query returns results, but the model returns no answer](https://github.com/chatchat-space/Langchain-Chatchat/issues/3511)
- [server/knowledge_base/kb_service/faiss_kb_service.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/knowledge_base/kb_service/faiss_kb_service.py)
- [server/knowledge_base/kb_cache/faiss_cache.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/knowledge_base/kb_cache/faiss_cache.py)
- [server/knowledge_base/kb_cache/base.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/knowledge_base/kb_cache/base.py)
- [startup.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/startup.py)
- [configs/server_config.py.example](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/configs/server_config.py.example)

zRzRzRzRzRzRzR commented 5 months ago

This framework doesn't implement concurrency handling.

wzhty86 commented 4 months ago

> This framework doesn't implement concurrency handling.

Is there any plan to add concurrency handling?