chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): a local-knowledge-based RAG and Agent application built with LangChain and LLMs such as ChatGLM, Qwen, and Llama
Apache License 2.0

Could someone explain why this project runs so fast? #3399

Closed · litongyun233 closed this issue 5 months ago

litongyun233 commented 5 months ago

When I test the chatGLM2-6b model on its own (no RAG, no embeddings, just a plain prompt sent to the model), a response takes about 4 s with 1,200 tokens and about 28 s with 3,800 tokens. But with this project, no matter how long the input is (i.e. no matter how many knowledge chunks I configure it to match), the model always starts streaming its answer within 1 s. What optimization makes the responses so much faster? I originally assumed context compression was cutting down the token count, but context compression only seems to happen in the rerank stage, and the speed is just as fast with rerank disabled.

dosubot[bot] commented 5 months ago

Hey @litongyun233, we meet again! Hopefully we can work through this new question together. 😜

The Langchain-Chatchat project implements several optimizations to ensure the model returns results within one second, regardless of token length or the number of knowledge matches configured. These optimizations include:

  1. Asynchronous processing: asynchronous functions (async/await) and asyncio tasks are used for non-blocking I/O operations. This is most visible in the knowledge_base_chat_iterator function, where database and network operations run asynchronously (for example, await run_in_threadpool(search_docs, ...)), so other tasks can proceed in parallel instead of waiting for those operations to finish (see the first sketch after this list).

  2. Background task execution: asyncio.create_task is used to run wrap_done(chain.acall({"context": context, "question": query}), callback.done) in the background. The system launches the task and immediately continues executing the following code instead of waiting for the task to complete, which improves response time.

  3. Server-Sent Events (SSE): EventSourceResponse from sse_starlette.sse streams the response to the client. The server can push updates as soon as data becomes available rather than waiting for the entire response to be ready. This is especially useful for streaming the model's output and improves perceived performance.

  4. Conditional use of the reranker: the reranker is applied only when the USE_RERANKER configuration option is enabled, which keeps the pipeline flexible. With reranking on, documents are compressed by their relevance to the query, potentially reducing the amount of data the model has to consider and speeding up the response. This is governed by the if USE_RERANKER: block, which activates the reranker only when the configuration asks for it (see the second sketch after this list).

  5. Efficient document handling: before documents are fed to the model, they are processed and, when reranking is used, compressed so that only the most relevant content is considered. This shows up in the line docs = reranker_model.compress_documents(documents=docs, query=query), which trims the input so the model can process it faster.

  6. Optimized prompt construction: prompts are built from templates and include only the most relevant history and context. This is handled by ChatPromptTemplate.from_messages, which assembles the prompt in a structured, efficient way and keeps the model's input concise and relevant.

Together, these optimizations allow the system to respond quickly even when handling complex queries or large numbers of documents.
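
To make points 1-3 (and the prompt construction in point 6) concrete, below is a heavily simplified sketch of the streaming pattern described above. Identifiers such as `wrap_done`, `search_docs`, `chain.acall`, `callback.done`, and `EventSourceResponse` are taken from the answer itself; the `/chat` route, the placeholder `search_docs` body, and the use of `ChatOpenAI` as the streaming model are assumptions made only to keep the sketch self-contained, not the project's actual code.

```python
import asyncio

from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import ChatPromptTemplate
from sse_starlette.sse import EventSourceResponse

app = FastAPI()


def search_docs(query: str, top_k: int = 3) -> list:
    """Placeholder for the project's synchronous knowledge-base retrieval."""
    return ["relevant chunk one ...", "relevant chunk two ..."][:top_k]


async def wrap_done(fn, event: asyncio.Event):
    """Await the chain call, then signal the streaming loop that it has finished."""
    try:
        await fn
    finally:
        event.set()


@app.get("/chat")
async def chat(query: str):
    # Tokens produced by the streaming model are pushed into this callback's queue.
    callback = AsyncIteratorCallbackHandler()
    model = ChatOpenAI(streaming=True, callbacks=[callback], temperature=0.7)

    # Point 6: a structured prompt keeps the model input concise and relevant.
    prompt = ChatPromptTemplate.from_messages(
        [("system", "Answer the question using this context:\n{context}"),
         ("human", "{question}")]
    )
    chain = LLMChain(prompt=prompt, llm=model)

    async def iterator():
        # Point 1: retrieval runs in a thread pool, so the event loop is not blocked.
        docs = await run_in_threadpool(search_docs, query)
        context = "\n".join(docs)

        # Point 2: generation is started as a background task instead of being
        # awaited, so streaming can begin as soon as the first token arrives.
        task = asyncio.create_task(
            wrap_done(chain.acall({"context": context, "question": query}),
                      callback.done)
        )

        # Yield tokens to the client the moment the callback receives them.
        async for token in callback.aiter():
            yield token
        await task

    # Point 3: Server-Sent Events push each token to the client immediately.
    return EventSourceResponse(iterator())
```

With this pattern the client starts receiving output as soon as the first token is generated, so perceived latency stays low even when the full completion takes much longer.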
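
Points 4 and 5 can be illustrated with a minimal stand-in for the reranking step. The flag and method names (`USE_RERANKER`, `compress_documents`) are quoted from the answer above; the `SimpleReranker` class, the cross-encoder model name, and the scoring details are assumptions made for the sake of a runnable sketch, not the project's actual reranker implementation.

```python
from sentence_transformers import CrossEncoder

USE_RERANKER = True                           # config flag quoted above
RERANKER_MODEL = "BAAI/bge-reranker-large"    # assumed cross-encoder model
TOP_N = 3                                     # how many documents to keep


class SimpleReranker:
    """Minimal stand-in exposing the compress_documents interface quoted above."""

    def __init__(self, model_name: str, top_n: int):
        self.encoder = CrossEncoder(model_name)
        self.top_n = top_n

    def compress_documents(self, documents: list, query: str) -> list:
        # Score every (query, document) pair with the cross-encoder, then keep
        # only the top_n highest-scoring documents for the prompt.
        scores = self.encoder.predict([(query, doc) for doc in documents])
        ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[: self.top_n]]


docs = ["chunk about topic A", "chunk about topic B", "chunk about topic C"]
query = "tell me about topic B"

if USE_RERANKER:
    reranker_model = SimpleReranker(RERANKER_MODEL, TOP_N)
    docs = reranker_model.compress_documents(documents=docs, query=query)

# docs now holds only the most relevant chunks, shrinking the prompt the model sees.
```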


#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.

litongyun233 commented 5 months ago

But even without rerank-based context compression, responses are still just as fast. Why is that? @dosu-bot

dosubot[bot] commented 5 months ago

The Langchain-Chatchat project keeps response times low through a combination of strategies that optimize processing and response generation without relying solely on rerank-based context compression. Here's how it achieves this:

  1. Asynchronous Processing and Background Task Execution: The project leverages asynchronous processing (async/await syntax) and background tasks (e.g., asyncio.create_task) to handle I/O-bound operations, such as document searching and language model inference, without blocking the main execution thread. This allows the server to handle other requests or tasks concurrently, improving overall throughput and responsiveness.

  2. Server-Sent Events (SSE): By using SSE (EventSourceResponse from sse_starlette.sse), the project can stream responses back to the client as soon as data is available. This approach is particularly useful for delivering partial responses incrementally, enhancing the perceived response time from the user's perspective.

  3. Conditional Use of Reranker Based on Configuration: The project conditionally applies document reranking (if USE_RERANKER:) based on the USE_RERANKER configuration option. This flexibility allows the system to bypass potentially time-consuming reranking processes when not needed, speeding up response times under certain configurations.

  4. Efficient Document Processing: Before potentially reranking documents, the project efficiently searches and selects the top-k relevant documents from the knowledge base (search_docs). This initial filtering reduces the amount of data that needs to be processed in later stages, whether for reranking or direct response generation (a generic version of this retrieval step is sketched below).

  5. Optimized Prompt Construction: The project constructs prompts for the language model in a structured manner, using templates (ChatPromptTemplate.from_messages) and efficiently concatenating historical and current query information. This structured approach ensures that the prompts are concise and relevant, which can help in reducing the processing time required by the language model.

  6. Use of Callbacks for Streaming: The project employs an AsyncIteratorCallbackHandler to facilitate streaming of the language model's output back to the client. This mechanism allows the system to yield intermediate results as they become available, rather than waiting for the entire response to be generated.

These strategies collectively contribute to the high response rate of the Langchain-Chatchat project by optimizing the end-to-end processing pipeline, from document retrieval and processing to response generation and delivery. You can find more details in the source code, specifically in the knowledge_base_chat.py and chat.py files.
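
As a generic illustration of the top-k filtering mentioned in point 4, a vector-store search with a score threshold might look like the sketch below. It is not the project's actual `search_docs` implementation; the embedding model, the sample texts, and the FAISS distance semantics (lower scores are closer matches) are assumptions.

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Assumed embedding model; the project configures its own embedding backend.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh")
store = FAISS.from_texts(
    ["chunk about ingestion ...", "chunk about streaming ...", "chunk about reranking ..."],
    embeddings,
)


def search_docs(query: str, top_k: int = 3, score_threshold: float = 1.0) -> list:
    """Return at most top_k chunks whose FAISS distance falls below the threshold."""
    hits = store.similarity_search_with_score(query, k=top_k)
    return [doc for doc, score in hits if score <= score_threshold]


print(search_docs("how does streaming work?"))
```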

Regarding the efficient document processing, Langchain-Chatchat employs several techniques and algorithms to efficiently search and select top-k relevant documents from the knowledge base. It uses API wrappers for Bing and DuckDuckGo to perform searches. For the Metaphor search engine, it performs a search using the Metaphor client, which is initialized with an API key. The search results are then processed to convert HTML to markdown using markdownify. If split_result is True, it employs a RecursiveCharacterTextSplitter to split the long text documents into smaller chunks based on predefined delimiters and a chunk size. After splitting, if the number of documents exceeds the desired result_len, it calculates the similarity score between the query text and each document's content using the NormalizedLevenshtein algorithm. This similarity score is used to sort the documents, and only the top-k documents (as specified by result_len) are selected. This step ensures that the most relevant documents are chosen based on textual similarity to the query. You can find more details in the search_engine_chat.py file.
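
A rough sketch of that splitting-and-ranking step follows, assuming the `strsimpy` implementation of `NormalizedLevenshtein`; the delimiters, chunk size, and sample inputs are placeholders rather than the actual values used in `search_engine_chat.py`.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from strsimpy.normalized_levenshtein import NormalizedLevenshtein


def top_k_by_similarity(query: str, texts: list, result_len: int = 3) -> list:
    # Split long search results into smaller chunks (delimiters and chunk size
    # here are placeholders, not the project's configured values).
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", "。", ". ", " "],
        chunk_size=500,
        chunk_overlap=0,
    )
    chunks = [chunk for text in texts for chunk in splitter.split_text(text)]

    # Score each chunk against the query; NormalizedLevenshtein.similarity
    # returns a value in [0, 1], with 1 meaning identical strings.
    scorer = NormalizedLevenshtein()
    ranked = sorted(chunks, key=lambda c: scorer.similarity(query, c), reverse=True)

    # Keep only the result_len most similar chunks.
    return ranked[:result_len]


docs = top_k_by_similarity(
    "what makes streaming responses fast?",
    ["long markdown converted from a search result ..."],
)
print(docs)
```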

I hope this explanation helps you understand how Langchain-Chatchat maintains high speed without the use of rerank context compression. If you have any more questions, feel free to ask.


zRzRzRzRzRzRzR commented 5 months ago

Time to first token has nothing to do with the overall length. Also, the max_token you passed in may have been 1024, in which case everything beyond that point is simply never read; the project's default max token is 1024.
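
If you want to rule out the max token setting, you can pass a larger limit explicitly when calling the API. The snippet below is a hypothetical illustration: the port, endpoint path, and field names reflect my understanding of the 0.2.x HTTP API and should be verified against your own deployment.

```python
import requests

# Assumed default API address and request fields for a local Langchain-Chatchat
# 0.2.x deployment; check the endpoint and parameter names against your version.
resp = requests.post(
    "http://127.0.0.1:7861/chat/knowledge_base_chat",
    json={
        "query": "your question here ...",
        "knowledge_base_name": "samples",
        "top_k": 3,
        "stream": False,
        "max_tokens": 4096,   # override the 1024-token default mentioned above
    },
)
print(resp.text)
```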