chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly langchain-ChatGLM): a local-knowledge-based RAG and Agent application built with LangChain for LLMs such as ChatGLM, Qwen, and Llama
Apache License 2.0

[FEATURE] Can the maximum length of a single text segment exceed 1000 characters? Where is it set? I changed it to 3000 in knowledge_base.py, but the frontend still shows a limit of 1000 #2501

Closed lvxian101 closed 10 months ago

lvxian101 commented 10 months ago

(screenshot omitted)

dosubot[bot] commented 10 months ago

🤖

Hello, @lvxian101! I'm Dosu, a bot here to help you solve problems, answer questions, and become a contributor. I'm assisting while you wait for a human maintainer. Nice to meet you!

In Langchain-Chatchat, the maximum length of a single text segment can be set in the knowledge_base_page function of the knowledge_base.py file. Specifically, it is set on this line:

chunk_size = cols[0].number_input("单段文本最大长度:", 1, 1000, CHUNK_SIZE)

The third argument of number_input is the maximum allowed value; here it is 1000. To raise the limit to 3000, change the line to:

chunk_size = cols[0].number_input("单段文本最大长度:", 1, 3000, CHUNK_SIZE)

However, if the frontend still shows a limit of 1000 after this change, the frontend code may set its own 1000 limit. You will need to find where the frontend sets it and raise it there as well. Without the frontend code, I cannot give more specific guidance.

Regarding your question, I found a similar resolved issue in the Langchain-Chatchat repository you can refer to: 请问生成文本长度限制在哪设置 ("Where is the generated-text length limit set?").

If you cannot find where the frontend sets the maximum character limit, or have any other questions, feel free to keep asking.
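The effect of the widget's maximum argument can be illustrated without Streamlit. The sketch below only mimics how a bounded number input clamps a value into its (min, max) range; it is not the project's code, just a standalone illustration of why the third argument must be raised before the UI will accept larger chunk sizes.

```python
def bounded_input(value: int, min_value: int, max_value: int) -> int:
    """Clamp `value` into [min_value, max_value], as a bounded widget does."""
    return max(min_value, min(value, max_value))

# With the original upper bound of 1000, a requested 3000 is clamped:
print(bounded_input(3000, 1, 1000))  # 1000
# After raising the upper bound to 3000, the value passes through:
print(bounded_input(3000, 1, 3000))  # 3000
```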

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

lvxian101 commented 10 months ago

@dosu-bot That is exactly the method I used, but it still says the maximum length must not exceed 1000.

dosubot[bot] commented 10 months ago

🤖

The maximum character limit for a single text segment in the frontend of Langchain-Chatchat is likely controlled by the max_tokens parameter in the file_chat and search_engine_chat methods in the webui_pages/utils.py file.

To increase the maximum character limit, you can modify the max_tokens parameter in these methods. Currently, it is set to None, which means it is using the default value set elsewhere in the code or the model's maximum limit. You can set it to a specific integer to limit the number of tokens.

Here is how you can do it:

def file_chat(
        self,
        query: str,
        knowledge_id: str,
        top_k: int = VECTOR_SEARCH_TOP_K,
        score_threshold: float = SCORE_THRESHOLD,
        history: List[Dict] = [],
        stream: bool = True,
        model: str = LLM_MODELS[0],
        temperature: float = TEMPERATURE,
        max_tokens: int = 3000,  # Increase this value as needed
        prompt_name: str = "default",
    ):
    ...

def search_engine_chat(
        self,
        query: str,
        search_engine_name: str,
        top_k: int = SEARCH_ENGINE_TOP_K,
        history: List[Dict] = [],
        stream: bool = True,
        model: str = LLM_MODELS[0],
        temperature: float = TEMPERATURE,
        max_tokens: int = 3000,  # Increase this value as needed
        prompt_name: str = "default",
        split_result: bool = False,
    ):
    ...

Please note that increasing the max_tokens parameter may affect the performance and memory usage of the application, especially for very large values. You can find these methods in the webui_pages/utils.py file.

As for a separate frontend codebase where the character limit could be set, I wasn't able to find an answer within the repository. It's possible that the answer may be available elsewhere or I could have missed it. If you have more information about this, it would be helpful in providing a more accurate answer.

Sources



lvxian101 commented 10 months ago

@dosu-bot That is how to change the maximum length of a single text segment enforced during document vectorization, i.e. the maximum length of a single entry in the vector store. It is not the maximum text length of the chat output.

dosubot[bot] commented 10 months ago

🤖

To increase the maximum text length of the chat output in Langchain-Chatchat, you need to modify the max_tokens parameter in the knowledge_base_chat function. This parameter limits the number of tokens generated by the LLM model in a single response.

Here is how you can modify it:

async def knowledge_base_chat(query: str = Body(..., description="用户输入", examples=["你好"]),
                              ...
                              max_tokens: Optional[int] = Body(
                                  1000,  # Increase this value to increase the maximum text length of the chat output
                                  description="限制LLM生成Token数量,默认None代表模型最大值"
                              ),
                              ...
                              ):
    ...

Please replace 1000 with the desired maximum number of tokens.

Please note that the max_tokens parameter does not affect the maximum length of a single file in the vector library during document vectorization, nor does it affect the total length of the chat history. If you want to increase the maximum length of a single file in the vector library during document vectorization, you would need to modify the document vectorization process, which is not shown in the provided context.

The default value of the max_tokens parameter in the knowledge_base_chat function in the Langchain-Chatchat codebase is None, which means the model's maximum value is used.

You can find more details in the server/chat/knowledge_base_chat.py file in the Langchain-Chatchat repository.
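The None fallback described above can be sketched in isolation. In the snippet below, MODEL_MAX_TOKENS is a hypothetical constant standing in for the model's real ceiling; the project resolves this elsewhere, so this only illustrates the behaviour, not the actual code.

```python
from typing import Optional

# Hypothetical ceiling standing in for the model's real token limit;
# the actual value depends on which LLM is being served.
MODEL_MAX_TOKENS = 4096

def resolve_max_tokens(max_tokens: Optional[int]) -> int:
    """None means 'use the model's maximum'; an int is capped at that maximum."""
    if max_tokens is None:
        return MODEL_MAX_TOKENS
    return min(max_tokens, MODEL_MAX_TOKENS)

print(resolve_max_tokens(None))  # 4096
print(resolve_max_tokens(1000))  # 1000
```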

Sources



lvxian101 commented 10 months ago

@dosu-bot I am asking how to change the maximum length of a single text segment enforced during document vectorization, i.e. the maximum length of a single entry in the vector store.

dosubot[bot] commented 10 months ago

🤖

To increase the maximum text length for a single file from 1000 to 3000 characters during document vectorization in Langchain-Chatchat, you need to modify the chunk_size parameter in the knowledge_base_page function. This function is located in the webui_pages/knowledge_base/knowledge_base.py file.

Here is the part of the code you need to modify:

with st.expander(
        "文件处理配置",
        expanded=True,
):
    cols = st.columns(3)
    chunk_size = cols[0].number_input("单段文本最大长度:", 1, 3000, CHUNK_SIZE)
    chunk_overlap = cols[1].number_input("相邻文本重合长度:", 0, chunk_size, OVERLAP_SIZE)
    cols[2].write("")
    cols[2].write("")
    zh_title_enhance = cols[2].checkbox("开启中文标题加强", ZH_TITLE_ENHANCE)

In the above code, the number_input function is used to create a number input widget in the Streamlit app. The first argument is the label of the widget, the second and third arguments are the minimum and maximum values that the user can input, and the fourth argument is the default value. By changing the third argument from 1000 to 3000, you allow the user to input a maximum text length of up to 3000 characters.

Please note that this change will only affect the maximum text length that the user can input in the frontend. If you want to change the default value as well, you need to modify the CHUNK_SIZE variable in the configs module.
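To make the roles of chunk_size and chunk_overlap concrete, here is a deliberately simplified character-based splitter. The project actually uses LangChain text splitters, which are more sophisticated; this sketch only shows how the two parameters interact during chunking.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split `text` into chunks of at most `chunk_size` characters,
    with consecutive chunks sharing `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2500-character document with chunk_size=1000 and overlap=100:
chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```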

This solution is based on a similar issue that was solved in the past: 知识库召回文本最大字数好像设置不了 ("the maximum number of characters recalled from the knowledge base does not seem to be configurable").

I hope this helps! If you have any other questions, feel free to ask.

Sources



lvxian101 commented 10 months ago


Solved. This approach works; I probably had not uploaded the updated code at the time.

dosubot[bot] commented 10 months ago

🤖

Glad to hear the approach helped! If you need any further assistance or have other questions, feel free to ask anytime. Good luck with your work!



jhw0510 commented 10 months ago

@lvxian101 Hello, I'd like to ask: I set the maximum single-segment length to 1000, so knowledge base files should be split into chunks of 1000, right? But in knowledge base Q&A, the knowledge base matches the model shows are still close to the project default of 250. I don't know where the problem is. Here is my screenshot:

(screenshot omitted)
lvxian101 commented 10 months ago

You need to delete the old vector chunks from the vector store first.


jhw0510 commented 10 months ago


Thanks for the reply. I solved it last week; it was a small bug in the project. See #2561 and #2564 for details.