chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly langchain-ChatGLM), local knowledge based LLM (like ChatGLM, Qwen and Llama) RAG and Agent app with langchain
Apache License 2.0

Knowledge-base Q&A: looking for LLM model recommendations #3881

Closed zixiaotan21 closed 4 months ago

zixiaotan21 commented 4 months ago

Problem Description: When running knowledge-base Q&A with locally deployed qwen-1.5-7B-chat and qwen-1.5-14B-chat, the answer quality is poor, while qwen-1.5-72B-chat runs too slowly.

Could anyone here recommend an LLM model that works well for this?

dosubot[bot] commented 4 months ago

For knowledge-base Q&A, besides the qwen series you have already tried, a number of other models are available that may give you a better balance of answer quality and running speed.

To speed up model inference, consider optimizing the configuration. For example, setting LLM_DEVICE to "auto" or "cuda" (if you have an NVIDIA GPU) lets the model run on the most suitable hardware. You can also tune the MAX_TOKENS parameter to your needs and hardware; reducing the token budget shortens response time.
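As a rough illustration only, these switches live in configs/model_config.py in the 0.2.x layout; exact key names vary between releases, so treat this as a sketch rather than a drop-in config:

```python
# configs/model_config.py -- sketch only; check the model_config.py.example
# shipped with your release, since key names differ between versions.

# Run the LLM on the most suitable device; "cuda" pins it to an NVIDIA GPU.
LLM_DEVICE = "auto"   # or "cuda", "mps", "cpu"

# Upper bound on tokens generated per reply; a smaller budget answers faster
# at the cost of shorter, less detailed responses.
MAX_TOKENS = 2048
```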

I hope these suggestions help you find an LLM model that fits your needs.

#### Details

**Sources**

- [[BUG] How to correctly configure OpenAI LLM models in chatchat](https://github.com/chatchat-space/Langchain-Chatchat/issues/2418)
- [How to improve the quality and accuracy of knowledge-base Q&A answers; please share your experience](https://github.com/chatchat-space/Langchain-Chatchat/issues/844)
- [Replies currently take about 20 seconds; how can the speed be improved? V100 32G server.](https://github.com/chatchat-space/Langchain-Chatchat/issues/493)
- [[FEATURE] Suggest changing the default model to qwen-chat-7b to support a larger context token count](https://github.com/chatchat-space/Langchain-Chatchat/issues/1184)
- [configs/model_config.py.example](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/configs/model_config.py.example)

To continue the conversation, mention @dosu.

zixiaotan21 commented 4 months ago

@dosu

How should the MAX_TOKENS parameter be set?

dosubot[bot] commented 4 months ago

To set the MAX_TOKENS parameter effectively, balancing response time against answer quality, consider the following:

1. Model and Hardware Limits: Choose a MAX_TOKENS value that your hardware can handle efficiently, especially if using larger models like qwen-1.5-14B-chat. Smaller models or better hardware allow for a higher MAX_TOKENS.

2. Response Time vs. Quality:
   - For quicker responses, reduce MAX_TOKENS. This decreases computation but may limit response depth.
   - For more detailed answers, increase MAX_TOKENS. This requires more processing time.

3. Optimization Techniques: Use model optimization strategies like quantization and pruning to improve processing times without greatly affecting quality.

4. Dynamic Adjustment: Implement logic to adjust MAX_TOKENS based on query complexity or system load, optimizing for performance or detail as needed (a sketch follows below).

5. Monitoring and Adjustments: Continuously monitor performance and adjust MAX_TOKENS based on feedback to find the optimal balance for your application.

Start with a MAX_TOKENS value that aligns with your hardware capabilities and adjust based on performance observations and application needs.
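For point 4, a purely hypothetical sketch (this helper and its thresholds are not part of Langchain-Chatchat) of picking a per-request token budget from the query length before sending it to the chat endpoint:

```python
def choose_max_tokens(query: str, base: int = 512, cap: int = 2048) -> int:
    """Hypothetical helper: scale the generation budget with query complexity."""
    words = len(query.split())
    if words < 20:
        return base                # short, factual lookups answer quickly
    if words < 80:
        return min(base * 2, cap)  # typical questions
    return cap                     # long or multi-part questions


# Example: pass the result as the max_tokens field of a chat request.
print(choose_max_tokens("What context length does qwen-1.5-14B-chat support?"))  # -> 512
```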

#### Details

**Sources**

- [configs/model_config.py.example](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/configs/model_config.py.example)
- [server/chat/agent_chat.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/chat/agent_chat.py)
- [server/chat/completion.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/chat/completion.py)
- [server/chat/chat.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/chat/chat.py)

To continue the conversation, mention @dosu.

Zephyr69 commented 4 months ago

This is most likely not a problem with the model, but with how your knowledge base is structured and configured.

zixiaotan21 commented 4 months ago

> This is most likely not a problem with the model, but with how your knowledge base is structured and configured.

1. My knowledge base looks like this:
```markdown
# Question 1

## Question
XXX

## Answer
XXX

# Question 2

## Question
XXX

## Answer
XXX
```

The text splitter is MarkdownHeaderTextSplitter, so the different Q&A pairs (QA1 and QA2) are split into separate vectors (a minimal usage sketch follows after this list).

When I actually ask Question 1, Answer 1 is indeed retrieved, but the model does not answer strictly according to Answer 1 and adds some fabricated content. Is this because my model's reasoning ability is insufficient?

2. As for the setup, I am running on an RTX 4090, and the model I am currently using is qwen-1.5-14B-chat.
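As referenced above, a minimal sketch of that splitter setup using LangChain's MarkdownHeaderTextSplitter (the import path depends on your LangChain version); splitting only on the top-level `#` header keeps each Q&A pair in a single chunk:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter  # langchain_text_splitters in newer releases

kb_markdown = """# Question 1

## Question
XXX

## Answer
XXX

# Question 2

## Question
XXX

## Answer
XXX
"""

# Split on the top-level header only, so each "# Question N" section
# (question plus answer) becomes one chunk / one vector.
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "qa_pair")])
docs = splitter.split_text(kb_markdown)
for doc in docs:
    print(doc.metadata, "->", doc.page_content[:40])
```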
Zephyr69 commented 4 months ago

> When I actually ask Question 1, Answer 1 is indeed retrieved, but the model does not answer strictly according to Answer 1 and adds some fabricated content. Is this because my model's reasoning ability is insufficient?

If the model keeps adding content of its own, lower the temperature for this mode to something like 0.01 or 0.1.
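For example, assuming the 0.2.x HTTP API and its /chat/knowledge_base_chat endpoint (the field names and knowledge-base name below are illustrative and may differ in your deployment), a low temperature could be passed per request like this:

```python
import requests

# Hypothetical request against a locally running Langchain-Chatchat API server.
resp = requests.post(
    "http://127.0.0.1:7861/chat/knowledge_base_chat",
    json={
        "query": "Question 1",
        "knowledge_base_name": "my_kb",    # illustrative knowledge-base name
        "model_name": "qwen1.5-14b-chat",  # must match the name configured in model_config.py
        "temperature": 0.1,                # low temperature keeps answers close to the retrieved text
        "stream": False,
    },
)
print(resp.text)
```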

zixiaotan21 commented 4 months ago

> If the model keeps adding content of its own, lower the temperature for this mode to something like 0.01 or 0.1.

Thanks for the advice! Lowering the temperature does indeed make the answers a bit better.