Hey @4ART1ST, great to see you diving into another intriguing project! How's everything going?
Based on the information provided in the Langchain-Chatchat repository, the ernie-base model's poor performance in answering questions or matching documents effectively could be due to several factors:

- the model's configuration and parameters;
- the quality and relevance of the training data;
- the document retrieval and reranking logic;
- the construction of the prompt and the context provided to the model;
- the use of a reranker;
- external factors such as API limitations or network latency;
- a lack of comprehensive error handling and debugging information;
- the model version and updates.

You can find more detailed information about these potential issues in the `knowledge_base_chat.py` file.
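To narrow down which of these factors is responsible, one quick check is to query the vector store directly and look at the raw matches and scores, bypassing the prompt and LLM entirely. The snippet below is a minimal sketch: the model name, sample passages, and import paths are placeholders and may differ depending on your langchain version and setup.

```python
# Minimal retrieval sanity check, independent of the LLM layer.
# Assumes langchain, faiss-cpu, and sentence-transformers are installed;
# import paths may differ between langchain versions.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Hypothetical sample passages standing in for your knowledge-base chunks.
docs = [
    "第三十六条 监护人有下列情形之一的,人民法院根据有关个人或者组织的申请,撤销其监护人资格……",
    "第一百三十一条 民事主体行使权利时,应当履行法律规定的和当事人约定的义务。",
    "第三百六十七条 居住权合同一般包括下列条款……",
]

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",  # swap in the model you are evaluating
    model_kwargs={"device": "cpu"},
)
store = FAISS.from_texts(docs, embeddings)

# Inspect the raw top-k hits and their distances; if the correct chunk never
# appears here, the problem is in the embedding/retrieval step rather than in
# the prompt construction or the LLM.
for doc, score in store.similarity_search_with_score("第三十六条是什么", k=3):
    print(f"{score:.4f}  {doc.page_content[:40]}")
```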
As for the gte and stella models, I couldn't find any specific requirements or dependencies for these models in the Langchain-Chatchat codebase. It's possible that these models are not currently supported or that the information is not readily available in the repository.
Regarding the nlp_gte model from Damo Academy, its entry in the Langchain-Chatchat codebase lives in the `MODEL_PATH` dictionary, under the `embed_model` key, in `configs/model_config.py.example`. No explicit dependencies are listed specifically for the nlp_gte model beyond what is generally required to run the Langchain-Chatchat codebase. The configuration also indicates that models can be run on various devices (`"cuda"`, `"mps"`, `"cpu"`, `"xpu"`), as specified by the `EMBEDDING_DEVICE` and `LLM_DEVICE` settings, implying a dependency on a compatible hardware and software setup for those devices.
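For reference, the `embed_model` section of `configs/model_config.py.example` usually looks something like the excerpt below. The entry names and repository paths shown here are illustrative assumptions, so check them against your own copy of the file.

```python
# Illustrative excerpt in the style of configs/model_config.py.example;
# entry names and repo paths are examples, not the authoritative file contents.
EMBEDDING_DEVICE = "auto"   # "cuda", "mps", "cpu", "xpu", or "auto"
LLM_DEVICE = "auto"

MODEL_PATH = {
    "embed_model": {
        "m3e-base": "moka-ai/m3e-base",
        "bge-large-zh-v1.5": "BAAI/bge-large-zh-v1.5",
        # Damo Academy GTE model served through ModelScope (path is an example)
        "nlp_gte_sentence-embedding_chinese-large": "damo/nlp_gte_sentence-embedding_chinese-large",
    },
    # "llm_model": {...}
}
```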
As for the 'base' version of the nlp_gte model, I couldn't find any specific information about its deployability in the Langchain-Chatchat codebase. It's possible that the 'base' version could be used, but this would likely depend on the specific requirements of your project and the compatibility of the 'base' version with the Langchain-Chatchat codebase.
I hope this information helps! If you have any further questions or need more detailed information, please let me know.
@4ART1ST Did your comparison produce any results? I'm using bge-large-zh-v1.5 and it doesn't feel very good. I built a vector store from the Civil Code, and searching "what is Article 36" (第三十六条) returns the wrong results. Looking at the matches, the top hits are Article 131, Article 367, and so on, but never Article 36. It seems the embedding model's feature representation isn't good enough. Is there a better embedding model? I'd rather not use OpenAI's; I want a local, open-source one.
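The article-number failure can be probed directly with cosine similarity, without building the whole knowledge base. The sketch below is a hypothetical check using sentence-transformers; the model name and article texts are placeholders for whatever you are evaluating.

```python
# Quick probe of how a candidate embedding model scores an article-number query.
# Assumes sentence-transformers is installed; swap in the model you want to test.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

query = "第三十六条是什么"
articles = {
    "第三十六条": "第三十六条 监护人有下列情形之一的,人民法院根据申请撤销其监护人资格……",
    "第一百三十一条": "第一百三十一条 民事主体行使权利时,应当履行法律规定的和当事人约定的义务。",
    "第三百六十七条": "第三百六十七条 居住权合同一般包括下列条款……",
}

q = model.encode(query, normalize_embeddings=True)
for name, text in articles.items():
    a = model.encode(text, normalize_embeddings=True)
    print(name, float(util.cos_sim(q, a)))
# If 第三十六条 does not score highest, the query is essentially an exact-match
# lookup that semantic embeddings handle poorly; a keyword/BM25 retriever mixed
# into the pipeline may help more than switching embedding models.
```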
You could try m3e; my m3e-base works a bit better than bge-base.
m3e-base is better, but you need an LLM trained primarily on Chinese. I compared qwen-max and GPT-4-preview, and qwen felt better.
I'm comparing different embedding models and have finished m3e-base, bge-base, text2vec-paragraph, and piccolo-base. I planned to add ernie-base, but found it basically can't answer anything and never matches the correct documents; I don't know why it performs so poorly. I also found that Langchain doesn't support gte or stella, so the only remaining option is the Damo Academy model. Can nlp_gte only be deployed as the large version? Has anyone tried base? Since this is a head-to-head comparison, base feels fairer.
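If you want to try the base variant outside of Langchain-Chatchat first, a sketch using the ModelScope pipeline is below. The model ID and the input/output format are my assumptions based on how ModelScope sentence-embedding pipelines are typically invoked, so verify them against the model card before relying on this.

```python
# Hedged sketch: loading the (assumed) base variant of the Damo Academy GTE
# model through ModelScope. Model ID and input keys should be checked against
# the ModelScope model card.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

pipe = pipeline(
    Tasks.sentence_embedding,
    model="damo/nlp_gte_sentence-embedding_chinese-base",  # assumed model ID
)

inputs = {
    "source_sentence": ["第三十六条是什么"],
    "sentences_to_compare": [
        "第三十六条 监护人有下列情形之一的,人民法院根据申请撤销其监护人资格……",
        "第一百三十一条 民事主体行使权利时,应当履行法律规定的和当事人约定的义务。",
    ],
}

result = pipe(input=inputs)
print(result)  # typically contains embeddings and similarity scores
```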