innovation64 commented 11 months ago

Chat with Repo 项目要求

核心概念

目标: 创建一个能够与代码仓库进行交互的聊天系统。
总结 : 根据用户的问题做匹配，匹配到对应的文档、代码、引用关系。然后将匹配到的结果送给大模型，让大模型去做思考，最后生成回答。
灵感来源: LangChian 的 RAG over code | 🦜️🔗 Langchain

具体要求

动态更新文档chunks对应的向量
- 文档变更监控: 由于MD文件内容可能会频繁变更，系统必须使用工具监控文档的更改，并相应地更新文档chunks的向量表示。
- 向量存储与版本控制: 必须有一个高效的向量存储系统来管理文档与其向量表示之间的一致性。
组织文档和代码块
- 检索方法: 进行embedding search，将用户查询转换成向量，再与向量数据库中的内容进行比较。基于相似度选择最相关的几个块进行返回，确保回答的准确性和相关性。
代码整合
- 加入原始代码: 文档对应的代码也应进行向量化处理，以便整合到检索过程中。
- 多路召回: 除了向量化处理外，还包括关键字检索等传统搜索方法，以及可能的语义搜索和模式匹配技术，以增强检索的全面性和准确度。
处理引用关系
- 代码块引用关系: 召回的代码块应包含其在项目中的具体位置信息，以及它与其他代码块或文档的引用关系。这有助于构建更加全面和连贯的上下文。（当前已实现）
大模型的总结与回答
- 综合回答: 大模型应该能够进行综合分析，理解复杂的代码和文档关系，并基于召回的内容形成对用户查询的综合回答。

innovation64 commented 11 months ago

29th Dec

Waiting for completion

Aiming group

developers who used the repo(senior engineer)
novice with code(only want to know how to use it )
troublemaker (other filed people,maybe will list some fool questions which don't allow basic rules)

Searching and summary

Some tools and bags

vector db: Weaviate:
base LLM inference: GPT4 API
acquire the whole repo: AST tool

Technical route

Explaination

so this is a basic RAG sys with chat repo

First we need to deal with the user's query

when the user sends a query to the chatbot, we need base history and the query, to judge if it fits the rules. if it fits the rules, we will compress the history find the relevant info about this query, and plus it. chunk and vectorize them, send them to vector db retrieval relevant repo and md vector. ---we call it setp1 if not fit the rules, send it to LLM to rebuild the query too short enlarge it too long summary it not logic, smooth it, change it in a better way. then repeat step 1.

Second chunk the repo files

there are lots of files, we need to summarize, plus the MD(repoAgent generated) file as a summary index Then put the whole project chunk into detail as a whole vector store

Matching

Step1's outputs retrieval with a summarized index vector(filter function), then matching the detail vector.

Reranking

find the top relevant infor

Result

send the relevant infor and query to the LLM ,sign it the source id/hook. generated an answer with the reference.

Details with code

Users query
- define a prompt: what is a good question ?
  
  LLM functioncall outputs a good question
- then compress history diolouge
  
  LLM functioncall depresses it (tokens based on the LLM model capability)
- vectorize llamaindex chunk the content voyager SEO for vectorizing send it to the vector db
- searching index current algorithms:
  - flat index
  - faiss，nmslib,knn,fixed tree and HNSW selected db solution:
- Chroma:
  - Focus: May specialize in specific types of data handling or querying capabilities.
  - Use Cases: Ideal for scenarios where its specialized focus aligns with your project needs.
  - Performance: Depending on its specialization, it might offer enhanced performance in certain tasks.
  - Integration and Scalability: How well it integrates with existing systems and scales with data growth.
- Pinecone:
  - Focus: Known for its efficient similarity search in high-dimensional spaces.
  - Use Cases: Excellent for recommendation systems, image and text similarity searches.
  - Performance: Offers high-performance similarity search with low latency.
  - Integration and Scalability: Easily integrates with various data pipelines and scales well.
- Weaviate:
  - Focus: Emphasizes semantic search and automatic classification using machine learning models.
  - Use Cases: Suitable for projects requiring semantic understanding, like natural language processing tasks.
  - Performance: Good at handling complex queries with a semantic component.
  - Integration and Scalability: Supports GraphQL, RESTful APIs, and has good scalability.
repo part separate it into two parts
- MD
- .py
  - Hierarchical indices(just for .py)
Reranking & filtering llamaindex
topk
fed to LLM put the query and relevant info with retrieval outputs to LLM, remember to sign .py vector source
result

generated a format result with reference

ps: answer by GPT4 对于代码生成和匹配的场景，选择合适的向量数据库需要考虑几个关键因素，如数据类型、查询性能、以及对特定代码语言结构的支持。根据您的描述，这里是对 Chroma、Pinecone 和 Weaviate 在这方面的潜在适用性的分析：

Chroma:

如果 Chroma 对代码数据类型有特殊处理或优化，它可能适合处理代码库。

评估它在处理和索引代码相关数据方面的效能。

Pinecone:

Pinecone 在处理高维空间数据时的性能优势可能对于代码片段的相似性搜索很有帮助。

考虑其在快速检索大量代码数据时的能力。

Weaviate:

Weaviate 的语义搜索能力可能对理解和匹配代码的语义结构特别有用。

如果您的应用涉及到理解代码的意图或功能，Weaviate 可能更合适。

针对代码生成和匹配选择数据库：

语义理解：如果您的应用需要深入理解代码的意图或功能，具有强大语义处理能力的数据库（如 Weaviate）可能更适合。

相似性搜索：如果您关注于代码片段之间的相似性匹配，那么专注于相似性搜索的数据库（如 Pinecone）可能是更好的选择。

数据处理和索引：考虑数据库处理和索引大量代码数据的能力，以及对特定编程语言的支持。

在做出决定之前，您可能还需要考虑其他因素，如易用性、集成性、以及社区和技术支持。建议对有潜力的数据库进行小规模的试验，以评估它们在实际代码生成和匹配场景中的表现。

                                                                                                                                    developer: Yang Lee

innovation64 commented 11 months ago

RAG part

langchain
llamaindex(recommended)

base RepoAgent part

current logic

use local file
os aquire the whole fold and sub files
select the .py files
base each file generate a md to explain the detail of each file(md name as py's name)
whole section triggered by commit to run script

some problem

wired to read this kind of format for each file explanation
no rebuild to construct as a reasonable way, just put it in to a single file
ect...

Competitive product analysis

Answer by GPT4

在目前的市场上，有几个关于RAG（检索增强生成）的实用示例和库。以下是一些主要的工具和框架，它们支持RAG模式的实现和应用：

Hugging Face的RAG：可以微调RAG模型，结合了DPR（Dense Passage Retrieval）和seq2seq技术，以增强数据检索和生成的特定任务。
RAG框架和库：
- Haystack：由Deepset提供的端到端框架，用于文档检索、阅读理解和问答。
- FARM：Deepset的Transformer库，使用PyTorch构建RAG系统。
- REALM：Google的工具包，用于利用RAG技术进行开放域问题回答。
- LangChain：支持包括提示和外部API在内的步骤链，使大型语言模型能更准确、迅速地回答问题。
支持组件：
- Jina AI：专为神经搜索设计的领先开源向量数据库，实现高性能知识检索。
- Milvus：优化了相似性搜索工作负载的向量数据库，由Zilliz支持。
- Dense Passage Retrieval (DPR)：由Facebook开发，用于有效的语义相似性搜索的段落编码。
- ColBERT：Microsoft开发的用于提取高度相关段落的最新神经检索模型。

这些工具和框架共同支持RAG模式的实施，涵盖从数据检索到文本生成的全过程。RAG系统的两个关键阶段是检索和生成。在检索阶段，使用算法（如BM25、DPR或ColBERT）拉取输入的相关信息。在生成阶段，基于检索到的上下文产生响应文本。

Retrieval-Augmented Generation (RAG) Research Paper Reading & Demo: 这个视频提供了对RAG研究论文的分析和演示。

Retrieval-Augmented Generation (RAG) Architecture in Large Language Models: 这个视频探讨了在大型语言模型中RAG架构的使用。

Retrieval-Augmented Generation chatbot, part 1: LangChain, Hugging Face, FAISS, AWS

OctoberFox11 commented 11 months ago

流程概览

开始（Start）
- 流程的起始点，接收用户的问题或查询。
问题查询（Open）
- 初步分析用户的问题。
- 判断问题类型（如编程问题、数据查询、技术指导等）。
使用RAG系统（Put[RAG]）
- RAG系统处理问题，确定是否存在已知的解决方案或相关信息。
- 使用向量数据库和字段精确搜索等多种手段综合检索
LLM相关性判断（IsFit）
- 由LLM判断资料是否和Query相关
适用LLM逻辑处理（Close[发送相关信息至LLM]）
- 将问题和RAG系统找到的相关信息一起送至LLM处理
- LLM生成解决方案或答案
输出结果（End）
不适用LLM逻辑处理（CloseB[没有相关信息]）
- 提供标准无信息回复，或引导用户寻找其他资源
无解决方案（End2）
- 通知用户此问题当前没有可行的解决方案
- 可能建议用户寻求其他渠道的帮助

LOGIC-10 commented 11 months ago

LLM相关性判断（IsFit）

由LLM判断资料是否和Query相关

适用LLM逻辑处理（Close[发送相关信息至LLM]）

将问题和RAG系统找到的相关信息一起送至LLM处理

LLM生成解决方案或答案

LLM相关性判断（IsFit）这方面你的意思是把多路召回的结果都送给LLM但是不要它回答？而是给出一个相关性判断？

Guo-Zhang commented 11 months ago

Why Chroma? What is the advantage of it compared to others?
The "tech route" seems to be a standard RAG. What will be the difficulties?

innovation64 commented 11 months ago

Why Chroma? What is the advantage of it compared to others?

The "tech route" seems to be a standard RAG. What will be the difficulties?

@LOGIC-10 @Guo-Zhang well , first for the Chroma. I don't think this is an important thing in this design system. that's not a big deal. whatever you choose Chroma, pinecone or weaviate. all of these are fine. so we just need a vector db, so never mind.

and second, frankly speaking, the tech route is just a draft based on my previous knowledge, i will correct it, i know it has a lot of problems, so I just list a formal one. based on what I searched, there are a lot of difficulties.

far from now, here is my summary source: https://innovation64.github.io/2023/12/27/RAG/

I need more time to get to know the repo and code tools,

base all of the above things,that I maybe can design an initial draft chat with repo system

so please be patient, I'm doing more research on it and will give you guys a solution eventually

innovation64 commented 11 months ago

Here is code part without rag sys Answer by GPT4

Identify File Types: Determine the types of files you want to process. This may include source code files (.py, .js, .java, etc.), configuration files (.xml, .json, .yml), documentation files (.md, .txt), and others.
Access Repository Contents via GitHub API: Use the GitHub API to access the contents of the repository. The key API endpoint to use is:
- GET /repos/{owner}/{repo}/contents/{path}: This retrieves the contents of a specific path within the repository. If the path is a directory, it returns a list of all files and subdirectories within it.
Recursively Traverse the Repository: To get files from all directories in the repository, recursively traverse each directory. For each directory, use the API endpoint to retrieve its contents and download the relevant files.
Download and Store Files: Download the content of each identified file, which may require handling different file formats and possibly converting them to a uniform format for processing.
Preprocess Files for LLM: Depending on the file type and the intended processing, you might need to preprocess the files. For source code, this might include extracting comments or code snippets. For documentation, it might involve converting markdown or HTML to plain text.
Feed Files into LLM: Once preprocessed, feed the content into the Large Language Model. The LLM can be used for various tasks like code analysis, generating documentation, summarizing changes, etc.
Process LLM Output: Use the output from the LLM as needed, which could include generating insights, automating tasks, or further analyzing the content.

Remember to comply with GitHub's terms and API rate limits. Also, consider the computational resources and potential costs if processing a large number of files or performing complex analysis.

LOGIC-10 commented 11 months ago

This analysis provided by chatgpt is really too superficial.

I think the 6 steps I mentioned in the meeting that night were very clear and specific, including the technical details and precautions for each item. I suggest that you ask @Umpire2018 for the recording files of the meeting, and then explore more possibilities and try different details based on my existing solution instead of starting from scratch. I believe this will greatly improve your efficiency.

innovation64 commented 11 months ago

This analysis is really too superficial.

I think the 6 steps I mentioned in the meeting that night were very clear and specific, including the technical details and precautions for each item. I suggest that you ask @Umpire2018 for the recording files of the meeting, and then explore more possibilities and try different details based on my existing solution instead of starting from scratch. I believe this will greatly improve your efficiency.

@LOGIC-10 I know, I'm still designing a chat repo with rag sys, but not put it into issue. and I will check out the 6 steps you mentioned that night, to make sure that i would not lose something

LOGIC-10 commented 11 months ago

That's great. Looking forward to your implementation in several days.

Umpire2018 commented 11 months ago

RAG part

langchain

llamaindex(recommended)

👨‍👩‍👧‍👦 Who is LlamaIndex for?

LlamaIndex provides tools for beginners, advanced users, and everyone in between.

Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code.

For more complex applications, our lower-level APIs allow advanced users to customize and extend any module—data connectors, indices, retrievers, query engines, reranking modules—to fit their needs.

@innovation64 个人认为对 RAG 系统提出的讨论已经足够，接下来就可以使用现成的框架去进行开发了。目前看来使用 llamaindex , Chroma 是比较好的选择。Let's do it!

那么你对此问题的分工有什么见解？比如说是否你先来开始，后续有什么需要我去做的你再指派给我？

innovation64 commented 11 months ago

RAG part

langchain

llamaindex(recommended)

👨‍👩‍👧‍👦 Who is LlamaIndex for?

LlamaIndex provides tools for beginners, advanced users, and everyone in between.

Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code.

For more complex applications, our lower-level APIs allow advanced users to customize and extend any module—data connectors, indices, retrievers, query engines, reranking modules—to fit their needs.

@innovation64 个人认为对 RAG 系统提出的讨论已经足够，接下来就可以使用现成的框架去进行开发了。目前看来使用 llamaindex , Chroma 是比较好的选择。Let's do it!

那么你对此问题的分工有什么见解？比如说是否你先来开始，后续有什么需要我去做的你再指派给我？

ok,i will try it first

innovation64 commented 11 months ago

目前进度，用gradio搭建基本的chatbot完毕目前功能

多轮对话
gradio UI
目前只使用openai API 没有增加rag

RAG部分

基本流程运行出现OOD

默认的llamaindex的嵌入模型生成repo的index会显存益处，改用opeaiembedding处理

Umpire2018 commented 11 months ago

用gradio搭建基本的chatbot完毕

也许可以先从命令行 Chat 开始，之后再考虑前端的事情。

innovation64 commented 11 months ago

目前进度

基本的gradio 交互页面已完成
基本的rag pipeline已完成
尚未接入向量数据库,正在完善本地保将json化向量接入数据库
正在进行对于代码chunck，目前参考
目前代码较乱还没有封装整合暂时不宜PR

LOGIC-10 commented 11 months ago

🚀cool，你可以创建一个分支chat_with_repo，我们可以看着分支进行参考和下一步讨论，这样会更清晰。

innovation64 commented 11 months ago

目前进度 January 4th

完成基本交互界面
完成基本多轮对话界面
完成Automering 检索算法的 RAG
完成数据库存储

剩余任务

前置query的规范
数据库更新触发机制
代码文件chunk单独分类与合并树节点向量化
rerank优化问题

Umpire2018 commented 10 months ago

Finished in #32 .

OpenBMB / RepoAgent

chat with repo workflow issue #25

Chat with Repo 项目要求

核心概念

具体要求

Aiming group

Searching and summary

Some tools and bags

Technical route

Explaination

First we need to deal with the user's query

Second chunk the repo files

Matching

Reranking

Result

Details with code

RAG part

base RepoAgent part

current logic

some problem

Competitive product analysis

流程概览

RAG part

RAG part