Closed innovation64 closed 10 months ago
29th Dec
Waiting for completion
So this is a basic RAG system for chatting with a repo.
When the user sends a query to the chatbot, we need the chat history and the query to judge whether the query fits the rules. If it fits, we compress the history, find the information relevant to this query, and append it; then we chunk and vectorize everything and send it to the vector DB to retrieve the relevant repo and MD vectors (we call this step 1). If it does not fit the rules, we send it to the LLM to rebuild the query: if too short, enlarge it; if too long, summarize it; if not logical, smooth it; in short, rewrite it in a better way. Then repeat step 1.
There are lots of files, so we need to summarize them and use the MD files (generated by RepoAgent) as a summary index. Then chunk the whole project in detail into a single vector store.
Step 1's output is retrieved first against the summarized index vectors (as a filter), then matched against the detail vectors.
Find the top relevant info.
Send the relevant info and the query to the LLM, tag it with the source id/hook, and generate an answer with references.
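The judge-and-rewrite loop described above can be sketched as follows. This is a minimal sketch, assuming a pluggable `llm` callable (any chat-completion wrapper would do); the rule check here is a stand-in length heuristic, not the actual rules, which are still to be defined:

```python
def rewrite_until_fit(query, llm, min_len=8, max_len=200, max_rounds=3):
    """Judge the query against simple rules; if it fails, ask the LLM
    to rewrite it (enlarge / summarize / smooth) and try again."""
    for _ in range(max_rounds):
        if min_len <= len(query) <= max_len:
            return query  # fits the rules -> proceed to step 1
        if len(query) < min_len:
            instruction = "Expand this query with more detail: "
        else:
            instruction = "Summarize this query concisely: "
        query = llm(instruction + query)
    return query

# Usage with a fake LLM that just pads short queries:
fake_llm = lambda prompt: prompt.split(": ", 1)[1] + " (in this repository)"
print(rewrite_until_fit("bug?", fake_llm))
```

A real deployment would replace the length heuristic with the "is this a good question?" function call mentioned below, but the control flow stays the same.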
User's query.
Define a prompt: what is a good question?
An LLM function call outputs a good question.
Then compress the history dialogue:
an LLM function call compresses it (token budget based on the LLM model's capability).
Vectorize: LlamaIndex chunks the content, voyager SEO for vectorizing, then send it to the vector DB.
Search index; current candidate algorithms:
Chroma:
Pinecone:
Weaviate:
Repo part: separate it into two parts.
Reranking & filtering: LlamaIndex,
top-k.
Fed to LLM: put the query and the relevant info from the retrieval outputs into the LLM; remember to tag the .py vector source.
Result:
generate a formatted result with references.
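The two-part retrieval idea (filter with the summary index first, then match the detail vectors) can be sketched independently of any particular vector DB. The toy 3-d "embeddings" and file names below are made-up stand-ins for real embedding vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query_vec, summaries, details, top_files=2, top_k=3):
    """Stage 1: rank file-level summary vectors and keep the top files.
    Stage 2: rank only the detail chunks belonging to those files."""
    ranked_files = sorted(summaries, key=lambda f: cosine(query_vec, summaries[f]),
                          reverse=True)[:top_files]
    candidates = [(f, i, vec) for f in ranked_files
                  for i, vec in enumerate(details[f])]
    candidates.sort(key=lambda t: cosine(query_vec, t[2]), reverse=True)
    return [(f, i) for f, i, _ in candidates[:top_k]]

# Toy example with hand-made 3-d "embeddings":
summaries = {"parser.py": [1, 0, 0], "ui.py": [0, 1, 0], "db.py": [0, 0, 1]}
details = {"parser.py": [[0.9, 0.1, 0], [0.8, 0, 0.2]],
           "ui.py": [[0, 1, 0]], "db.py": [[0, 0, 1]]}
print(two_stage_retrieve([1, 0, 0.1], summaries, details))
```

In the real system the summary vectors would come from the RepoAgent-generated MD files and the detail vectors from the chunked project, with the vector DB doing both nearest-neighbor searches.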
PS: answer by GPT-4. For code generation and matching scenarios, choosing the right vector database depends on several key factors, such as data types, query performance, and support for the structure of specific programming languages. Based on your description, here is an analysis of the potential suitability of Chroma, Pinecone, and Weaviate:
Chroma:
- If Chroma has special handling or optimizations for code data types, it may be a good fit for code repositories.
- Evaluate its efficiency in processing and indexing code-related data.
Pinecone:
- Pinecone's performance advantage on high-dimensional data may help with similarity search over code snippets.
- Consider its ability to retrieve large amounts of code data quickly.
Weaviate:
- Weaviate's semantic search capability may be especially useful for understanding and matching the semantic structure of code.
- If your application involves understanding the intent or function of code, Weaviate may be more suitable.
Choosing a database for code generation and matching:
- Semantic understanding: if your application needs a deep understanding of code intent or functionality, a database with strong semantic processing (such as Weaviate) may fit better.
- Similarity search: if you focus on similarity matching between code snippets, a database specialized in similarity search (such as Pinecone) may be the better choice.
- Data processing and indexing: consider the database's ability to process and index large amounts of code data, and its support for specific programming languages.
Before deciding, you may also want to weigh other factors such as ease of use, integrations, and community/technical support. It is recommended to run small-scale experiments with the promising candidates to evaluate their performance in your actual code generation and matching scenarios.
developer: Yang Lee
Answer by GPT4
On the current market there are several practical examples and libraries for RAG (Retrieval-Augmented Generation). Below are some of the main tools and frameworks that support implementing and applying the RAG pattern:
Hugging Face's RAG: RAG models can be fine-tuned; they combine DPR (Dense Passage Retrieval) with seq2seq techniques to enhance retrieval and generation for specific tasks.
RAG frameworks and libraries:
Supporting components:
Together, these tools and frameworks support implementing the RAG pattern, covering the whole pipeline from data retrieval to text generation. A RAG system has two key stages: retrieval and generation. In the retrieval stage, an algorithm (such as BM25, DPR, or ColBERT) pulls in information relevant to the input. In the generation stage, the response text is produced from the retrieved context.
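The retrieval-stage algorithms mentioned above can be illustrated with a tiny from-scratch BM25 scorer; this is a teaching sketch, not any particular library's implementation:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [["rag", "retrieval", "generation"],
        ["vector", "database", "chroma"],
        ["retrieval", "augmented", "generation", "chatbot"]]
scores = bm25_scores(["retrieval", "generation"], docs)
print(scores)
```

DPR and ColBERT replace this lexical scoring with learned dense representations, but the role in the pipeline (rank candidates before generation) is the same.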
Retrieval-Augmented Generation (RAG) Research Paper Reading & Demo: this video provides an analysis and demo of the RAG research paper.
Retrieval-Augmented Generation (RAG) Architecture in Large Language Models: this video explores the use of the RAG architecture in large language models.
Retrieval-Augmented Generation chatbot, part 1: LangChain, Hugging Face, FAISS, AWS
Start (Start)
Question query (Open)
Use the RAG system (Put[RAG])
LLM relevance judgment (IsFit)
Applicable: LLM logic processing (Close[send the relevant info to the LLM])
Output the result (End)
Not applicable: LLM logic processing (CloseB[no relevant information])
No solution (End2)
LLM relevance judgment (IsFit)
- The LLM judges whether the material is relevant to the query
Applicable: LLM logic processing (Close[send the relevant info to the LLM])
- Send the question together with the relevant information found by the RAG system to the LLM
- The LLM generates a solution or answer
Regarding the LLM relevance judgment (IsFit): do you mean sending the results of multi-path recall to the LLM but not asking it to answer, only to give a relevance judgment?
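That judge-only step could look like the sketch below, where `llm` is a hypothetical callable (any chat-completion wrapper) prompted to output only YES or NO, never an answer:

```python
def judge_relevance(query, chunks, llm):
    """Ask the LLM only for a relevance verdict per retrieved chunk,
    not for an answer. Returns the chunks judged relevant."""
    relevant = []
    for chunk in chunks:
        prompt = (
            "Answer only YES or NO. Is the following passage relevant "
            f"to the question?\nQuestion: {query}\nPassage: {chunk}"
        )
        if llm(prompt).strip().upper().startswith("YES"):
            relevant.append(chunk)
    return relevant

# Usage with a fake keyword-based "LLM" for demonstration:
fake_llm = lambda p: "YES" if "parser" in p.split("Passage:")[1] else "NO"
chunks = ["the parser module reads ASTs", "UI color settings"]
print(judge_relevance("how does the parser work?", chunks, fake_llm))
```

Only the surviving chunks would then be sent on for answer generation, which keeps irrelevant recall results out of the final prompt.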
- Why Chroma? What is the advantage of it compared to others?
- The "tech route" seems to be a standard RAG. What will be the difficulties?
@LOGIC-10 @Guo-Zhang Well, first, about Chroma: I don't think the choice of vector DB is an important part of this design. It's not a big deal whether you choose Chroma, Pinecone, or Weaviate; all of them are fine. We just need a vector DB, so never mind.
Second, frankly speaking, the tech route is just a draft based on my previous knowledge. I will correct it; I know it has a lot of problems, so I will write up a formal one. Based on what I've searched, there are a lot of difficulties.
So far, here is my summary source: https://innovation64.github.io/2023/12/27/RAG/
I need more time to get to know the repo and code tools.
Based on all of the above, I can then design an initial draft of the chat-with-repo system.
So please be patient; I'm doing more research on it and will give you guys a solution eventually.
Here is the code part without the RAG system. Answer by GPT-4:
Identify File Types: Determine the types of files you want to process. This may include source code files (.py, .js, .java, etc.), configuration files (.xml, .json, .yml), documentation files (.md, .txt), and others.
Access Repository Contents via GitHub API: Use the GitHub API to access the contents of the repository. The key API endpoint to use is:
GET /repos/{owner}/{repo}/contents/{path}: This retrieves the contents of a specific path within the repository. If the path is a directory, it returns a list of all files and subdirectories within it.
Recursively Traverse the Repository: To get files from all directories in the repository, recursively traverse each directory. For each directory, use the API endpoint to retrieve its contents and download the relevant files.
Download and Store Files: Download the content of each identified file, which may require handling different file formats and possibly converting them to a uniform format for processing.
Preprocess Files for LLM: Depending on the file type and the intended processing, you might need to preprocess the files. For source code, this might include extracting comments or code snippets. For documentation, it might involve converting markdown or HTML to plain text.
Feed Files into LLM: Once preprocessed, feed the content into the Large Language Model. The LLM can be used for various tasks like code analysis, generating documentation, summarizing changes, etc.
Process LLM Output: Use the output from the LLM as needed, which could include generating insights, automating tasks, or further analyzing the content.
Remember to comply with GitHub's terms and API rate limits. Also, consider the computational resources and potential costs if processing a large number of files or performing complex analysis.
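The traversal steps above can be sketched as follows. The `fetch_json` callable is injected so the walker can be exercised without the network; in real use it would wrap an authenticated GET to the GitHub contents endpoint, and the repository tree below is a made-up example:

```python
CODE_EXTS = {".py", ".js", ".java", ".md", ".txt", ".json", ".yml", ".xml"}

def walk_repo(owner, repo, fetch_json, path=""):
    """Recursively list files of interest via the GitHub contents API.
    fetch_json(url) must return the parsed JSON for that endpoint."""
    url = f"/repos/{owner}/{repo}/contents/{path}"
    files = []
    for entry in fetch_json(url):
        if entry["type"] == "dir":
            files += walk_repo(owner, repo, fetch_json, entry["path"])
        elif any(entry["name"].endswith(ext) for ext in CODE_EXTS):
            files.append(entry["path"])
    return files

# Test drive with a fake in-memory "repository":
tree = {
    "/repos/me/demo/contents/": [
        {"type": "dir", "path": "src", "name": "src"},
        {"type": "file", "path": "README.md", "name": "README.md"},
        {"type": "file", "path": "logo.png", "name": "logo.png"},
    ],
    "/repos/me/demo/contents/src": [
        {"type": "file", "path": "src/main.py", "name": "main.py"},
    ],
}
print(walk_repo("me", "demo", tree.__getitem__))
```

Injecting the fetcher also gives a natural place to add rate-limit handling and pagination before wiring in the real API.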
This analysis provided by ChatGPT is really too superficial.
I think the 6 steps I mentioned in the meeting that night were very clear and specific, including the technical details and precautions for each item. I suggest that you ask @Umpire2018 for the recording files of the meeting, and then explore more possibilities and try different details based on my existing solution instead of starting from scratch. I believe this will greatly improve your efficiency.
@LOGIC-10 I know. I'm still designing the chat-with-repo RAG system but haven't put it into an issue yet. I will check out the 6 steps you mentioned that night to make sure I don't miss anything.
That's great. Looking forward to your implementation in several days.
RAG part
- langchain
- llamaindex(recommended)
👨👩👧👦 Who is LlamaIndex for?
LlamaIndex provides tools for beginners, advanced users, and everyone in between.
Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code.
For more complex applications, our lower-level APIs allow advanced users to customize and extend any module—data connectors, indices, retrievers, query engines, reranking modules—to fit their needs.
@innovation64 Personally I think the discussion of the RAG system is sufficient; next we can develop with an off-the-shelf framework. For now, llamaindex and Chroma look like the better choices. Let's do it!
Then what's your view on dividing up the work for this issue? For example, should you start first and then assign me whatever you need me to do later?
OK, I will try it first.
Current progress: finished building a basic chatbot with Gradio. Current features:
RAG part
Generating the repo index with LlamaIndex's default embedding model overflows GPU memory, so I switched to OpenAI embeddings for processing.
Finished building a basic chatbot with Gradio.
Maybe start with a command-line chat first, and think about the frontend later.
Current progress
🚀 Cool. You could create a branch chat_with_repo so we can look at the branch for reference and discuss the next steps; that would be clearer.
Current progress, January 4th
Remaining tasks
Finished in #32.
Chat with Repo project requirements
Core concepts
Specific requirements
Dynamically update the vectors corresponding to document chunks
Organize documents and code blocks
Code integration
Handle reference relationships
LLM summarization and answering
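The "dynamically update vectors for document chunks" requirement above can be approached by hashing each chunk and re-embedding only the chunks whose content changed. A minimal sketch, assuming a stand-in `embed` function (a real system would call an embedding model or the vector DB's upsert API instead):

```python
import hashlib

def embed(text):
    # Stand-in embedding: real use would call an embedding model instead.
    return [float(b) for b in hashlib.md5(text.encode()).digest()[:4]]

def sync_chunks(chunks, store):
    """Re-embed only new/changed chunks; drop vectors for deleted ones.
    `store` maps chunk_id -> (content_hash, vector)."""
    seen = set()
    for chunk_id, text in chunks.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        seen.add(chunk_id)
        if chunk_id not in store or store[chunk_id][0] != h:
            store[chunk_id] = (h, embed(text))  # new or modified chunk
    for chunk_id in list(store):
        if chunk_id not in seen:
            del store[chunk_id]  # chunk was removed from the repo
    return store

store = {}
sync_chunks({"a.py:0": "def f(): pass", "README:0": "hello"}, store)
old_vec = store["README:0"][1]
# Edit one chunk; only that chunk gets re-embedded on the next sync:
sync_chunks({"a.py:0": "def f(): return 1", "README:0": "hello"}, store)
```

Keying chunks by file path plus chunk index also makes it straightforward to carry the source id/hook through to the final referenced answer.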