Closed: Steve235lab closed this issue 5 months ago
@KillianLucas @Notnaton @MikeBirdTech @CyanideByte @tyfiero Do you think this will be a good feature? If so, I would start to implement this.
It is a good feature, and it would save a lot of money. It would act like a database for OI and also help in further conversations.
Long-term memory management is a must in order for this project to become practical. RAG is a possible solution, but there are also related discussions here which you may want to look into in order to avoid duplicate efforts.
My bad, I searched for related issues but filtered for "open" ones only.
Never mind, I will implement this myself because I need it. No PR will be opened for this, to avoid polluting the codebase.
What is RAG?
A brief explanation of RAG by GPT-4 Turbo:
And here's a detailed introduction to RAG: Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models (meta.com)
Why RAG?
Currently OI uses a simple strategy to maintain the context of conversations: when the conversation exceeds `context_window` (`setting.context_window`), remove the earliest messages from the context until it fits. This strategy works well for most daily tasks; however, there are several problems with conversations that need long context, like using the LLM as an assistant to summarize research papers in one field:

Before the conversation reaches `context_window`, the length of the context sent to the LLM grows linearly as the conversation goes on. With models having larger and larger context windows (for example, the current default model `openai/gpt-4-turbo`, which has a 128K context window), this will cost a lot if a user keeps asking questions in a single conversation. What's even more horrifying, once a conversation exceeds the `context_window`, the cost of each following request won't grow further but will stay at a very expensive price.

With RAG, we can convert the history messages of the current conversation into embeddings and store them in a vector database. Every time there's a new message from the user, we can use the user's input as a query to retrieve the most relevant context from the vector database and put it into the context sent to the LLM. This way, we can have flexible context length for different questions and more useful information in the context. Besides, we can do more with RAG in the future; for example, we can add an interface for users to import local documents as background information for conversations.
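To make the proposed flow concrete, here is a minimal sketch of embed-store-retrieve over conversation history. This is not OI's actual code: the `MessageStore` class is hypothetical, and a toy bag-of-words vector stands in for a real embedding model (a real implementation would call an embedding API and a vector database).

```python
# Minimal sketch of the proposed RAG flow over conversation history.
# Toy bag-of-words "embeddings" and an in-memory list stand in for a
# real embedding model and a real vector database.
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: a term-frequency vector. A real implementation
    # would call an embedding model instead.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class MessageStore:
    """Stand-in for a vector database holding past messages."""

    def __init__(self):
        self._items = []  # (embedding, message) pairs

    def add(self, message: str) -> None:
        self._items.append((embed(message), message))

    def search(self, query: str, k: int = 2) -> list:
        # Rank stored messages by similarity to the new user input.
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [msg for _, msg in ranked[:k]]


store = MessageStore()
store.add("We discussed the dataset format: one JSON object per line.")
store.add("The user prefers answers in French.")
store.add("Paper A proposes retrieval augmented generation for QA.")

# New user message: retrieve only the most relevant history instead of
# sending the whole conversation to the LLM.
context = store.search("What did paper A propose?", k=1)
print(context)  # → ['Paper A proposes retrieval augmented generation for QA.']
```

The retrieved messages, plus the new user input, would then form the context sent to the LLM, so the request size depends on how much history is relevant rather than on how long the conversation has run.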
All in all, for an LLM client (both as an application kernel and as a standalone CLI), context management is a low-level but important part, and spending some effort on it will be worthwhile.
How to implement RAG in OI?
I think langchain-ai/langchain: 🦜🔗 Build context-aware reasoning applications (github.com) would be a great library for bringing in RAG as well as other features to enhance OI's context management. Anyway, this will be a tough and huge task, involving a lot of research, development, and testing. Implementation details will be updated later. BTW, I am planning to implement this as an optional feature that is off by default, which means it is only for users who know well what they are playing with.
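As a sketch of the "optional, off by default" wiring: the function below falls back to the current trim-earliest-messages strategy unless RAG is explicitly enabled. Names like `rag_enabled`, `trim_to_window`, and `build_context` are hypothetical, not actual OI settings or functions, and lengths are counted in characters here where a real implementation would count tokens.

```python
# Hypothetical sketch: gate RAG behind an opt-in flag, keeping the
# existing truncation strategy as the default behavior.

def trim_to_window(messages: list, context_window: int) -> list:
    # Current strategy: drop the earliest messages until the total
    # length fits (character counts stand in for token counts).
    trimmed = list(messages)
    while trimmed and sum(len(m) for m in trimmed) > context_window:
        trimmed.pop(0)
    return trimmed


def build_context(messages, user_input, context_window,
                  rag_enabled=False, retriever=None):
    # Off by default: behave exactly like today's truncation.
    if not rag_enabled or retriever is None:
        return trim_to_window(messages + [user_input], context_window)
    # Opt-in: keep only the history relevant to the new message, then
    # trim as a safety net in case retrieval still overshoots.
    relevant = retriever(user_input)
    return trim_to_window(relevant + [user_input], context_window)


history = ["msg one " * 10, "msg two " * 10, "short note"]
ctx = build_context(history, "new question", context_window=40)
print(ctx)  # → ['short note', 'new question']
```

With `rag_enabled=True`, `retriever` would be backed by the vector database described above; users who leave the flag off see no behavior change.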