Marker-Inc-Korea / AutoRAG

RAG AutoML Tool - Find optimal RAG pipeline for your own data.
Apache License 2.0
1.1k stars 94 forks

Batch API Support and Incremental Data Addition/Subtraction? #417

Open kirillocha opened 2 months ago

kirillocha commented 2 months ago

Hey AutoRAG developers,

Thanks for the great work on AutoRAG. It's been a valuable tool for optimizing RAG pipelines. I have a couple of feature requests that could make it even more powerful and efficient:

  1. Batch API Support: With OpenAI's recent release of the Batch API, there's an opportunity to cut the cost of generating QA data from corpus data by up to 50%. Integrating Batch API support into AutoRAG would be a big win for users working with large datasets and limited budgets.

  2. Incremental Data Addition/Subtraction: The current AutoRAG pipeline requires creating corpus data and then generating QA data from it sequentially, as I understand it. However, there are scenarios where users have additional data to incorporate into an existing RAG pipeline, or need to remove or replace parts of it (e.g., renewed docs). It would be great to know whether there are any plans or existing methods to support incremental data addition, or how this could be implemented.
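To make the first request concrete, here is a minimal sketch of how QA-generation calls could be packaged for the OpenAI Batch API, which expects a JSONL file where each line is one request with a `custom_id` for matching results back. The `build_batch_requests` helper, the prompt wording, and the `(doc_id, text)` chunk format are all illustrative assumptions, not AutoRAG's actual internals:

```python
import json

def build_batch_requests(chunks, model="gpt-4o-mini"):
    """Build OpenAI Batch API request lines (JSONL) for QA generation.

    `chunks` is a list of (doc_id, text) pairs. The system prompt below is a
    placeholder, not AutoRAG's real QA-generation prompt.
    """
    lines = []
    for doc_id, text in chunks:
        request = {
            "custom_id": f"qa-{doc_id}",  # used to match batch results to docs later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system",
                     "content": "Generate one question and answer grounded in the passage."},
                    {"role": "user", "content": text},
                ],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)
```

The resulting JSONL string would then be uploaded with `purpose="batch"` and submitted via `client.batches.create(endpoint="/v1/chat/completions", completion_window="24h")`, with results collected asynchronously at the discounted batch rate.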

These feature requests may require significant development effort, but they would greatly enhance the usability and efficiency of AutoRAG. If there's anything I can do to help or clarify further, please let me know.

Thanks for considering these feature requests. Looking forward to hearing your thoughts.

vkehfdl1 commented 2 months ago

Hi @kirillocha, thanks for opening a new issue.

First, the Batch API looks great. We will consider implementing it in the data creation process.

Second, there are cases where users may want to use AutoRAG with a pre-ingested corpus. For example, if they already have millions of embeddings in their own vector DB, they may want to use it for benchmarking. For now, we do not support that feature. We could add a migration option for local Chroma, or a vector DB option via LlamaIndex; users certainly do not want to embed all their documents again. Also, if you change your QA or corpus data, we currently strongly recommend creating a new project folder, which re-embeds the whole corpus. That is inefficient, so we should find a way to resolve it.
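One common way to avoid re-embedding the whole corpus, sketched below under assumed data shapes (this is not AutoRAG code), is to key embeddings by a content hash so that only new or changed documents hit the embedding model, while removed documents drop out naturally:

```python
import hashlib

def incremental_embed(corpus, cache, embed_fn):
    """Embed only documents whose content changed since the last run.

    `corpus` maps doc_id -> text, `cache` maps doc_id -> (content_hash, embedding),
    and `embed_fn` stands in for whatever embedding model is in use.
    Returns a fresh cache: unchanged docs reuse their stored embedding,
    new or renewed docs are re-embedded, and deleted docs disappear.
    """
    updated = {}
    for doc_id, text in corpus.items():
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_id in cache and cache[doc_id][0] == content_hash:
            updated[doc_id] = cache[doc_id]               # reuse existing embedding
        else:
            updated[doc_id] = (content_hash, embed_fn(text))  # new or renewed doc
    return updated
```

The same hash-and-diff idea applies when syncing against an external vector DB: compute which ids are new, changed, or gone, then issue only the corresponding upserts and deletes instead of rebuilding the index.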

I'll discuss this issue with my teammates and post updates here. (Check out this issue @bwook00 @Eastsidegunn)

Thanks again @kirillocha πŸ‘

kirillocha commented 2 months ago

Thank you for addressing my suggestions in such detail. I appreciate your openness to enhancing AutoRAG and your dedication to the project, as shown by your frequent commits. Thanks again for considering my input.