Raudaschl / rag-fusion

Rag fusion rw 002 vector database #3

Open richardwhiteii opened 1 year ago

richardwhiteii commented 1 year ago

Implement vector search using Chroma DB, which was the first vector database I found that I could quickly understand. I expect this choice is notional and that the project will later support any vector database.

This migrates vector search from random mock data to using the Chroma database. Document text and metadata are retrieved from Chroma and passed through the pipeline. Additional logging provides visibility into the process. Reciprocal rank fusion is updated to work with the Chroma results structure.
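As a rough illustration of the fusion step described above, here is a minimal sketch of reciprocal rank fusion over a Chroma-style result. It is not the code from this pull request. It assumes the result is a dict shaped like the return value of Chroma's `collection.query()`, where `ids` and `documents` are lists of lists with one inner list per query, ordered best match first. The function name, the sample data, and the constant `k=60` are illustrative choices, and ranks are counted from 1.

```python
from collections import defaultdict


def reciprocal_rank_fusion(chroma_results, k=60):
    """Fuse the per-query rankings of a Chroma-style result dict.

    Assumption: chroma_results["ids"] and chroma_results["documents"]
    are lists of lists, one inner list per query, best match first.
    """
    scores = defaultdict(float)
    texts = {}
    for ids, docs in zip(chroma_results["ids"], chroma_results["documents"]):
        for rank, (doc_id, text) in enumerate(zip(ids, docs), start=1):
            # Each appearance contributes 1 / (k + rank) to the document's score.
            scores[doc_id] += 1.0 / (k + rank)
            texts[doc_id] = text
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, texts[doc_id], score) for doc_id, score in ranked]


# Hand-written sample standing in for a real query result (hypothetical data).
sample = {
    "ids": [["d1", "d2", "d3"], ["d2", "d4", "d1"]],
    "documents": [
        ["text 1", "text 2", "text 3"],
        ["text 2", "text 4", "text 1"],
    ],
}

fused = reciprocal_rank_fusion(sample)
for doc_id, text, score in fused:
    print(f"{doc_id}  {score:.4f}  {text}")
```

Documents that appear near the top of several query results accumulate the highest fused score, so `d2` and `d1` outrank the documents that appear in only one list.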

This update improves the backend search functionality by using a real vector database while preserving the existing pipeline structure.

TODO:
  • Better understand vector search to remove "random"
  • Remove logging
  • Refactor the functions now that they are larger

richardwhiteii commented 1 year ago

Removed comments and line spacing.

mariozupan commented 1 year ago
Navanit-git commented 1 year ago
  • I would like to see how efficient rag-fusion is on CSV (or PDF) financial data tables.
  • An implementation with a llama2 or mistral model.

Yes. I want to query data from financial balance sheets and P/L sheets.

Raudaschl commented 1 year ago

Hey @richardwhiteii Thank you for submitting this request. I will be reviewing it over the weekend.

Raudaschl commented 1 year ago

Hi @richardwhiteii and @Navanit-git

First off, a huge thanks to both of you for your dedication and hard work on the RAG Fusion project. It's awesome.

However, I'm a bit concerned about the added complexity, especially considering beginners who might be using this project as a stepping stone in their learning journey. While the advanced features and modularity are a boon for experienced developers, they could seem daunting for newcomers. I'd like to highlight a few areas where this complexity could be challenging:

  1. The extensive logging could potentially overshadow the core functionalities we aim to showcase.
  2. Integrating external APIs and databases, albeit powerful, introduces a complexity level that assumes considerable prior knowledge.
  3. The nuanced error handling and environment variable configuration are indeed best practices but might not be as transparent for those just starting out.

To make this more accessible, I propose:

I'd love to hear your thoughts on these suggestions. My goal is to keep the project approachable for developers of all skill levels, and your insights would be crucial in striking this balance.

Thanks again for your invaluable contribution, and I eagerly await your perspective on making the project more beginner-friendly.

Cheers, Adrian

Raudaschl commented 1 year ago
  • I would like to see how efficient rag-fusion is on CSV (or PDF) financial data tables.
  • An implementation with a llama2 or mistral model.

This is a really interesting idea!

richardwhiteii commented 1 year ago

I understand. I can bounce some updates your way, and you can let me know what you think, to make sure I'm going in the right direction.

richardwhiteii commented 1 year ago

I made some updates: specifically, I removed the logging and added docstrings and comments. I also added os.environ["TOKENIZERS_PARALLELISM"] = "false" to address a warning I received.
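For readers unfamiliar with that setting, a minimal sketch of where such a line usually goes is shown here. It assumes the warning came from the Hugging Face tokenizers library, which reads this environment variable, so the variable is set before any library that loads a tokenizer is imported. The placement is an assumption and not necessarily how the pull request does it.

```python
import os

# Set this before importing any library that loads a Hugging Face tokenizer,
# so the tokenizer does not use parallelism and the fork warning is not emitted.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```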

Let me know your thoughts.

What do you envision the beginner-tailored branch looking like?