Rethinking The Design - Githubissues

koleshjr commented 6 months ago

First of all great work. I was looking to contribute to the repo but the design is a bit a limiting factor. This is my idea though and you can tell me your thoughts. First you have only focused on pinecone, what if someone wants to use milvus, or chroma or faiss or qdrant? The indexing part , you have used the default splitter of loaders. But that's not usually enough based on my experience. What if I want to use recursive character splitter, or spacy's sentence splitter? The loading part is okay as you have catered for multiple files and can be easily scaled to other documents too. But you have hidden it , it is not easily retrievable.

You current design: embedding to other steps . My proposed design to make this generalizable and useful to other choices too.

My suggestion break each part into its own module: document_loaders, embedding models, splitters , vectordatabases to make each step scalable independently and lastly build a retrieval step that is easily flexible to the modules created.

I would love to help in this redesigning if it is something you are willing to work on

Thanks

KevKibe commented 6 months ago

Thanks for the feedback.

I think Chroma and Faiss are not suitable for production, unless there's an offering a service for storing embeddings. I have no experience with milvus or qdrant, possible you could help with that. For splitting, I was going for the best possible general-use splitter RecursiveCharacterTextSplitter but makes sense giving users more options.

Could you elaborate on your statement: 'The loading part is fine as it handles multiple file types and can be easily scaled to other documents as well. However, it's not easily retrievable due to being somewhat hidden.'?

I would very much appreciate your help

koleshjr commented 6 months ago

Thanks for replying. Yeah true Chroma and faiss are not production ready but for people like me who would love to use docindex for experiments? It will force me to look for other alternatives so I think it is nice to have choices for experimenting too, and production as well.

The loading part is only about visibility. When I come to the repo It is not clear where I can look for supported document loaders. I have to traverse the files to see which one has the dataloaders. So if I wanted to confirm if docindex supports csv loader I will take a lot of time to get where they are.

Also docindex can be much more. I guess at the moment you are only focusing on the indexing part, but also you could add support for retrieval as well. So it will be a one stop fashion for all RAG use cases. That would be a package I would want to use instead of writing the retrieval again from scratch.

Suggested refactor: src: -document loaders: a class with all possible document loaderrs tha can easily be scaled alone -embeddings: a class with all possible embeddings choices not just google and openai , what about mistral and bge? -splitters: a class with support of multiple splitters

vector database: a class with support of multiple vector databases
retrieval: a class with support of different types of retrievers

now then create an index.py that will be flexible to the modules above giving the user multiple choices to choose from and lastly a retrieval.py that he/she can chose different methods of retrievers

also think about the new methods of reranking and how we can include in the project.

Goal if I have a RAG problem. This is the one stop shop I would like to come to use

KevKibe commented 6 months ago

Sounds good, so we give developers the flexibility to:

Load various document types and configure different splitting options.
Convert source data into embeddings using a selection of embedding service providers.
Store embeddings across different vector database providers.
Explore various retrieval methods with reranking and additional capabilities.

For an extra touch, a deployment pipeline with logging and monitoring. Anything else I've let out?

Feel free to make the changes and submit a PR

koleshjr commented 6 months ago

Not really. Flexibility is the key here. Sure will work on it this night. I will work on the different modules. I am not good in logging, monitoring and deployement and I will be glad to learn from you :)

koleshjr commented 6 months ago

It looks like I am going to break a lot of things.

KevKibe commented 6 months ago

Haha no worries. You can push to main instead of master, we build everything on there and then after we're done with all the breaking changes, push to master

koleshjr commented 6 months ago

Nice

koleshjr commented 6 months ago

I am going to close this issue for now. I understand what docindex is right now :). My changes were going to change what you intended docindex to be and I do not want that. I am going to take inspiration from this project and implement the changes mentioned above to a new package and leave docindex to be what you intended it to be.

Thanks

KevKibe commented 6 months ago

That's okay Do you mind helping out with implementation of retrieval and reranking modules on DocIndex?

The new LCEL chains are kicking my ass, I'm more conversant with the previous retrieval chains(without .invoke() )😅.

So that we have:

Source document Indexing -> Embeddings storage -> VectorStore Initialization -> Retrieval and Reranking

We can then start working on this new package

koleshjr commented 6 months ago

sure I will try to work on it before the week ends

KevKibe / docindex

Rethinking The Design #15