QuivrHQ / quivr

Open-source RAG Framework for building GenAI Second Brains 🧠 Build productivity assistant (RAG) ⚡️🤖 Chat with your docs (PDF, CSV, ...) & apps using Langchain, GPT 3.5 / 4 turbo, Private, Anthropic, VertexAI, Ollama, LLMs, Groq that you can share with users ! Efficient retrieval augmented generation framework
https://quivr.com

Feature Request: Proposed Integration of Automatic Translation Feature, Enhanced Vector Stores, and Expanded Support for Embedding Models #1104

Closed CMobley7 closed 9 months ago

CMobley7 commented 10 months ago

I am very interested in contributing to the enhancement of the project by integrating a couple of features. I am willing to actively participate in the development of these features and would appreciate guidance on your preferences for their implementation. Here are the detailed proposals and potential approaches:

1. Automatic Translation Feature

Background

While LLMs like GPT-4 and Llama 2 effectively handle high-resource languages, there is a notable gap in their performance with low-resource languages. Additionally, translating high-quality content into low-resource languages can often be time-consuming and costly.

Proposal

I propose the development of a feature that allows users to inquire about documents in high-resource languages, irrespective of the language they speak. The goal is to facilitate the creation and sharing of 'brains' on a larger, more inclusive scale.

Implementation

We could potentially achieve this by integrating a translation service such as Google Translate, or by adding a model available from HuggingFace, such as NLLB-200 (source) or SeamlessM4T (source), to our processing pipeline. This integration seems feasible via Langchain agents, though implementing Google Translate might necessitate a separate feature request to Langchain. The feature could also be developed without relying on agents.

Potential Backend Pipeline:

User Prompt -> Translation Engine -> Document Lookup -> 
Prompt Creation -> Answer Generation -> Translation Engine -> Output
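The pipeline above could be sketched roughly as follows. This is only an illustration: `Translator`, `IdentityTranslator`, and `answer_in_user_language` are hypothetical names, and the no-op translator stands in for a real engine such as Google Translate, NLLB-200, or SeamlessM4T.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Translator(Protocol):
    """Any translation engine (Google Translate, NLLB-200, SeamlessM4T) fits this shape."""
    def translate(self, text: str, source: str, target: str) -> str: ...


@dataclass
class IdentityTranslator:
    """Placeholder engine for illustration; a real one would call a model or API."""
    def translate(self, text: str, source: str, target: str) -> str:
        return text  # no-op: pretend source and target languages are the same


def answer_in_user_language(
    prompt: str,
    user_lang: str,
    doc_lang: str,
    translator: Translator,
    lookup: Callable[[str], str],          # document lookup over the 'brain'
    generate: Callable[[str, str], str],   # prompt creation + answer generation
) -> str:
    # User Prompt -> Translation Engine
    translated_prompt = translator.translate(prompt, source=user_lang, target=doc_lang)
    # -> Document Lookup
    context = lookup(translated_prompt)
    # -> Prompt Creation -> Answer Generation
    answer = generate(translated_prompt, context)
    # -> Translation Engine -> Output
    return translator.translate(answer, source=doc_lang, target=user_lang)
```

Keeping the translator behind a small interface like this would let the backend swap engines (API-based or local HuggingFace models) without touching the rest of the pipeline.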

Frontend Enhancements:

With some direction, I am confident I could develop the backend for this feature but might require assistance with the frontend implementation.

2. Integration with Vector Stores

Proposal

I propose adding Weaviate and Pinecone to the list of supported vector stores. This would likely require a broader discussion to ensure they are added in a way that is maintainable and makes it easy to add further stores in the future.
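One maintainable shape for this would be a small provider-agnostic interface that each store implements. The sketch below is hypothetical (the names `VectorStoreAdapter` and `InMemoryStore` are not Quivr or langchain APIs); the in-memory class is just a toy reference implementation of the interface:

```python
from abc import ABC, abstractmethod


class VectorStoreAdapter(ABC):
    """Hypothetical minimal interface each provider (Supabase/pgvector,
    Weaviate, Pinecone) would implement, so new stores plug in uniformly."""

    @abstractmethod
    def add(self, doc_id: str, embedding: list[float], metadata: dict) -> None: ...

    @abstractmethod
    def search(self, embedding: list[float], k: int) -> list[str]: ...


class InMemoryStore(VectorStoreAdapter):
    """Toy reference implementation, useful for testing code against the interface."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def add(self, doc_id: str, embedding: list[float], metadata: dict) -> None:
        self._rows[doc_id] = (embedding, metadata)

    def search(self, embedding: list[float], k: int) -> list[str]:
        # rank stored documents by dot product with the query embedding
        def score(doc_id: str) -> float:
            stored, _ = self._rows[doc_id]
            return sum(a * b for a, b in zip(embedding, stored))

        return sorted(self._rows, key=score, reverse=True)[:k]
```

With an interface like this, adding Weaviate or Pinecone becomes writing one adapter class each rather than touching call sites throughout the backend.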

Current Developments

This idea seems to align with ongoing developments in the project.

3. Support for Various Embedding Models

Background

Currently, many of the embedding models listed on the HuggingFace leaderboard are not supported, and integrating them so they work with the desired Large Language Model (LLM) is necessary.

Proposal

I propose the inclusion of support for most embedding models listed on the HuggingFace leaderboard and their integration with the desired LLM.

Implementation

I noticed that multiple efforts are already underway. This pull request adds support for multiple LLMs. However, the litellm repository does not appear to support embeddings yet, indicating a potential area for enhancement. GenossGPT, seems to offer a more integrated approach with LangChain. Future collaborations between GenossGPT and LocalAI might also be on the horizon, according to this issue.

I eagerly anticipate the possibility of contributing to these enhancements and look forward to your feedback and guidance on how to proceed.

Thank you! CMobley7

ishaan-jaff commented 10 months ago

@CMobley7 I'm the maintainer of litellm - we do support embeddings. Was there a specific model / provider LiteLLM is missing? Would love to add it for you

CMobley7 commented 10 months ago

@ishaan-jaff , thank you for your swift reply. I must admit I may have misunderstood litellm's current embedding support. Upon reviewing the code at this location, I initially assumed that embedding support was not fully implemented for most LLMs within litellm. However, it appears that OpenAI embeddings are handled in other sections of the codebase.

I was hoping to expand support within quivr to include HuggingFace models, either through an endpoint or a local setup. Currently, the model that stands out as a prime candidate for integration is BAAI/bge-large-en, as it is the most performant at retrieval. I look forward to your thoughts on this.

Thank you, CMobley7

ishaan-jaff commented 10 months ago

will have this ready for you in the next 6 hours😊

CMobley7 commented 10 months ago

Thanks, I really appreciate this!

ishaan-jaff commented 10 months ago

@CMobley7 investigated this; it looks like langchain has support for BAAI/bge-large-en:

from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": True}  # normalize so similarity scores are cosine similarities

model_norm = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
StanGirard commented 10 months ago

Hey @CMobley7 !

That is awesome! Let's start with 2. Integration with Vector Stores.

We are almost there and need a few changes.

The first goal is to be compatible with another PostgreSQL provider. Currently only Supabase is supported, so we created a folder containing all the functions that need to be implemented to support a new provider.

We have not finished this yet but would love some help.

Your first mission, should you accept it, is to make it work with a local PostgreSQL instance powered by pgvector.

Once this is working, we can then modify the calls to also store the embeddings in a vector store like Pinecone, so we would end up with both PostgreSQL and Pinecone.
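For context on the local PostgreSQL + pgvector mission, the backend would roughly need SQL along these lines. This is a hedged sketch: the table and column names are illustrative, not Quivr's actual schema, and the helper functions are hypothetical; only the pgvector operators (`vector(n)`, `ivfflat`, `<=>` for cosine distance) are real.

```python
def pgvector_setup_sql(table: str = "vectors", dim: int = 1536) -> list[str]:
    """DDL a local Postgres+pgvector backend would need (illustrative schema)."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        f"id uuid PRIMARY KEY, content text, embedding vector({dim}));",
        # ivfflat index for cosine distance; pgvector also supports L2 (<->)
        # and inner product (<#>) operator classes
        f"CREATE INDEX ON {table} USING ivfflat (embedding vector_cosine_ops);",
    ]


def pgvector_knn_sql(table: str = "vectors", k: int = 5) -> str:
    """Nearest-neighbour lookup by cosine distance (pgvector's <=> operator)."""
    return (
        f"SELECT id, content FROM {table} "
        f"ORDER BY embedding <=> %(query_embedding)s LIMIT {k};"
    )
```

These statements would sit behind the provider interface in backend/repository, so the Supabase path and a local pgvector path could share the same calling code.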

StanGirard commented 10 months ago

Here are all the functions that are currently tied to supabase.

They're in backend/repository

Also I'm currently on vacation but we could talk on discord if needed or you can ask questions to @gozineb.

CMobley7 commented 10 months ago

@StanGirard , I'll join your discord and get started on that feature this weekend!

github-actions[bot] commented 9 months ago

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.