assafelovic / gpt-researcher

GPT based autonomous agent that does online comprehensive research on any given topic
https://gptr.dev
MIT License
12.98k stars 1.61k forks

Support for different Embeddings other than OpenAI #629

Open mmuyakwa opened 6 days ago

mmuyakwa commented 6 days ago

Since OpenAI chose to no longer accept prepaid credit cards (this is a Microsoft thing...), I cannot use the OpenAI API anymore.

LangChain is used for the embeddings automatically, and it defaults to OpenAI's.

I am providing gpt-researcher the OPENROUTER API key in order to access GPT-4o from there. Unfortunately, OpenRouter offers no embeddings.

I would prefer to use Groq as the LLM, but there, too, it seems to fall back to OpenAI's embeddings.

I tried to write a script that uses "nomic-ai/nomic-embed-text-v1" instead, but it does not work:

import os
import asyncio
from dotenv import load_dotenv
from gpt_researcher import GPTResearcher
from langchain_community.embeddings import HuggingFaceEmbeddings
from gpt_researcher.memory.embeddings import Memory
from gpt_researcher.config import Config # Tested. Not sure if this actually works

load_dotenv(override=True)

# Setting the vars
os.environ["USER_AGENT"] = os.getenv("USER_AGENT", "My-Test-APP")
os.environ["LLM_PROVIDER"] = os.getenv("LLM_PROVIDER", "groq")
os.environ["OPENAI_BASE_URL"] = os.getenv("GROQ_BASE_URL", "https://api.groq.com/openai/v1")  # default avoids None if the var is unset
os.environ["FAST_LLM_MODEL"] = os.getenv("FAST_LLM_MODEL", "mixtral-8x7b-32768")
os.environ["SMART_LLM_MODEL"] = os.getenv("SMART_LLM_MODEL", "mixtral-8x7b-32768")
os.environ["TEMPERATURE"] = os.getenv("TEMPERATURE", "0.55")
os.environ["EMBEDDING_PROVIDER"] = "custom"

# Overwriting the Memory-Class
class CustomMemory(Memory):
    def __init__(self, embedding_provider):
        if embedding_provider == "nomic":
            self.embedding_model = HuggingFaceEmbeddings(
                model_name="nomic-ai/nomic-embed-text-v1",
                model_kwargs={'trust_remote_code': True}
            )
        else:
            super().__init__(embedding_provider)

    def embed_documents(self, documents):
        if hasattr(self, 'embedding_model'):
            return self.embedding_model.embed_documents(documents)
        return super().embed_documents(documents)

# Overwriting the Memory class in gpt_researcher
# NOTE: this only rebinds the module attribute; any module that already did
# `from gpt_researcher.memory.embeddings import Memory` keeps the original class
import gpt_researcher.memory.embeddings
gpt_researcher.memory.embeddings.Memory = CustomMemory

# Overwriting the Config-Class
class CustomConfig(Config):
    def __init__(self):
        super().__init__()
        self.embedding_provider = "nomic"

# Overwriting the Config class in gpt_researcher (again, only rebinds the module attribute)
import gpt_researcher.config
gpt_researcher.config.Config = CustomConfig

async def get_report(query: str, report_type: str) -> str:
    researcher = GPTResearcher(query, report_type)
    research_result = await researcher.conduct_research()
    report = await researcher.write_report()
    return report

if __name__ == "__main__":
    query = "what team may win the NBA finals?"
    report_type = "research_report"

    report = asyncio.run(get_report(query, report_type))
    if report:
        print(report)
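My guess as to why the monkey-patching above has no effect: by the time the script rebinds `gpt_researcher.memory.embeddings.Memory`, modules inside gpt-researcher have presumably already executed `from ... import Memory` at import time and still hold a reference to the original class. A minimal stdlib-only sketch of that binding behavior (toy module names, nothing gpt-researcher-specific):

```python
import sys
import types

# Build a toy module that stands in for gpt_researcher.memory.embeddings
# (hypothetical names, just to illustrate Python's import-time binding).
toy = types.ModuleType("toy_embeddings")
exec("class Memory:\n    provider = 'openai'\n", toy.__dict__)
sys.modules["toy_embeddings"] = toy

# This mirrors what a library module does at import time:
from toy_embeddings import Memory  # binds the ORIGINAL class object

class CustomMemory(Memory):
    provider = "nomic"

# Rebinding the module attribute (what my script does) ...
toy.Memory = CustomMemory

# ... does not change names that were imported earlier:
print(Memory.provider)                                # still 'openai'
print(sys.modules["toy_embeddings"].Memory.provider)  # 'nomic'
```

If that is right, the patch would have to happen before `from gpt_researcher import GPTResearcher` runs, or (much better) the library would need a supported configuration option for the embedding provider.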

I would have liked to see more examples in the documentation.