explodinggradients / ragas


feat(benchmark): Implement flexible model selection in TestsetGenerator for improved customization and consistency #1042

Open donbr opened 4 months ago

donbr commented 4 months ago

Describe the Feature
Currently, the tests/benchmarks/benchmark_testsetgen.py script has hardcoded LLM models for generator_llm, critic_llm, and embeddings. Other Ragas entry points, by contrast, allow these defaults to be overridden when called from a Jupyter notebook or script.

Why is the feature important for you?

Current code in benchmark_testsetgen.py:

# hardcoding of values in tests/benchmarks/benchmark_testsetgen.py
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

The hardcoding in benchmark_testsetgen.py negates the ability to dynamically set global defaults for the generator, critic, and embedding models from a notebook or script that calls Ragas:

# setting of defaults in Jupyter Notebook or script calling Ragas
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
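
For context, the notebook flow that sets and consumes these defaults looks roughly like the sketch below. The generate_with_langchain_docs call, the evolutions import, the documents variable, and the test_size / distributions values are illustrative assumptions, not code taken from the benchmark script.

# sketch of the notebook flow built on the defaults above (API details are assumptions)
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# documents: a list of LangChain Document objects loaded earlier in the notebook
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)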

Additional context
Modify the benchmark_testsetgen.py script to mirror src/ragas/testset/generator.py and accept optional parameters, so the defaults for generator_llm, critic_llm, and embedding_model can be overridden consistently. This would let users set global, consistent defaults when calling Ragas from a script or notebook:

# updates to benchmark_testsetgen.py
# imports shown for completeness (langchain-openai and ragas packages)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset.generator import TestsetGenerator


def initialize_testset_generator(
    generator_llm="gpt-3.5-turbo-16k",
    critic_llm="gpt-4",
    embeddings="text-embedding-ada-002",
):
    # wrap the model names in the same client classes the script already uses
    generator_llm = ChatOpenAI(model=generator_llm)
    critic_llm = ChatOpenAI(model=critic_llm)
    embeddings = OpenAIEmbeddings(model=embeddings)

    return TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# Usage
generator = initialize_testset_generator()

I was surprised, when running the Ragas scripts from a notebook, that they ignored my GPT-4o setting for critic_llm and used the older, much more expensive base GPT-4 model instead. There are a number of better variants on the approach above, but this should be sufficient.

The essential requirement is consistency and transparency of models used during specific steps of the process.
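
As a concrete sketch, building on the initialize_testset_generator helper proposed above (the model names below are just the ones from my notebook), a notebook call would then both use and surface the intended models:

# sketch: notebook-side call through the proposed helper; model names are examples
models = {
    "generator_llm": "gpt-3.5-turbo-0125",
    "critic_llm": "gpt-4o",
    "embeddings": "text-embedding-3-small",
}

generator = initialize_testset_generator(**models)
print(f"testset generator models: {models}")  # makes the models used at each step visible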

jjmachan commented 4 months ago

hey @donbr, I'm not sure I've fully understood the idea. We can add that quite easily to the benchmark, but I wasn't able to follow the use case you had in mind.

The tests/benchmarks directory is meant as a test suite for development, hence the hardcoded LLMs. What was the use case for calling it directly? Are you running the benchmarks yourself with different LLMs too?

In that case we can easily add the change you proposed! It would improve things, like you said.
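
Something along these lines would probably be enough. This is just a sketch: the environment variable names are hypothetical, and initialize_testset_generator is the helper proposed above, not existing code.

# sketch for benchmark_testsetgen.py: keep today's models as the defaults,
# but allow overrides without editing the file (env var names are hypothetical)
import os

generator = initialize_testset_generator(
    generator_llm=os.environ.get("RAGAS_BENCH_GENERATOR_LLM", "gpt-3.5-turbo-16k"),
    critic_llm=os.environ.get("RAGAS_BENCH_CRITIC_LLM", "gpt-4"),
    embeddings=os.environ.get("RAGAS_BENCH_EMBEDDINGS", "text-embedding-ada-002"),
)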