explodinggradients / ragas


feat(benchmark): Implement flexible model selection in TestsetGenerator for improved customization and consistency #1042

Open donbr opened 4 months ago

donbr commented 4 months ago

Describe the Feature
Currently, the tests/benchmarks/benchmark_testsetgen.py script has hardcoded LLM models for generator_llm, critic_llm, and embeddings. Other Ragas entry points, by contrast, allow these defaults to be overridden when called from a Jupyter notebook or script.

Why is the feature important for you?

Current code in benchmark_testsetgen.py:

# hardcoding of values in tests/benchmarks/benchmark_testsetgen.py
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

The hardcoding in benchmark_testsetgen.py negates the ability to dynamically set global defaults for the generator, critic, and embedding models from a notebook or script that calls Ragas:

# setting of defaults in Jupyter Notebook or script calling Ragas
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
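
For context, the notebook flow that sets and consumes these defaults looks roughly like the sketch below. The generate_with_langchain_docs call, the evolutions import, the documents variable, and the test_size / distributions values are illustrative assumptions, not code taken from the benchmark script.

# sketch of the notebook flow built on the defaults above (API details are assumptions)
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# documents: a list of LangChain Document objects loaded earlier in the notebook
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)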

Additional context
Modify the benchmark_testsetgen.py script to mirror src/ragas/testset/generator.py and accept optional parameters, so the defaults for generator_llm, critic_llm, and embedding_model can be overridden consistently. This would let users set global, consistent defaults when calling Ragas from a script or notebook:

# updates to benchmark_testsetgen.py
# imports shown for completeness (langchain-openai and ragas packages)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset.generator import TestsetGenerator


def initialize_testset_generator(
    generator_llm="gpt-3.5-turbo-16k",
    critic_llm="gpt-4",
    embeddings="text-embedding-ada-002",
):
    # wrap the model names in the same client classes the script already uses
    generator_llm = ChatOpenAI(model=generator_llm)
    critic_llm = ChatOpenAI(model=critic_llm)
    embeddings = OpenAIEmbeddings(model=embeddings)

    return TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# Usage
generator = initialize_testset_generator()

I was surprised, when running the Ragas scripts from a notebook, that they ignored my GPT-4o setting for critic_llm and used the older, much more expensive base GPT-4 model instead. There are a number of better variants on the approach above, but this should be sufficient.

The essential requirement is consistency and transparency of models used during specific steps of the process.
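
As a concrete sketch, building on the initialize_testset_generator helper proposed above (the model names below are just the ones from my notebook), a notebook call would then both use and surface the intended models:

# sketch: notebook-side call through the proposed helper; model names are examples
models = {
    "generator_llm": "gpt-3.5-turbo-0125",
    "critic_llm": "gpt-4o",
    "embeddings": "text-embedding-3-small",
}

generator = initialize_testset_generator(**models)
print(f"testset generator models: {models}")  # makes the models used at each step visible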

jjmachan commented 4 months ago

hey @donbr, I'm not sure I've fully understood the idea. We can add that quite easily to the benchmark, but I wasn't able to follow the use case you had in mind.

The tests/benchmarks directory is meant as a test suite for development, hence the hardcoded LLMs. What was the use case for calling it directly? Are you running the benchmarks yourself with different LLMs too?

In that case we can easily add the change you proposed! It would improve things, like you said.
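
Something along these lines would probably be enough. This is just a sketch: the environment variable names are hypothetical, and initialize_testset_generator is the helper proposed above, not existing code.

# sketch for benchmark_testsetgen.py: keep today's models as the defaults,
# but allow overrides without editing the file (env var names are hypothetical)
import os

generator = initialize_testset_generator(
    generator_llm=os.environ.get("RAGAS_BENCH_GENERATOR_LLM", "gpt-3.5-turbo-16k"),
    critic_llm=os.environ.get("RAGAS_BENCH_CRITIC_LLM", "gpt-4"),
    embeddings=os.environ.get("RAGAS_BENCH_EMBEDDINGS", "text-embedding-ada-002"),
)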