confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Parallelization of Evaluations #500

Open AndresPrez opened 7 months ago

AndresPrez commented 7 months ago

Usually these sorts of evaluations are run on large datasets of Q&A interactions. Deepeval's interface, however, is implemented in a way that calls to the LLM evaluator agents are made sequentially and synchronously.

Describe the solution you'd like
Deepeval's API interfaces could be extended to support async calls and/or batched test cases. For example, the BaseMetric.measure() function could be extended to be an async function and/or to accept a list of LLMTestCase objects as input.
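
For illustration only, a hypothetical sketch of what such an extension could look like; a_measure and measure_batch are made-up names for this proposal, not existing deepeval APIs:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ProposedMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Hypothetical: await the evaluator LLM instead of blocking on it.
        ...

    def measure_batch(self, test_cases: list[LLMTestCase]) -> list[float]:
        # Hypothetical: send multiple prompts in one provider request.
        ...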

Describe alternatives you've considered
I've tried to "parallelize" these on my end using the asyncio.to_thread(...) function; however, there's a limit on the number of threads I can spawn efficiently. A rough sketch of that workaround is shown below.
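
A minimal sketch of the thread-based workaround just described; the metric and test-case names are placeholders, and asyncio.to_thread is bounded by the default thread pool size:

import asyncio

async def measure_all(metric, test_cases):
    # Offload each blocking measure() call to a worker thread; the default
    # executor caps how many actually run concurrently.
    tasks = [asyncio.to_thread(metric.measure, tc) for tc in test_cases]
    return await asyncio.gather(*tasks)

# scores = asyncio.run(measure_all(my_metric, my_test_cases))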

Additional context
An increasing number of SDKs now support async and/or batch requests to LLM providers, such as OpenAI and Google Vertex AI.

References:

OpenAI
  Batching: https://platform.openai.com/docs/guides/production-best-practices/batching
  Async: https://github.com/openai/openai-python?tab=readme-ov-file#async-usage

Vertex AI
  Batching:
  Async: https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextGenerationModel#vertexai_language_models_TextGenerationModel_predict_async

penguine-ip commented 7 months ago

Hey @AndresPrez, what do you mean by evaluators? (Are you talking about the llamaindex integration?)

Do you have some code to clarify what you mean? Thanks.

AndresPrez commented 7 months ago

I updated the description, but by evaluators I meant the LLMs that get called. Some LLM SDKs support batch requests, like Azure's. Let me complement the description with some reference links.

penguine-ip commented 7 months ago

@AndresPrez Out of the three methods for measuring a metric, can you let me know which one you are currently using? https://docs.confident-ai.com/docs/metrics-introduction#measuring-a-metric

AndresPrez commented 7 months ago

@penguine-ip I'm actually trying out all of them, and for the custom LLM option I'm inheriting from deepeval's DeepEvalBaseLLM class and implementing it with Google Vertex AI models.
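
For context, a minimal sketch of such a wrapper, assuming deepeval's documented custom-model interface (load_model / generate / a_generate / get_model_name) and Vertex AI's TextGenerationModel; import paths, method names, and model names may differ by version:

from deepeval.models import DeepEvalBaseLLM
from vertexai.language_models import TextGenerationModel

class VertexAILLM(DeepEvalBaseLLM):
    def __init__(self, model_name: str = "text-bison"):
        self.model_name = model_name

    def load_model(self):
        # Load the Vertex AI text generation model by name.
        return TextGenerationModel.from_pretrained(self.model_name)

    def generate(self, prompt: str) -> str:
        # Synchronous completion call.
        return self.load_model().predict(prompt).text

    async def a_generate(self, prompt: str) -> str:
        # Async completion via predict_async (see the Vertex AI link above).
        response = await self.load_model().predict_async(prompt)
        return response.text

    def get_model_name(self) -> str:
        return self.model_name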

Now, for example, the OpenAI SDK supports async completions (I added references to the description). So it would be awesome to extend deepeval's interface to support async measuring; that way we could run metric measuring on many test cases in "parallel". In addition, these SDKs also support batching requests (i.e., sending multiple prompts in a single request), so a further increase in parallelization could be achieved by leveraging that.
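
For reference, a hedged sketch of the async OpenAI usage linked in the description (openai-python's AsyncOpenAI client); the model name and prompts are placeholders:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def complete(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def complete_many(prompts: list[str]) -> list[str]:
    # Coroutines share one event loop and connection pool; no extra threads.
    return await asyncio.gather(*(complete(p) for p in prompts))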

penguine-ip commented 7 months ago

@AndresPrez Does the OpenAI async client not use threads? I'm thinking you will run into the same threading problem. We have parallelization at the dataset level (you can evaluate multiple test cases in your dataset at once), but not at the test case level (you CAN'T evaluate multiple metrics on a test case at once).

As for your suggestion to allow multiple test cases on one metric, for now we won't be supporting it, mainly because it is an anti-pattern in deepeval. The purpose of measuring one test case per metric (e.g., metric.measure(test_case)) is to allow users to build their own evaluation pipelines.

If you're looking to measure metric 'A' on test cases 'X', 'Y', and 'Z', I would recommend the approach sketched below: https://docs.confident-ai.com/docs/evaluation-introduction#parallelization
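
A rough sketch of that dataset-level pattern: one metric asserted over many test cases via pytest parametrization, which the deepeval CLI can then shard across processes as described in the linked docs. The metric choice and test-case contents here are placeholders, and exact constructor parameters may differ by version:

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(input="X", actual_output="..."),
    LLMTestCase(input="Y", actual_output="..."),
    LLMTestCase(input="Z", actual_output="..."),
]

@pytest.mark.parametrize("test_case", test_cases)
def test_metric_a(test_case: LLMTestCase):
    # Metric 'A' asserted on each test case in the dataset.
    assert_test(test_case, [AnswerRelevancyMetric()])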

We will be supporting async on assert_test and evaluate next week (so that you CAN evaluate multiple metrics on a test case at once).

AndresPrez commented 7 months ago

Interesting question about threads: I'm not 100% sure that OpenAI uses threads behind the scenes. I believe that they, and Vertex AI as well, may be using an async client built on a library such as httpx, which leverages HTTP connection pooling and creates a coroutine for each request rather than a separate thread, so multiple coroutines run on the same main thread.
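
To illustrate the coroutine model being described (not deepeval code): many requests multiplexed over one httpx connection pool on a single event loop, with no extra threads. The URL is a placeholder:

import asyncio
import httpx

async def fetch_all(urls: list[str]) -> list[int]:
    async with httpx.AsyncClient() as client:
        # Each get() is a coroutine on the same event loop, not a thread.
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    return [r.status_code for r in responses]

# asyncio.run(fetch_all(["https://example.com"] * 10))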

Anyway, I appreciate your quick responses and I'm looking forward to the async support ❤️. Is there an issue or PR tracking this async progress?

penguine-ip commented 7 months ago

@AndresPrez No problem, there is nothing tracking it, but I see you joined the Discord; I'll post updates there :)

Jimmy-Newtron commented 1 month ago

"We will be supporting async on assert_test and evaluate next week (so that you CAN evaluate multiple metrics on a test case at once)"

Do you intend to support the reverse use case?

Can I evaluate multiple test cases on a single metric at once? The LLM can clearly handle batched or parallel requests; I got a 4x speedup in my tests.

import asyncio

from deepeval import evaluate
from deepeval.metrics import BaseMetric


async def evaluate_llm_case(**kwargs):
    # Thin async wrapper around deepeval's evaluate().
    return evaluate(**kwargs)


async def assert_llm_cases(llm_test_cases, metrics: list[BaseMetric]):
    # llm_test_cases is a list of batches of LLMTestCase objects.
    test_results = []
    for test_cases in llm_test_cases:
        # One evaluate() call per test case, gathered concurrently.
        tasks = [
            evaluate_llm_case(
                test_cases=[llm_test_case],
                metrics=metrics,
                show_indicator=False,
                write_cache=False,
                print_results=False,
            )
            for llm_test_case in test_cases
        ]

        for result in await asyncio.gather(*tasks, return_exceptions=True):
            if not isinstance(result, BaseException):
                test_results.extend(result)

    # aggregate_metrics() is my own helper (not shown here).
    return aggregate_metrics(test_results)

The code was working a week ago, but it now breaks with a missing conversational_instance_id error due to the TestRunManager() singleton.

penguine-ip commented 1 month ago

Hey @Jimmy-Newtron, I will add support for it next week. In the meantime, can you please send the full error message with the conversational_instance_id? Thanks!