AndresPrez opened this issue 9 months ago
Hey @AndresPrez, what do you mean by evaluators? (Are you talking about the llamaindex integration?) Do you have some code to clarify what you mean? Thanks.
I updated the description, but by evaluators I meant the LLMs that get called. Some LLM SDKs, like Azure's, support batch requests. Let me add some reference links to the description.
@AndresPrez Can you let me know out of the three methods to measure a metric, how are you currently doing it? https://docs.confident-ai.com/docs/metrics-introduction#measuring-a-metric
@penguine-ip I'm actually trying out all of them, and for the custom LLM I'm inheriting deepeval's DeepEvalBaseLLM class and implementing it with Google Vertex AI models.
Now, for example, the OpenAI SDK supports async completions (I added references to the description). So it would be awesome to extend deepeval's interface to support async measuring; that way we can run metric measuring on many test cases in "parallel". In addition, these SDKs also support batching requests (i.e., sending multiple prompts in a single request), so a further increase in parallelization can be achieved by leveraging that.
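For illustration, this is roughly what the async usage referenced above looks like with the OpenAI Python SDK; the model name and prompts here are placeholders:

```python
# Several completion requests issued concurrently as coroutines via the
# OpenAI SDK's async client (model and prompts are placeholders).
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def complete(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = ["prompt A", "prompt B", "prompt C"]
    # All requests run concurrently on a single event loop, no extra threads.
    print(await asyncio.gather(*(complete(p) for p in prompts)))

asyncio.run(main())
```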
@AndresPrez Does the OpenAI async client not use threads? I'm thinking you will run into the same threading problem. We have parallelization on a dataset level (you can evaluate multiple test cases in your dataset at once), but not on the test case level (you CAN'T evaluate multiple metrics on a test case at once).
As for your suggestion to allow multiple test cases on one metric, for now we won't be supporting it, mainly because it is an anti-pattern in deepeval. The purpose of measuring one test case per metric (e.g. metric.measure(test_case)) is to allow users to build their own evaluation pipelines.
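A minimal sketch of such a pipeline, assuming the documented LLMTestCase and AnswerRelevancyMetric APIs (the test case contents are placeholders):

```python
# One metric, measured one test case at a time inside a user-owned loop.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(threshold=0.7)
test_cases = [
    LLMTestCase(input="What is deepeval?", actual_output="An LLM evaluation framework."),
    LLMTestCase(input="Does it support custom models?", actual_output="Yes, via DeepEvalBaseLLM."),
]

for test_case in test_cases:
    metric.measure(test_case)  # one test case per call
    print(test_case.input, metric.score, metric.reason)
```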
If you're looking to measure metric 'A' on test cases 'X', 'Y', 'Z', I would recommend doing this: https://docs.confident-ai.com/docs/evaluation-introduction#parallelization
We will be supporting async on assert_test and evaluate next week (so that you CAN evaluate multiple metrics on a test case at once).
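For reference, a sketch of the dataset-level parallelization the linked docs describe, assuming the assert_test pytest pattern and the deepeval test run -n flag; the test case contents are placeholders:

```python
# test_metric_a.py: metric 'A' asserted against test cases 'X', 'Y', 'Z'.
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(input="X", actual_output="..."),
    LLMTestCase(input="Y", actual_output="..."),
    LLMTestCase(input="Z", actual_output="..."),
]

@pytest.mark.parametrize("test_case", test_cases)
def test_metric_a(test_case: LLMTestCase):
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

# Then run the test cases in parallel processes:
#   deepeval test run test_metric_a.py -n 3
```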
Interesting question on threads: I'm not 100% sure that OpenAI uses threads behind the scenes. I believe that they, and Vertex AI as well, may be using an async client built on libraries such as httpx, which leverages HTTP connection pooling and creates a coroutine for each request, rather than a separate thread per request, so multiple coroutines run on the same main thread.
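To illustrate the coroutine point (this is generic httpx usage, not deepeval or provider code; the URL is a placeholder):

```python
# Many concurrent HTTP requests on one thread: httpx's AsyncClient pools
# connections, and each request is a coroutine on the same event loop.
import asyncio
import httpx

async def fetch_all(urls: list[str]) -> list[int]:
    async with httpx.AsyncClient() as client:
        responses = await asyncio.gather(*(client.get(url) for url in urls))
        return [response.status_code for response in responses]

print(asyncio.run(fetch_all(["https://example.com"] * 10)))
```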
Anyways, I appreciate your quick responses and I'm looking forward to the async support ❤️. Is there an issue or PR tracking this async progress?
@AndresPrez no problem, there is nothing tracking it, but I see you joined Discord, I'll be updating there :)
> We will be supporting async on assert_test and evaluate next week (so that you CAN evaluate multiple metrics on a test case at once)
Do you intend to support the reverse use case? Can I evaluate multiple test cases on a single metric at once? The LLM can obviously handle batched or parallel requests; I got a 4x speedup in my tests with the snippet below.
```python
import asyncio

from deepeval import evaluate
from deepeval.metrics import BaseMetric


async def evaluate_llm_case(**kwargs):
    # Thin async wrapper around evaluate() so calls can be scheduled with asyncio.gather
    return evaluate(**kwargs)


async def assert_llm_cases(llm_test_cases, metrics: list[BaseMetric]):
    # Iterate over the batches of test cases
    test_results = []
    for test_cases in llm_test_cases:
        tasks = [
            evaluate_llm_case(
                test_cases=[llm_test_case],
                metrics=metrics,
                show_indicator=False,
                write_cache=False,
                print_results=False,
            )
            for llm_test_case in test_cases
        ]
        for result in await asyncio.gather(*tasks, return_exceptions=True):
            if not isinstance(result, BaseException):
                test_results.extend(result)
    return aggregate_metrics(test_results)  # aggregate_metrics is my own helper
```
The code was working a week ago, but now it breaks on a missing conversational_instance_id because of the TestRunManager() singleton.
Hey @Jimmy-Newtron will add support for it next week. In the meantime can you please send out the full error message with the conversational instance id? Thanks!
Usually these sorts of evaluations are made on large datasets of Q&A interactions. Deepeval's interface, however, is implemented in a way where calls to the LLM evaluator agents are made sequentially and synchronously.
Describe the solution you'd like
Deepeval's API interfaces could be extended to support async calls and/or batched test cases. For example, the BaseMetric.measure() function could be extended to be an async function and/or accept a list of LLMTestCase objects as input (see the sketch after the additional context below).

Describe alternatives you've considered
I've tried to "parallelize" these on my end using asyncio.to_thread(...), however there's a limit on the number of threads I can spawn efficiently.

Additional context
There's an increasing number of SDKs now supporting async and/or batch requests to LLM providers, such as OpenAI and Google Vertex AI.
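To make the requested extension concrete, here is a rough, hypothetical sketch; a_measure and measure_batch are illustrative names, not existing deepeval APIs:

```python
# Hypothetical extension of the metric interface (method names are illustrative only).
import asyncio

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class AsyncCapableMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Call the evaluator LLM through its provider's async client here.
        ...

    async def measure_batch(self, test_cases: list[LLMTestCase]) -> list[float]:
        # Either send one batched request to the provider, or fan out a_measure calls.
        return list(await asyncio.gather(*(self.a_measure(tc) for tc in test_cases)))
```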
References:
- OpenAI
  - Batching: https://platform.openai.com/docs/guides/production-best-practices/batching
  - Async: https://github.com/openai/openai-python?tab=readme-ov-file#async-usage
- Vertex AI
  - Batching:
  - Async: https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextGenerationModel#vertexai_language_models_TextGenerationModel_predict_async