Open acere opened 1 month ago
@acere Thanks for bringing it up. Just to clarify,
Thank you!
From my perspective, I'd like fmeval to help more with profiling model latency and cost, so the most interesting metrics to store would be `usage.input_tokens` and `usage.output_tokens` (other Bedrock models also provide these counts, but under different keys of the response).
It's important to consider and compare model quality in the context of cost-to-run and response latency when making selection decisions. Although these factors are workload-sensitive, fmeval is at least running a dataset of representative examples through the model at speed: so while it's no substitute for a dedicated performance test, it could give a very useful initial indication of the trade-offs between output quality and speed/cost.
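To illustrate, here is a minimal sketch of pulling those counts out of a response body. The sample payload shape follows Anthropic Claude's Messages API format on Bedrock (the `usage` key with `input_tokens`/`output_tokens`); the `extract_token_usage` helper is hypothetical, not part of fmeval:

```python
import json

# Hypothetical sample of a Claude-on-Bedrock response body; other Bedrock
# models report token counts under different keys, so a per-model mapping
# would be needed in practice.
sample_body = json.dumps({
    "content": [{"type": "text", "text": "Hello!"}],
    "usage": {"input_tokens": 12, "output_tokens": 5},
})

def extract_token_usage(body: str) -> dict:
    """Pull input/output token counts from a Claude-style response body."""
    payload = json.loads(body)
    usage = payload.get("usage", {})
    return {
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
    }

print(extract_token_usage(sample_body))
# {'input_tokens': 12, 'output_tokens': 5}
```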
Thanks for your feedback! We will add it to our roadmap and prioritize it.
Exactly as @athewsey indicates.
Technical metrics such as latency, time to first token, and tokens per second (the last two would require using a streaming interface for models that support it) are usually also part of a model evaluation. Some of these metrics are already provided in the response from the service, for example Anthropic Claude on Amazon Bedrock returns the number of input and output tokens in `usage`, while others can be inferred by timing the request and response.
It would be useful to collect system metrics, e.g. latency, during the evaluation and to provide a summary in the evaluation output.
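As a rough sketch of the "infer by timing the request" idea: wrap each model invocation in a timer and derive latency and tokens per second from it. The `invoke_model_stub` below stands in for a real Bedrock call (it is a hypothetical placeholder, not an fmeval or boto3 API), and the timing approach would apply the same way around the real request:

```python
import time

def invoke_model_stub(prompt: str) -> dict:
    # Stand-in for a real model invocation; simulates some work and
    # returns a token count, as a Claude-style response would.
    time.sleep(0.01)
    return {"completion": "...", "output_tokens": 5}

def timed_invoke(prompt: str) -> dict:
    """Time one invocation and derive simple system metrics from it."""
    start = time.perf_counter()
    response = invoke_model_stub(prompt)
    latency_s = time.perf_counter() - start
    tokens = response["output_tokens"]
    return {
        "latency_s": latency_s,
        "output_tokens": tokens,
        "tokens_per_second": tokens / latency_s,
    }

metrics = timed_invoke("Summarize this document.")
print(metrics)
```

Aggregating these per-record metrics (e.g. mean and p90 latency) over the evaluation dataset would give the kind of summary described above; time to first token would additionally need the streaming interface.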