Open acere opened 1 month ago
@acere Thanks for bringing it up. Just to clarify,
Thank you!
From my perspective, I'd like fmeval to help more with profiling model latency and cost, so the most interesting metrics to store would be `usage.input_tokens` and `usage.output_tokens` (other Bedrock models also provide these counts, but under different keys of the response).
It's important to consider and compare model quality in the context of cost-to-run and response latency when making selection decisions. Although these factors are workload-sensitive, fmeval is at least running a dataset of representative examples through the model at speed: so while it's no substitute for a dedicated performance test, it could give a very useful initial indication of the trade-offs between output quality and speed/cost.
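To illustrate, here is a minimal sketch of pulling those counts out of a response body. The sample payload shape follows Anthropic Claude's Messages API format on Bedrock (the `usage` key with `input_tokens`/`output_tokens`); the `extract_token_usage` helper is hypothetical, not part of fmeval:

```python
import json

# Hypothetical sample of a Claude-on-Bedrock response body; other Bedrock
# models report token counts under different keys, so a per-model mapping
# would be needed in practice.
sample_body = json.dumps({
    "content": [{"type": "text", "text": "Hello!"}],
    "usage": {"input_tokens": 12, "output_tokens": 5},
})

def extract_token_usage(body: str) -> dict:
    """Pull input/output token counts from a Claude-style response body."""
    payload = json.loads(body)
    usage = payload.get("usage", {})
    return {
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
    }

print(extract_token_usage(sample_body))
# {'input_tokens': 12, 'output_tokens': 5}
```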
Thanks for your feedback! We will add it to our roadmap and prioritize it.
Exactly as @athewsey indicates.
Technical metrics such as latency, time to first token, and tokens per second (the last two would require using a streaming interface for models that support it) are usually also part of a model evaluation. Some of these metrics are already provided in the response from the service, for example Anthropic Claude on Amazon Bedrock returns the number of input and output tokens in `usage`, while others can be inferred by timing the request and response.
It would be useful to collect system metrics, e.g. latency, during the evaluation and to provide a summary in the evaluation output.
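As a rough sketch of the "infer by timing the request" idea: wrap each model invocation in a timer and derive latency and tokens per second from it. The `invoke_model_stub` below stands in for a real Bedrock call (it is a hypothetical placeholder, not an fmeval or boto3 API), and the timing approach would apply the same way around the real request:

```python
import time

def invoke_model_stub(prompt: str) -> dict:
    # Stand-in for a real model invocation; simulates some work and
    # returns a token count, as a Claude-style response would.
    time.sleep(0.01)
    return {"completion": "...", "output_tokens": 5}

def timed_invoke(prompt: str) -> dict:
    """Time one invocation and derive simple system metrics from it."""
    start = time.perf_counter()
    response = invoke_model_stub(prompt)
    latency_s = time.perf_counter() - start
    tokens = response["output_tokens"]
    return {
        "latency_s": latency_s,
        "output_tokens": tokens,
        "tokens_per_second": tokens / latency_s,
    }

metrics = timed_invoke("Summarize this document.")
print(metrics)
```

Aggregating these per-record metrics (e.g. mean and p90 latency) over the evaluation dataset would give the kind of summary described above; time to first token would additionally need the streaming interface.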