Feature Request: Add a tokens-per-second evaluation with local Large Language Models

As recent Threadripper videos have referenced, there are some chips and GPUs that are targeted at AI engineers.

It would be nice to have a standardized test harness for evaluating hardware for LLM usage.
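
To make the request more concrete, here is a minimal sketch of the kind of measurement I have in mind. It is not tied to any particular runtime: `generate` is a hypothetical callable that streams tokens for a prompt (it would wrap whichever local inference backend the harness ends up using), and the metric names are my own assumptions.

```python
import time
from typing import Callable, Iterable


def measure_tokens_per_second(
    generate: Callable[[str], Iterable[str]],
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a streaming generation and report simple throughput numbers.

    `generate` is a hypothetical stand-in for a local LLM backend: it takes
    a prompt and yields tokens one at a time. `clock` is injectable so the
    timing logic can be checked without a real model.
    """
    start = clock()
    first_token_at = None
    count = 0

    for _ in generate(prompt):
        now = clock()
        if first_token_at is None:
            # The first token arriving marks the end of prompt processing.
            first_token_at = now
        count += 1

    end = clock()
    total = end - start

    return {
        "tokens": count,
        "total_seconds": total,
        "time_to_first_token": (
            first_token_at - start if first_token_at is not None else None
        ),
        "tokens_per_second": count / total if total > 0 else 0.0,
    }
```

A real harness would presumably also need to fix the model, quantization, prompt, and output length so that results are comparable across hardware.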

Here's a bit of background on the goals and objectives of evaluating an LLM: https://www.baseten.co/blog/llm-transformer-inference-guide/

I will note that this is a complex and ever-evolving field, and I fully understand if this is closed without comment.

I also acknowledge this may not be something you want to be wholly responsible for, and that you might instead prefer to call out to a service that does this.

LTTLabsOSS / markbench-tests