Benchmark for WhisperAX & CLI

It would be great to start collecting reproducible performance benchmarks for supported hardware (e.g. A14+ and M1+). This should be a self-contained function that uses openai/whisper-base by default and, optionally, other versions that the benchmark submitter selects. Benchmarks should run on a standard set of audio files, and reports should be in a digestible and shareable format.

Pseudo-code may look like this:

1. Detect the current hardware and load the models that the user has chosen to benchmark (single, multiple, or all available models).
2. Download standard audio files from Hugging Face (jfk.wav for short-form; ted_60.wav and a sample clip from earnings22 for long-form transcriptions).
3. Generate the transcriptions over several iterations and tabulate runtime statistics.
   - Runs in streaming and file-based "offline" modes; this will require streaming emulation.
   - Completes the short-form bench and presents results before moving to the long-form bench, which can potentially take several minutes to complete.
   - Will want to track: time to first token, RTF (real-time factor), inference timings (for encoder and decoder), and total pipeline timings (model load -> transcription result).
4. Export these into a markdown table with relevant device info and the current commit hash, which can be posted to GitHub for public tracking.
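
To make the timing and export steps above more concrete, here is a rough Python sketch of the bookkeeping only. It is not WhisperKit code: `transcribe` is a hypothetical callable that yields tokens as they are decoded, and `RunResult`, `benchmark`, and `to_markdown` are illustrative names. RTF is taken here as processing time divided by audio duration; some reports use the inverse, so the final definition should be stated in the table.

```python
import statistics
import time
from dataclasses import dataclass


@dataclass
class RunResult:
    """Timings for one transcription run (all values in seconds)."""
    audio_seconds: float
    time_to_first_token: float
    total_seconds: float

    @property
    def rtf(self) -> float:
        # Assumed definition: processing time / audio duration (lower is faster).
        return self.total_seconds / self.audio_seconds


def benchmark(transcribe, audio_seconds, iterations=3):
    """Time a hypothetical token-yielding `transcribe()` over several runs."""
    results = []
    for _ in range(iterations):
        start = time.perf_counter()
        first = None
        for _token in transcribe():
            if first is None:
                # Record the latency of the first emitted token.
                first = time.perf_counter() - start
        total = time.perf_counter() - start
        # If nothing was emitted, fall back to the total time.
        ttft = first if first is not None else total
        results.append(RunResult(audio_seconds, ttft, total))
    return results


def to_markdown(model, device, commit, results):
    """Render the mean timings as a one-row markdown table."""
    ttft = statistics.mean(r.time_to_first_token for r in results)
    rtf = statistics.mean(r.rtf for r in results)
    return "\n".join([
        "| Model | Device | Commit | Runs | Mean TTFT (s) | Mean RTF |",
        "| --- | --- | --- | --- | --- | --- |",
        f"| {model} | {device} | {commit} | {len(results)} | {ttft:.3f} | {rtf:.3f} |",
    ])
```

Encoder and decoder timings and the model-load time could be added as extra fields on `RunResult` in the same way; they are left out here to keep the sketch short.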

References

- Open ASR leaderboard benchmarks: https://github.com/huggingface/open_asr_leaderboard
- Nice script for collecting environment info: https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py
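
In the spirit of the environment-collection script referenced above, a minimal sketch of gathering device info and the current commit hash for the report might look like the following. It uses only the Python standard library and the `git` command line; `collect_env` and its keys are illustrative names, and it reports generic host details rather than the specific Apple chip, which would need a platform-specific query.

```python
import platform
import subprocess


def collect_env(repo_dir="."):
    """Collect basic host info plus the short git commit hash of repo_dir."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not inside a git repository.
        commit = "unknown"
    return {
        "os": platform.system(),
        "os_version": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "commit": commit,
    }
```

The returned dictionary can be attached to each benchmark report so that results from different devices and commits stay comparable.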

argmaxinc / WhisperKit

Benchmark for WhisperAX & CLI #28
