SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Cannot reproduce CPU benchmark numbers posted in README; 6x slower than posted time #534

Open jpgard opened 1 year ago

jpgard commented 1 year ago

I'm attempting to reproduce the benchmark numbers listed on the README, using the same audio.

The README indicates that I should be able to transcribe an MP3 file of the audio from this video using the small model, with fp32 and beam size 5, in around 2m44s (164s). However, when I transcribe that audio (using the script below, and the mp3 file I extracted from the video here) it takes 1037s, around 6x slower.

It's hard to know the exact details of how that benchmark was computed, though, because there is no script or audio file provided to reproduce it. I'm also not sure if any other hyperparameters/configurations were changed to achieve that result. However, it is concerning that I'm not able to reproduce anything close to this number, despite trying it on two different CPUs (2.3 GHz Dual-Core Intel Core i5, Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz).

Thank you!

import argparse
import logging
from datetime import datetime

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
import os

from faster_whisper import WhisperModel

start = datetime.now()
parser = argparse.ArgumentParser()
parser.add_argument("--audio-file", required=True,
                    help="Path to the file to transcribe.")
parser.add_argument("--output-dir", default="transcriptions",
                    help="path to write transcription files to.")
parser.add_argument("--cpu-threads", default=8, type=int,
                    help="Number of threads to use when running on CPU. "
                         "A non-zero value overrides the OMP_NUM_THREADS environment variable.")
parser.add_argument("--num-workers", default=8, type=int,
                    help="When transcribe() is called from multiple Python threads, "
                         "having multiple workers enables true parallelism when running the model "
                         "(concurrent calls to self.model.generate() will run in parallel).")
parser.add_argument("--beam-size", default=5, type=int,
                    help="Beam size to use for decoding.")
parser.add_argument("--without-timestamps", default=False, action="store_true",
                    help="Only sample text tokens.")
parser.add_argument("--device", choices=["cpu", "cuda"],
                    required=True,
                    help="device to use")
parser.add_argument("--model-size", default="small",
                    choices=("tiny", "tiny.en", "base", "base.en",
                             "small", "small.en", "medium", "medium.en",
                             "large-v1", "large-v2", "large"))
args = parser.parse_args()

model = WhisperModel(
    args.model_size,
    device=args.device,
    cpu_threads=args.cpu_threads,
    num_workers=args.num_workers,
    compute_type="default")

segments, info = model.transcribe(
    args.audio_file,
    beam_size=args.beam_size,                    # pass the parsed decoding options through
    without_timestamps=args.without_timestamps,
)
segments = list(segments)  # The transcription actually runs here (the generator is lazy).

# Write the transcription to --output-dir.
os.makedirs(args.output_dir, exist_ok=True)
out_path = os.path.join(args.output_dir,
                        os.path.splitext(os.path.basename(args.audio_file))[0] + ".txt")
with open(out_path, "w") as f:
    for segment in segments:
        print(segment.text)
        f.write(segment.text + "\n")

duration = (datetime.now() - start).total_seconds()
print(f"execution completed in {duration}s")
jpgard commented 1 year ago

Here are some more numbers for this same audio file:

(8 threads, 1 worker, 5 beams): 1037.374998s
(4 threads, 1 worker, 5 beams): 768.986466s
(8 threads, 8 workers, 1 beam): 742.796928s
(4 threads, 1 worker, 1 beam): 704.093647s
(4 threads, 4 workers, 1 beam): 665.327561s
(4 threads, 1 worker, 1 beam, 500ms vad filter): 647.515117s
(1 thread, 1 worker, 5 beams): 787.795953s
(1 thread, 4 workers, 5 beams): 846.496325s
(1 thread, 4 workers, 1 beam): 797.238037s
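For context, these wall-clock times can be converted into realtime factors. This sketch assumes the clip is about 13 minutes (780 s) of audio, the duration the README benchmark audio is described as later in this thread; that duration is an assumption here, not something stated above:

```python
# Convert a few of the wall-clock timings above into realtime factors.
# AUDIO_SECONDS is an assumption: the clip is taken to be ~13 minutes long.
AUDIO_SECONDS = 13 * 60

runs = {
    "8 threads, 1 worker, 5 beams": 1037.37,
    "4 threads, 1 worker, 5 beams": 768.99,
    "4 threads, 4 workers, 1 beam": 665.33,
}

for config, wall_seconds in runs.items():
    rtf = AUDIO_SECONDS / wall_seconds  # > 1 means faster than realtime
    print(f"{config}: {rtf:.2f}x realtime")
```

On that assumption, the best configuration here sits just above realtime, well short of the roughly 6.5x realtime the README figure implies.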

rjwilmsi commented 1 year ago

You mention two CPUs: a dual-core i5 and a 10-core Xeon. Which one are these results from? A 2.3 GHz dual-core i5 is presumably a laptop CPU along the lines of an Intel i5-2520M, which has roughly 12x lower multi-threaded performance than your Xeon 6230, so I would expect vastly different numbers from those two machines.

Have you got the models pre-downloaded, so that you aren't measuring download time?

The example video is in French; I only use the English models, so I can't compare directly. The README's headline CPU result is 13 minutes of audio transcribed in about 2 minutes, i.e. roughly 6.5x realtime, for the Xeon with int8 and beam size 5.

Some comparison benchmarks from my setup, using the small.en model, int8, beam size 5, on this English YouTube video: https://www.youtube.com/watch?v=GFu64hnqzVo

Ryzen 5600G, 4 threads: 54s (about 7.2x realtime)
Ryzen 5600G, 2 threads: 74s
Ryzen 5600G, 1 thread: 118s
Ryzen 4500U, 4 threads: 78s
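The 5600G timings above also give a quick read on how well transcription scales with thread count; a few lines of arithmetic (figures taken directly from the timings quoted above) make the scaling explicit:

```python
# Thread-scaling check for the Ryzen 5600G timings quoted above.
timings = {1: 118.0, 2: 74.0, 4: 54.0}  # threads -> wall-clock seconds

baseline = timings[1]
for threads, seconds in timings.items():
    speedup = baseline / seconds
    efficiency = speedup / threads  # 1.0 would be perfect linear scaling
    print(f"{threads} thread(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

The scaling is clearly sub-linear, which is the usual pattern for CPU inference: adding threads helps, but with diminishing returns.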

So yes, broadly I can reconcile the README's CPU benchmark with my own numbers.