Hi, thanks for the great library!

I am hoping to use a large number of CPUs for audio transcription. However, I've been benchmarking faster-whisper and am getting nothing close to the CPU times posted in the README, which suggest a speedup of around 4.75x real-time. We are seeing more like 1x real-time, or worse, and I am wondering if you could help diagnose why.
I have a sample of around 1000 podcasts that I am transcribing on an Intel Xeon Gold 6230 CPU @ 2.10GHz. The podcasts are a mix of .mp3 files (80%) and .m4a files (20%). The full script I am using to transcribe is below.
I tried running transcription with 1, 4, and 8 threads. The behavior is pretty strange: with 4 threads, performance is particularly bad, while with 1 and 8 threads performance is effectively identical and approximately real-time (transcription time is roughly equal to audio duration, i.e. points fall along y=x).
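For concreteness, the speedup figures here are audio duration divided by wall-clock transcription time, so points on y=x correspond to 1x and the README's figure corresponds to about 4.75. A minimal helper to report this explicitly (the `speedup` function is just illustrative, not part of faster-whisper):

```python
from datetime import timedelta

def speedup(audio_duration_s: float, wall_time_s: float) -> float:
    """Real-time speedup: seconds of audio transcribed per second of
    wall-clock time. ~4.75 matches the README benchmark; 1.0 is y=x."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return audio_duration_s / wall_time_s

# Example: a 60-minute podcast transcribed in 12m38s of wall time.
wall = timedelta(minutes=12, seconds=38).total_seconds()
print(f"speedup: {speedup(3600.0, wall):.2f}x")  # prints "speedup: 4.75x"
```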
My questions are:
1. Am I doing the transcription wrong, or are there issues with this simple script?
2. How might you explain the difference between the performance here and the reported ~4.75x real-time on the example audio in the README?
Thank you again for your hard work on the library; it's a very valuable resource for the community!
Full transcription script:
```python
import argparse
import logging
import os
from datetime import datetime

from faster_whisper import WhisperModel

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

start = datetime.now()

parser = argparse.ArgumentParser()
parser.add_argument("--audio-file", required=True,
                    help="Path to the file to transcribe.")
parser.add_argument("--output-dir", default="transcriptions",
                    help="Path to write transcription files to.")
parser.add_argument("--num-threads", default=8, type=int,
                    help="Number of threads per worker.")
args = parser.parse_args()

model_size = "small"

assert os.path.exists(args.audio_file), f"file {args.audio_file} does not exist."
os.makedirs(args.output_dir, exist_ok=True)

model = WhisperModel(
    model_size,
    device="cpu",
    cpu_threads=args.num_threads,
    compute_type="default")

segments, info = model.transcribe(args.audio_file)
segments = list(segments)  # The transcription actually runs here (the generator is lazy).

for segment in segments:
    print(segment)

# Get the filename, stripped of its extension.
basename = os.path.basename(args.audio_file).rsplit(".", 1)[0]
outfile = os.path.join(args.output_dir, basename + ".txt")
print(f"writing transcription to {outfile}")
with open(outfile, "w") as f:
    for segment in segments:
        f.write("[%.2fs -> %.2fs] %s\n" % (segment.start, segment.end, segment.text))

print(f"execution completed in {datetime.now() - start}")
```
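Separately, since the eventual goal is many CPUs over ~1000 podcasts, one pattern that may matter for the benchmark is throughput vs. per-file latency: several independent workers each using few threads often beat one process with many threads. A hedged sketch of that fan-out pattern; `transcribe_one` here is any per-file callable you would supply (it is not a faster-whisper API):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def transcribe_all(paths, transcribe_one, max_workers=4):
    """Run `transcribe_one` on each path in a separate worker process,
    returning {path: result}. Keep max_workers * per-worker threads at or
    below the physical core count to avoid oversubscription."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(transcribe_one, paths)
        return dict(zip(paths, results))
```

In this setup each worker process would construct its own WhisperModel with a small `cpu_threads`; whether that beats a single many-thread process on this hardware is exactly what the benchmark could measure.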