Hi, thanks for the great library!

I am hoping to use a large number of CPUs for audio transcription. However, I've been benchmarking faster-whisper and am getting nothing close to the CPU times posted in the README, which suggest a speedup of around 4.75x real-time. We are seeing more like 1x real-time, or worse, and I am wondering if you could help diagnose why.
I have a sample of around 1000 podcasts that I am transcribing on an Intel Xeon Gold 6230 CPU @ 2.10GHz. The podcasts are a mix of .mp3 files (80%) and .m4a files (20%). The full script I am using to transcribe is below.
I tried running transcription with 1, 4, and 8 threads. The behavior is pretty strange: with 4 threads, performance is particularly bad, while with 1 and 8 threads performance is effectively identical and approximately real-time (transcription time is roughly equal to audio duration, i.e. points fall along y=x).
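For concreteness, the speedup figures here are audio duration divided by wall-clock transcription time, so points on y=x correspond to 1x and the README's figure corresponds to about 4.75. A minimal helper to report this explicitly (the `speedup` function is just illustrative, not part of faster-whisper):

```python
from datetime import timedelta

def speedup(audio_duration_s: float, wall_time_s: float) -> float:
    """Real-time speedup: seconds of audio transcribed per second of
    wall-clock time. ~4.75 matches the README benchmark; 1.0 is y=x."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return audio_duration_s / wall_time_s

# Example: a 60-minute podcast transcribed in 12m38s of wall time.
wall = timedelta(minutes=12, seconds=38).total_seconds()
print(f"speedup: {speedup(3600.0, wall):.2f}x")  # prints "speedup: 4.75x"
```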
My questions are:
1. Am I doing the transcription wrong, or are there issues with this simple script?
2. How might you explain the difference between the performance here and the reported ~4.75x real-time on the example audio in the README?
Thank you again for your hard work on the library; it's a very valuable resource for the community!
Full transcription script:
```python
import argparse
import logging
import os
from datetime import datetime

from faster_whisper import WhisperModel

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

start = datetime.now()

parser = argparse.ArgumentParser()
parser.add_argument("--audio-file", required=True,
                    help="Path to the file to transcribe.")
parser.add_argument("--output-dir", default="transcriptions",
                    help="Path to write transcription files to.")
parser.add_argument("--num-threads", default=8, type=int,
                    help="Number of threads per worker.")
args = parser.parse_args()

model_size = "small"

assert os.path.exists(args.audio_file), f"file {args.audio_file} does not exist."
os.makedirs(args.output_dir, exist_ok=True)

model = WhisperModel(
    model_size,
    device="cpu",
    cpu_threads=args.num_threads,
    compute_type="default")

segments, info = model.transcribe(args.audio_file)
segments = list(segments)  # The transcription actually runs here (the generator is lazy).

for segment in segments:
    print(segment)

# Get the filename, stripped of its extension.
basename = os.path.basename(args.audio_file).rsplit(".", 1)[0]
outfile = os.path.join(args.output_dir, basename + ".txt")
print(f"writing transcription to {outfile}")
with open(outfile, "w") as f:
    for segment in segments:
        f.write("[%.2fs -> %.2fs] %s\n" % (segment.start, segment.end, segment.text))

print(f"execution completed in {datetime.now() - start}")
```
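Separately, since the eventual goal is many CPUs over ~1000 podcasts, one pattern that may matter for the benchmark is throughput vs. per-file latency: several independent workers each using few threads often beat one process with many threads. A hedged sketch of that fan-out pattern; `transcribe_one` here is any per-file callable you would supply (it is not a faster-whisper API):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def transcribe_all(paths, transcribe_one, max_workers=4):
    """Run `transcribe_one` on each path in a separate worker process,
    returning {path: result}. Keep max_workers * per-worker threads at or
    below the physical core count to avoid oversubscription."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(transcribe_one, paths)
        return dict(zip(paths, results))
```

In this setup each worker process would construct its own WhisperModel with a small `cpu_threads`; whether that beats a single many-thread process on this hardware is exactly what the benchmark could measure.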