SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License
12.66k stars 1.06k forks

Update audio.py #1112

Open BBC-Esq opened 3 weeks ago

BBC-Esq commented 3 weeks ago

This re-introduces the FFmpeg binary, which is much faster than AV (which bundles FFmpeg), to decode audio. It will only use FFmpeg directly if it's available; otherwise, it'll use AV just like before. Thus, there should be no impact whatsoever for users who traditionally don't install FFmpeg separately to use faster-whisper, while people who have FFmpeg somewhere on their system (as many do) will enjoy a nice speedup.

BENCHMARK comparison

## Pydub Backend:

1. **File Opening and Initial Setup** took **13.520851** seconds. Entire audio file loaded into memory.
2. **Decoding and Resampling** took **4.008535** seconds. Resampling and channel conversion performed in-memory.
3. **Converting to Numpy Array** took **0.312081** seconds. Audio data converted to numpy array and normalized.

**convert_pydub** took **17.846182** seconds

## AV Backend:

1. **File Opening and Initial Setup** took **0.002802** seconds. File header opened and resampler created. Audio data not yet loaded.
2. **Decoding and Resampling** took **6.407928** seconds. Audio data read, decoded, and resampled in chunks.
3. **Converting to Numpy Array** took **2.538944** seconds. Processed audio frames converted to numpy array and normalized.

**convert_av** took **9.008320** seconds

## FFmpeg Backend:

1. **File Opening and Initial Setup** took **0.000850** seconds. Temporary WAV file created.
2. **Decoding and Resampling** took **1.736469** seconds. Input file converted to WAV format using FFmpeg.
3. **Converting to Numpy Array** took **0.320413** seconds. WAV file read, converted to numpy array, and normalized.

**convert_ffmpeg** took **2.072908** seconds

This benchmark was done using the Sam Altman podcast, which is over two hours long, on an RTX 4090 + 13900K. Obviously, the absolute seconds saved will be less for smaller files. That being said, if batch processing is used and/or large files are in fact processed, the relative speedup from using FFmpeg is stark.

Again, faster-whisper's reliance on AV would not change; there would simply be an option to use FFmpeg directly via subprocess.
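For illustration, the opt-in dispatch described above can be sketched roughly as follows. This is not the PR's actual diff; `decode_backend` is a hypothetical helper name, and the point is only that detection is a cheap PATH lookup with AV as the unconditional fallback:

```python
import shutil

def decode_backend() -> str:
    """Prefer a system ffmpeg binary when one is reachable on PATH;
    otherwise fall back to the bundled PyAV path, exactly as
    faster-whisper behaves today."""
    return "ffmpeg" if shutil.which("ffmpeg") else "av"
```

Users without FFmpeg installed never hit the new code path, which is why the change is invisible to them.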

Pydub, another backend, is shown for comparison only. Here's the benchmark script; just change the file name to test something different:

BENCH SCRIPT HERE

```python
import numpy as np
import time
import os
from pydub import AudioSegment
import av
import subprocess
import tempfile

# ANSI escape code for green text
GREEN = '\033[92m'
RESET = '\033[0m'

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{GREEN}{func.__name__} took {end - start:.6f} seconds{RESET}")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def time_step(self, step_name):
        start = time.perf_counter()
        return start, step_name

    def end_step(self, start, step_name, additional_info=""):
        end = time.perf_counter()
        print(f"{step_name} took {end - start:.6f} seconds. {additional_info}")

    @timeit
    def convert_pydub(self):
        """
        This method:
        1. Loads the entire audio file into memory.
           - "AudioSegment.from_file()" initially loads the entire file into memory
        2. Performs resampling and channel conversion in-memory.
        3. Converts the audio data to a numpy array.
        Note: The initial loading time includes file reading and decoding.
        """
        start, step = self.time_step("1. File Opening and Initial Setup")
        audio = AudioSegment.from_file(self.input_file)
        self.end_step(start, step, "Entire audio file loaded into memory.")

        start, step = self.time_step("2. Decoding and Resampling")
        audio = audio.set_frame_rate(16000).set_channels(1)
        self.end_step(start, step, "Resampling and channel conversion performed in-memory.")

        start, step = self.time_step("3. Converting to Numpy Array")
        result = np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "Audio data converted to numpy array and normalized.")
        return result

    @timeit
    def convert_av(self):
        """
        This method:
        1. Opens the audio file without loading it entirely into memory.
           - "container.decode(audio)" yields frames one at a time, allowing for true
             streaming processing without loading the entire file into memory
        2. Creates a resampler for the desired output format.
        3. Processes the audio in chunks, decoding and resampling each chunk.
        4. Concatenates the processed chunks into a numpy array.
        Note: The decoding and resampling step includes the actual reading and
        processing of the audio data.
        """
        start, step = self.time_step("1. File Opening and Initial Setup")
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )
        self.end_step(start, step, "File header opened and resampler created. Audio data not yet loaded.")

        start, step = self.time_step("2. Decoding and Resampling")
        audio_frames = []
        for frame in container.decode(audio):
            resampled_frames = resampler.resample(frame)
            for resampled_frame in resampled_frames:
                audio_frames.append(resampled_frame)
        self.end_step(start, step, "Audio data read, decoded, and resampled in chunks.")

        start, step = self.time_step("3. Converting to Numpy Array")
        if not audio_frames:
            result = np.array([])
        else:
            result = np.concatenate(
                [frame.to_ndarray().flatten() for frame in audio_frames]
            ).astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "Processed audio frames converted to numpy array and normalized.")
        return result

    @timeit
    def convert_ffmpeg(self):
        """
        This method:
        1. Creates a temporary WAV file.
        2. Uses FFmpeg to convert the input to the temporary WAV file.
        3. Reads the temporary WAV file and converts it to a numpy array.
        Note: This method first converts the input to a WAV file before processing,
        which can add overhead but ensures a consistent input format.
        """
        if self.input_file.endswith('.wav'):
            return self.to_np(self.input_file)
        else:
            start, step = self.time_step("1. File Opening and Initial Setup")
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_file_path = temp_file.name
            self.end_step(start, step, "Temporary WAV file created.")
            try:
                start, step = self.time_step("2. Decoding and Resampling")
                subprocess.run([
                    'ffmpeg', '-i', self.input_file,
                    '-ac', '1', '-ar', '16000',
                    temp_file_path, '-y'
                ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                self.end_step(start, step, "Input file converted to WAV format using FFmpeg.")
                return self.to_np(temp_file_path)
            finally:
                os.remove(temp_file_path)

    def to_np(self, file_path):
        start, step = self.time_step("3. Converting to Numpy Array")
        with open(file_path, 'rb') as f:
            header = f.read(44)
            raw_data = f.read()
        samples = np.frombuffer(raw_data, dtype=np.int16)
        result = samples.astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "WAV file read, converted to numpy array, and normalized.")
        return result

def benchmark(input_file):
    converter = AudioConverter(input_file)

    print("\nPydub Backend:")
    pydub_array = converter.convert_pydub()

    print("\nAV Backend:")
    av_array = converter.convert_av()

    print("\nFFmpeg Backend:")
    ffmpeg_array = converter.convert_ffmpeg()

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\test1.flac"
    benchmark(input_file)
```
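One fragile spot worth flagging in the script above: `to_np` skips a fixed 44-byte header, which assumes a minimal canonical WAV layout, but ffmpeg can emit extra chunks (e.g. a LIST metadata chunk) before the sample data, which would shift the samples. A hedged sketch using the stdlib `wave` module parses the chunk structure instead (`wav_to_float32` is a hypothetical helper, not part of the PR):

```python
import io
import wave
import numpy as np

def wav_to_float32(path_or_file) -> np.ndarray:
    """Read a 16-bit PCM WAV and normalize to float32 in roughly [-1, 1].
    wave walks the RIFF chunk structure, so extra metadata chunks are
    skipped rather than misread as samples."""
    with wave.open(path_or_file, "rb") as w:
        assert w.getsampwidth() == 2  # expect 16-bit PCM
        raw = w.readframes(w.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16)
    return samples.astype(np.float32) / np.iinfo(np.int16).max
```

Alternatively, passing `-fflags +bitexact` to ffmpeg suppresses the metadata chunk so the 44-byte assumption holds.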
BBC-Esq commented 3 weeks ago

No idea why Black isn't being satisfied...sigh...

BBC-Esq commented 3 weeks ago

I've also tested the following, but it appears that FFMPEG just kicks ass so...hence this pull request:

Here's what I tested just FYI:

| Handle Input | Resampling | To Mono | Load in Memory | Into Array and Normalize |
|---|---|---|---|---|
| pydub | pydub | pydub | pydub | numpy |
| av | av.audio.resampler | av | av | numpy |
| FFmpeg | FFmpeg | FFmpeg | numpy | numpy |
| av | av.audio.resampler | av | av | cupy |
| FFmpeg | FFmpeg | FFmpeg | cupy | cupy |
| FFmpeg | cupyx.scipy.signal.resample_poly | FFmpeg | cupy | cupy |
MahmoudAshraf97 commented 3 weeks ago

You need to use black==23.*. Also, the way you check for FFmpeg's existence usually fails more often than it works, especially on Windows; I used the same method to check for Perl in another project, hence the advice.

It seems that the community would take simplicity over speedups that depend on external packages. In #1106 we decided to ditch CuPy feature extraction, regardless of the speedup, because of the increased code complexity, so I'll be hesitant to move forward with this PR, to be honest. faster-whisper already allows this functionality: you can easily import the load_audio function from the original whisper, which uses ffmpeg directly, and pass the audio to model.transcribe. Or you can run a dataloader on a separate thread to prepare audio while another file transcribes, to maximize utilization.
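The second workaround mentioned above (a dataloader on a separate thread) can be sketched with a bounded queue. This is a minimal sketch, not library code; `decode` and `transcribe` are hypothetical stand-ins for the real audio-loading and model calls:

```python
import queue
import threading

def prefetch_transcribe(files, decode, transcribe, depth=2):
    """Decode the next file on a background thread while the main
    thread transcribes the current one, overlapping I/O with compute."""
    q = queue.Queue(maxsize=depth)

    def producer():
        for f in files:
            q.put(decode(f))   # blocks when the queue is full
        q.put(None)            # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (audio := q.get()) is not None:
        results.append(transcribe(audio))
    return results
```

The bounded queue keeps at most `depth` decoded files in memory, which matters for multi-hour recordings.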

MahmoudAshraf97 commented 2 weeks ago

I also benchmarked on my side using your script after modifying it to run each method 10 times and average the time taken

34 min file:

AV Backend:
convert_av took 2.643623 seconds

FFmpeg Backend:
convert_ffmpeg took 2.135129 seconds

3 min file:

AV Backend:
convert_av took 0.273722 seconds

FFmpeg Backend:
convert_ffmpeg took 0.386438 seconds

av==13.1.0, ffmpeg==4.4.2

The performance is barely better for large files and actually slower for smaller files.

BBC-Esq commented 2 weeks ago

That's very strange... Can you show the script, please? I carefully created the benchmarking script to compare apples to apples, but humans do make mistakes sometimes, so...

[EDIT] Also, do you happen to have the audio file you tested? Can you try it on the one my initial benchmark results were tested on? Here's a link to the file:

https://huggingface.co/datasets/reach-vb/random-audios/blob/main/sam_altman_lex_podcast_367.flac

MahmoudAshraf97 commented 2 weeks ago
Script

```python
import numpy as np
import time
import os
import av
import subprocess
import tempfile

# ANSI escape code for green text
GREEN = '\033[92m'
RESET = '\033[0m'

def timeit(func):
    def wrapper(*args, **kwargs):
        times = []
        for i in range(10):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            times.append(time.perf_counter() - start)
        print(f"{GREEN}{func.__name__} took {np.mean(times):.6f} seconds{RESET}")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    @timeit
    def convert_av(self):
        """
        This method:
        1. Opens the audio file without loading it entirely into memory.
           - "container.decode(audio)" yields frames one at a time, allowing for true
             streaming processing without loading the entire file into memory
        2. Creates a resampler for the desired output format.
        3. Processes the audio in chunks, decoding and resampling each chunk.
        4. Concatenates the processed chunks into a numpy array.
        """
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )
        audio_frames = []
        for frame in container.decode(audio):
            resampled_frames = resampler.resample(frame)
            for resampled_frame in resampled_frames:
                audio_frames.append(resampled_frame)
        if not audio_frames:
            result = np.array([])
        else:
            result = np.concatenate(
                [frame.to_ndarray().flatten() for frame in audio_frames]
            ).astype(np.float32) / np.iinfo(np.int16).max
        return result

    @timeit
    def convert_ffmpeg(self):
        """
        This method:
        1. Creates a temporary WAV file.
        2. Uses FFmpeg to convert the input to the temporary WAV file.
        3. Reads the temporary WAV file and converts it to a numpy array.
        """
        if self.input_file.endswith('.wav'):
            return self.to_np(self.input_file)
        else:
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_file_path = temp_file.name
            try:
                subprocess.run([
                    'ffmpeg', '-i', self.input_file,
                    '-ac', '1', '-ar', '16000',
                    temp_file_path, '-y'
                ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                return self.to_np(temp_file_path)
            finally:
                os.remove(temp_file_path)

    def to_np(self, file_path):
        with open(file_path, 'rb') as f:
            header = f.read(44)
            raw_data = f.read()
        samples = np.frombuffer(raw_data, dtype=np.int16)
        result = samples.astype(np.float32) / np.iinfo(np.int16).max
        return result

def benchmark(input_file):
    converter = AudioConverter(input_file)

    print("\nAV Backend:")
    av_array = converter.convert_av()

    print("\nFFmpeg Backend:")
    ffmpeg_array = converter.convert_ffmpeg()

if __name__ == "__main__":
    input_file = r"/mnt/e/Projects/sam_altman_lex_podcast_367.flac"
    benchmark(input_file)
```

Results on your file:

AV Backend:
convert_av took 10.720308 seconds

FFmpeg Backend:
convert_ffmpeg took 6.279689 seconds
BBC-Esq commented 2 weeks ago

I used your script verbatim, except I used a Windows-style path for the audio file:

AV Backend:
convert_av took 9.009444 seconds

FFmpeg Backend:
convert_ffmpeg took 2.128546 seconds

On a very long audio file, at least, it seems that straight-ffmpeg is faster for both of us. However, mine was 323% faster while yours was only ~71% faster.
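For the record, the "percent faster" figures in this thread are just wall-time ratios. A small helper (hypothetical, for illustration only) makes the arithmetic explicit:

```python
def pct_faster(slow: float, fast: float) -> float:
    """How much faster `fast` is than `slow`, as a percentage."""
    return (slow / fast - 1.0) * 100.0

# My Windows run: 9.009444 s (av) vs 2.128546 s (ffmpeg)
print(round(pct_faster(9.009444, 2.128546)))    # → 323
# Mahmoud's run: 10.720308 s (av) vs 6.279689 s (ffmpeg)
print(round(pct_faster(10.720308, 6.279689)))   # → 71
```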

Thoughts on our somewhat similar (but not entirely) results...

1. You use Linux and I use Windows
2. Different versions of libraries
3. Your discrepancy regarding a 3-minute file being faster with av

linux v windows

You're using linux, right? Could be a difference...

different versions of libraries

I'm assuming you pip-installed av and did not build it? If this is correct, your initial test was with av==13.1.0, but mine was av==13.0.0. However, according to here and here, both versions bundle FFmpeg 6.1.1, so no difference there...

Regarding the straight-ffmpeg test, you used ffmpeg version 4.4.2, and after running ffmpeg -version, I discovered that I used 2024-09-12-git-504c1ffcd8. According to here, version 4.4.2 came out in April 2022, so this might be a source of the discrepancy if I'm testing a build from 9/12/2024...

As a side note, I'm not aware of how to download a specific "version" like 4.4.2. I'm only aware of how to get ffmpeg for Windows by going here and downloading a build that someone created for Windows. If you know how to download a specific version for Windows, I can definitely try that...

your discrepancy for a 3 minute file

Since I wasn't given your specific 3-min file, I tested on a short 6 min .mp3 file I had. The results I got were:

AV Backend:
convert_av took 1.217185 seconds

FFmpeg Backend:
convert_ffmpeg took 0.448657 seconds

In summary, straight-ffmpeg was 2.7 times faster or, stated differently, 171% faster.

My only thought would be to re-test with a newer version of straight-ffmpeg and compare it to av==13.1.0. Also, with a relatively short audio file (compared to the Altman podcast, anyway), there might have been some background activity on your computer during that test. True, that could happen with any test, including mine, but with a longer audio file it's less likely to have a meaningful impact, in theory of course.

Also, it defies common sense. Assume hypothetically that we test straight-ffmpeg against an av build that uses the same exact version of ffmpeg under the hood (obviating the versioning discrepancy above), with the exact same parameters and/or methods (e.g. batch processing or what have you). There is no possible way that av is faster: a library that bootstraps another can't do anything but add overhead. Again, that assumes we run ffmpeg the same exact way.

A final thought: neither of us is a professional benchmarker like "Gamers Nexus" on YouTube, with dedicated benchmarking rigs and insane benchmarking protocols, so take everything I say with a grain of salt.

Regarding detecting FFMPEG on windows

You previously expressed concerns about detecting ffmpeg on Windows. I've had NO trouble detecting FFmpeg on my Windows machine using my vector-db program that uses whisper-s2t; it's either on the PATH or it isn't. And as relevant to this pull request, faster-whisper would simply default to the way it currently operates, so no harm, no foul here. I'd recommend not borking this entire pull request over this concern.

You also suggested bootstrapping ffmpeg from OpenAI's vanilla whisper library, making this pull request unnecessary. In light of your comments regarding code simplicity, I fail to see how that is any simpler than letting faster-whisper use ffmpeg if it's on the system PATH and, if not, do what faster-whisper already does. It seems the contrary...

I suggest that you keep this pull request open, subject to modification, unless/until there's a strong consensus that ffmpeg operates somehow differently on short files...

MahmoudAshraf97 commented 2 weeks ago

I upgraded ffmpeg to v7.1 and these are the results. Note that the ffmpeg results fluctuate greatly, and I couldn't pin down the reason:

AV Backend:
convert_av took 11.649484 seconds

FFmpeg Backend:
convert_ffmpeg took 4.218129 seconds

AV Backend:
convert_av took 10.717828 seconds

FFmpeg Backend:
convert_ffmpeg took 3.347625 seconds

AV Backend:
convert_av took 10.838667 seconds

FFmpeg Backend:
convert_ffmpeg took 5.127887 seconds

Anyways, the main discussion here is not whether one is faster than the other, because ffmpeg is superior in almost every way except for the installation process. But the consensus since this repo was created has been not to use ffmpeg for that exact reason, and because audio loading is rarely the bottleneck. If this represents a problem for some use case, there exist workarounds that involve no coding and are compatible with the current state of faster-whisper (and backward compatible too). That's why I'm against merging this PR. I'll keep it open, of course, in case anyone wants to chime in.

BBC-Esq commented 2 weeks ago

Thanks. Again, nothing changes from the user's perspective except that it'll use something faster if it's available. There are no code changes whatsoever for a person writing a script that uses faster-whisper.

You touch upon a core issue though...

I'd encourage faster-whisper to be more accommodating with pull requests; otherwise, it'll just become stagnant. The testing I do, and even responding in depth to comments, takes a significant amount of time that could be spent on other repositories, but I am trying to contribute to faster-whisper because I have a soft spot for it, lol. This pull request changes NOTHING about how faster-whisper operates when a person doesn't have FFmpeg on their system's PATH, and ONLY offers a speedup for those who do, yet it's met with scrutiny that, in the terms of my legal profession, can only be described as "beyond a reasonable doubt."

Food for thought...

MahmoudAshraf97 commented 2 weeks ago

Before continuing with the discussion I should thank you for dedicating time and efforts to make this project better.

I, as a maintainer, have a different POV: the default is not to accept a PR unless there are solid reasons to do so, and if you check the old closed PRs you'll find that this was mostly the case. The question here is: why not check for X in a user's environment and use it to provide a speedup if it exists? This was the case with PyTorch in feature extraction, and later CuPy, and both were rejected. Although the user interface would not change, the underlying backend would be cluttered with code used by only a subset of the userbase, and that doesn't make maintaining the code any easier, because we would then have to deal with bugs caused by both ffmpeg and PyAV (or generally by any two packages in a similar scenario).

So unless the PR improves things for all users, I'd advise against accepting it. And if it does, I'll accept it only if it can replace the old functionality, because I'm against maintaining two pieces of code with the same functionality even if one of them has an edge over the other. I recommend reading This and This to gain insights into my POV and those of other open-source maintainers as well.

BBC-Esq commented 2 weeks ago

You're welcome, and I've seen enough, from your willingness to participate in the whisper benchmarking repository of a private bloke like me to interactions on here, to know your thoughts are genuine and you're doing what you think is best. Just my two cents, which you're free to adopt, modify, or what have you.

Cheers!