BBC-Esq opened 3 weeks ago
No idea why Black isn't being satisfied...sigh...
I've also tested the following, but it appears that FFMPEG just kicks ass so...hence this pull request:
Here's what I tested just FYI:
| Handle Input | Resampling | To Mono | Load in Memory | Into Array and Normalize |
|---|---|---|---|---|
| pydub | pydub | pydub | pydub | numpy |
| av | av.audio.resampler | av | av | numpy |
| FFmpeg | FFmpeg | FFmpeg | numpy | numpy |
| av | av.audio.resampler | av | av | cupy |
| FFmpeg | FFmpeg | FFmpeg | cupy | cupy |
| FFmpeg | cupyx.scipy.signal.resample_poly | FFmpeg | cupy | cupy |
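For context, a row like the last one would look roughly like the sketch below. The exact scripts behind that table weren't posted, so the ffmpeg flags, the `load_audio_gpu` name, and the `source_rate` parameter are my assumptions:

```
# Rough sketch only; the actual tested code wasn't posted. Decode with the
# ffmpeg binary to mono 16-bit PCM, move it to the GPU with CuPy, normalize,
# then resample with cupyx.scipy.signal.resample_poly.
import subprocess

import numpy as np
import cupy as cp
from cupyx.scipy.signal import resample_poly


def load_audio_gpu(path: str, source_rate: int, target_rate: int = 16000) -> cp.ndarray:
    # ffmpeg writes raw mono signed 16-bit little-endian PCM to stdout
    pcm = subprocess.run(
        ["ffmpeg", "-i", path, "-f", "s16le", "-ac", "1", "-"],
        check=True,
        capture_output=True,
    ).stdout
    samples = cp.asarray(np.frombuffer(pcm, dtype=np.int16))
    # normalize to float32 in [-1.0, 1.0] on the GPU
    audio = samples.astype(cp.float32) / np.iinfo(np.int16).max
    # polyphase resampling from source_rate to target_rate on the GPU
    return resample_poly(audio, target_rate, source_rate)
```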
You need to use black==23.*
The way you check for FFmpeg's existence usually fails more often than it works, especially on Windows. I used the same method to check for Perl in another project, hence the advice.
It seems that the community would take simplicity over speedups that depend on external packages. In #1106 we decided to ditch CuPy feature extraction, regardless of the speedup, because of the increased code complexity, so I'm hesitant to move forward with this PR, to be honest, since faster-whisper already allows this functionality: you can easily import the load_audio function from original whisper, which uses ffmpeg directly, and pass the audio to model.transcribe, or you can still have a dataloader on a separate thread to prepare audio while another file transcribes to maximize utilization.
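For reference, that workaround might look roughly like this (a minimal sketch; the model size, device settings, and file name are placeholders, not something prescribed here):

```
# Sketch of the no-code-change workaround: decode with the ffmpeg binary via
# openai-whisper's load_audio, then hand the numpy array to faster-whisper.
from whisper import load_audio            # openai-whisper; shells out to ffmpeg
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")

audio = load_audio("example.mp3")         # float32 numpy array, 16 kHz mono
segments, info = model.transcribe(audio)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```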
I also benchmarked on my side, using your script after modifying it to run each method 10 times and average the time taken (a rough sketch of that timing loop is shown below the results).
34 min file:
AV Backend:
convert_av took 2.643623 seconds
FFmpeg Backend:
convert_ffmpeg took 2.135129 seconds
3 min file:
AV Backend:
convert_av took 0.273722 seconds
FFmpeg Backend:
convert_ffmpeg took 0.386438 seconds
av==13.1.0, ffmpeg==4.4.2
The performance is barely better for large files and actually slower for smaller files.
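A repeat-and-average wrapper along these lines is what's meant (a minimal sketch for illustration, not the exact modified script):

```
# Minimal sketch of a repeat-and-average timing wrapper: call a conversion
# method `runs` times and report the mean wall-clock time.
import time
import statistics


def average_runtime(func, runs=10):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Example with the AudioConverter class from the benchmark script below:
# converter = AudioConverter("some_audio.flac")
# print(f"convert_av averaged {average_runtime(converter.convert_av):.6f} s")
# print(f"convert_ffmpeg averaged {average_runtime(converter.convert_ffmpeg):.6f} s")
```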
That's very strange...Can you show the script please? I carefully created the benchmarking script to compare apples-to-apples, but humans do make mistakes sometimes so...
[EDIT] And do you happen to have the audio file you tested? Can you try it on the one that my initial benchmark results were run on...Here's a link to the file:
https://huggingface.co/datasets/reach-vb/random-audios/blob/main/sam_altman_lex_podcast_367.flac
Results on your file:
AV Backend:
convert_av took 10.720308 seconds
FFmpeg Backend:
convert_ffmpeg took 6.279689 seconds
I used your script verbatim, except I used a Windows-style path for the audio file:
AV Backend:
convert_av took 9.009444 seconds
FFmpeg Backend:
convert_ffmpeg took 2.128546 seconds
On a very long audio file, at least, it seems that straight-ffmpeg is faster for both of us. However, mine was 323% faster while yours was only ~71% faster.
A few things stand out between our tests:
1) You use Linux and I use Windows.
2) Different versions of libraries.
3) Your discrepancy regarding the 3-minute file being faster with av.
You're using Linux, right? Could be a difference...
I'm assuming you pip installed av and did not build it? If this is correct, your initial test was with av==13.1.0, but mine was with av==13.0.0. However, according to here and here, both versions bundle ffmpeg 6.1.1, so no difference there...
Regarding the straight-ffmpeg test, you used ffmpeg version 4.4.2, and after running ffmpeg -version, I discovered that I used 2024-09-12-git-504c1ffcd8. According to here, version 4.4.2 came out in April 2022...so this might be a source of the discrepancy if I'm testing a build from 9/12/2024...
As a side note, I'm not aware of how to download a specific "version" like 4.4.2. I only know how to get ffmpeg for Windows by going here and downloading a build that someone created for Windows. If you're aware of how to download a specific "version" for Windows, I can definitely try that...
Since I wasn't given your specific 3-min file, I tested on a short 6 min .mp3 file I had. The results I got were:
AV Backend:
convert_av took 1.217185 seconds
FFmpeg Backend:
convert_ffmpeg took 0.448657 seconds
In summary, straight-ffmpeg was 2.7 times faster or, stated differently, 171% faster.
My only thought would be to re-test with a newer version of straight-ffmpeg and compare it to av 13.1.0...Also, with a "relatively" short audio file (compared to the Altman podcast, anyway), there might have been some background activity on your computer during that test. True, that could happen with any test, including mine, but with a longer audio file it's less likely to have a meaningful impact, in theory of course.
Also, it defies common sense...Let's assume hypothetically that we test straight-ffmpeg against av using the exact same version of ffmpeg under the hood (obviating the versioning discrepancy above), and that we use the exact same parameters and/or methods (e.g., batch processing or what have you): there is no possible way that av is faster. One library that bootstraps another can't do anything but add overhead. AGAIN, assuming that we run ffmpeg the same exact way.
A final thought: neither of us is a professional benchmarker like "Gamers Nexus" on YouTube, with dedicated hardware and rigorous benchmarking protocols, so take everything I say with a grain of salt.
You previously expressed concerns about detecting ffmpeg on Windows...I've had NO troubles detecting FFMPEG on my Windows machine with my vector db program that uses whisper-s2t...it's either on the PATH or it isn't...and as relevant to this pull request, faster-whisper would simply default to the way it currently operates...no harm no foul here. I'd recommend that you not bork this entire pull request over this concern.
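For what it's worth, the check I have in mind is nothing more than a PATH lookup, roughly like this sketch (my own illustration, not the PR's actual code; the helper names are placeholders):

```
# Minimal sketch of an ffmpeg availability check: look the binary up on the
# PATH and confirm it actually runs; fall back to PyAV if it doesn't.
import shutil
import subprocess


def ffmpeg_available() -> bool:
    ffmpeg_path = shutil.which("ffmpeg")
    if ffmpeg_path is None:
        return False
    try:
        subprocess.run(
            [ffmpeg_path, "-version"],
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.CalledProcessError, OSError):
        return False


# decode_audio_ffmpeg / decode_audio_av are placeholders for the two backends:
# audio = decode_audio_ffmpeg(path) if ffmpeg_available() else decode_audio_av(path)
```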
You also suggested bootstrapping ffmpeg from OpenAI's vanilla whisper library, making this pull request unnecessary...In light of your comments regarding code simplicity, I fail to see how that is any simpler than allowing faster-whisper to use ffmpeg if it's in the system PATH and, if not, doing what faster-whisper already does. It seems the contrary...
I suggest that you keep this pull request open, subject to modification, unless/until there's a strong consensus that ffmpeg operates somehow differently on short files...
I upgraded ffmpeg to v7.1, and these are the results. Note that the ffmpeg results fluctuate greatly and I couldn't pin down the reason:
AV Backend:
convert_av took 11.649484 seconds
FFmpeg Backend:
convert_ffmpeg took 4.218129 seconds
AV Backend:
convert_av took 10.717828 seconds
FFmpeg Backend:
convert_ffmpeg took 3.347625 seconds
AV Backend:
convert_av took 10.838667 seconds
FFmpeg Backend:
convert_ffmpeg took 5.127887 seconds
Anyway, the main discussion here is not whether one is faster than the other, because ffmpeg is superior in almost every way except for the installation process; rather, the consensus since this repo was created has been not to use ffmpeg for that exact reason, and because audio loading is rarely the bottleneck. If this represents a problem for some use case, there exist workarounds that involve no coding and are compatible with the current state of faster-whisper (and backward compatible too). That's why I'm against merging this PR. I'll keep it open, of course, in case anyone wants to chime in.
Thanks. Again, nothing changes from the user's perspective except that it'll use something faster if it's available. No code changes whatsoever for a person writing a script that uses faster-whisper.
You touch upon a core issue though...
I'd encourage faster-whisper to be more accommodating with pull requests; otherwise, it'll just become stagnant. The testing I do, and even responding in depth to comments, takes a significant amount of time that could be spent on other repositories, but I am trying to contribute to faster-whisper because I have a soft spot for it. lol. This pull request changes NOTHING about how faster-whisper operates when a person doesn't have FFMPEG in their system's PATH and ONLY offers a speedup for those who do...yet it's met with scrutiny that, in the terms of my legal profession, can only be described as "beyond a reasonable doubt."
Food for thought...
Before continuing with the discussion, I should thank you for dedicating time and effort to making this project better.
I, as a maintainer, have a different POV: the default is not to accept a PR unless there are solid reasons to do so, and if you check the old closed PRs you'll find that this was mostly the case. The case here is "why not check for X in a user's environment and use it to provide a speedup if it exists"; this was the case with PyTorch in feature extraction and later CuPy, and both were rejected because, although the user interface wouldn't change, the underlying backend would be cluttered with code used by only a subset of the userbase. And while being used by a subset of users, it doesn't make maintaining the code any easier, because now we must deal with bugs caused by both ffmpeg and pyav (or generally by any two packages in a similar scenario). So unless the PR improves things for all users, I'd advise against accepting it; and if it does, I'll accept it only if it can replace the old functionality, because I'm against maintaining two pieces of code that have the same functionality even if one of them has an edge over the other. I recommend reading This and This to gain insight into my POV and that of other open-source maintainers as well.
You're welcome, and I've seen enough, ranging from your willingness to participate in the whisper benchmarking repository of a private bloke like me to interactions on here, to know your thoughts are genuine and you're doing what you think is best. Just my two cents is all, which you're free to adopt, modify, or what have you.
Cheers!
This re-introduces the FFMPEG binary, which is much faster than AV (which bundles FFMPEG), to decode audio. It will only use FFMPEG directly if it's available; otherwise, it'll use AV just like before. Thus, there should be no impact whatsoever for users who traditionally don't install FFMPEG separately to use faster-whisper...while people who have FFMPEG somewhere on their system (as many do) will enjoy a nice speedup.

BENCHMARK comparison
## Pydub Backend:

1. **File Opening and Initial Setup** took **13.520851** seconds. Entire audio file loaded into memory.
2. **Decoding and Resampling** took **4.008535** seconds. Resampling and channel conversion performed in-memory.
3. **Converting to Numpy Array** took **0.312081** seconds. Audio data converted to numpy array and normalized.

**convert_pydub** took **17.846182** seconds

## AV Backend:

1. **File Opening and Initial Setup** took **0.002802** seconds. File header opened and resampler created. Audio data not yet loaded.
2. **Decoding and Resampling** took **6.407928** seconds. Audio data read, decoded, and resampled in chunks.
3. **Converting to Numpy Array** took **2.538944** seconds. Processed audio frames converted to numpy array and normalized.

**convert_av** took **9.008320** seconds

## FFmpeg Backend:

1. **File Opening and Initial Setup** took **0.000850** seconds. Temporary WAV file created.
2. **Decoding and Resampling** took **1.736469** seconds. Input file converted to WAV format using FFmpeg.
3. **Converting to Numpy Array** took **0.320413** seconds. WAV file read, converted to numpy array, and normalized.

**convert_ffmpeg** took **2.072908** seconds

This benchmark was done using the Sam Altman podcast, which is over two hours long. Used an RTX 4090 + 13900k. Obviously, the "actual" seconds saved will be less for smaller files...But with that being said, if batch processing is used and/or large files are in fact processed, the "relative" speedup by using FFMPEG is stark.
Again, faster-whisper's reliance on AV would not change; there would simply be an option now to use FFMPEG directly via subprocess. Pydub, another backend, is shown for comparison only. Here's the benchmark script...just add a different file name to test something different:

BENCH SCRIPT HERE
```
import numpy as np
import time
import os
from pydub import AudioSegment
import av
import subprocess
import tempfile

# ANSI escape code for green text
GREEN = '\033[92m'
RESET = '\033[0m'

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{GREEN}{func.__name__} took {end - start:.6f} seconds{RESET}")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def time_step(self, step_name):
        start = time.perf_counter()
        return start, step_name

    def end_step(self, start, step_name, additional_info=""):
        end = time.perf_counter()
        print(f"{step_name} took {end - start:.6f} seconds. {additional_info}")

    @timeit
    def convert_pydub(self):
        """
        This method:
        1. Loads the entire audio file into memory.
           - "AudioSegment.from_file()" initially loads the entire file into memory
        2. Performs resampling and channel conversion in-memory.
        3. Converts the audio data to a numpy array.

        Note: The initial loading time includes file reading and decoding.
        """
        start, step = self.time_step("1. File Opening and Initial Setup")
        audio = AudioSegment.from_file(self.input_file)
        self.end_step(start, step, "Entire audio file loaded into memory.")

        start, step = self.time_step("2. Decoding and Resampling")
        audio = audio.set_frame_rate(16000).set_channels(1)
        self.end_step(start, step, "Resampling and channel conversion performed in-memory.")

        start, step = self.time_step("3. Converting to Numpy Array")
        result = np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "Audio data converted to numpy array and normalized.")

        return result

    @timeit
    def convert_av(self):
        """
        This method:
        1. Opens the audio file without loading it entirely into memory.
           - "container.decode(audio)" yields frames one at a time, allowing for
             true streaming processing without loading the entire file into memory
        2. Creates a resampler for the desired output format.
        3. Processes the audio in chunks, decoding and resampling each chunk.
        4. Concatenates the processed chunks into a numpy array.

        Note: The decoding and resampling step includes the actual reading and
        processing of the audio data.
        """
        start, step = self.time_step("1. File Opening and Initial Setup")
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )
        self.end_step(start, step, "File header opened and resampler created. Audio data not yet loaded.")

        start, step = self.time_step("2. Decoding and Resampling")
        audio_frames = []
        for frame in container.decode(audio):
            resampled_frames = resampler.resample(frame)
            for resampled_frame in resampled_frames:
                audio_frames.append(resampled_frame)
        self.end_step(start, step, "Audio data read, decoded, and resampled in chunks.")

        start, step = self.time_step("3. Converting to Numpy Array")
        if not audio_frames:
            result = np.array([])
        else:
            result = np.concatenate(
                [frame.to_ndarray().flatten() for frame in audio_frames]
            ).astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "Processed audio frames converted to numpy array and normalized.")

        return result

    @timeit
    def convert_ffmpeg(self):
        """
        This method:
        1. Creates a temporary WAV file.
        2. Uses FFmpeg to convert the input to the temporary WAV file.
        3. Reads the temporary WAV file and converts it to a numpy array.

        Note: This method first converts the input to a WAV file before processing,
        which can add overhead but ensures a consistent input format.
        """
        if self.input_file.endswith('.wav'):
            return self.to_np(self.input_file)
        else:
            start, step = self.time_step("1. File Opening and Initial Setup")
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_file_path = temp_file.name
            self.end_step(start, step, "Temporary WAV file created.")

            try:
                start, step = self.time_step("2. Decoding and Resampling")
                subprocess.run([
                    'ffmpeg',
                    '-i', self.input_file,
                    '-ac', '1',
                    '-ar', '16000',
                    temp_file_path,
                    '-y'
                ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                self.end_step(start, step, "Input file converted to WAV format using FFmpeg.")

                return self.to_np(temp_file_path)
            finally:
                os.remove(temp_file_path)

    def to_np(self, file_path):
        start, step = self.time_step("3. Converting to Numpy Array")
        with open(file_path, 'rb') as f:
            header = f.read(44)
            raw_data = f.read()
        samples = np.frombuffer(raw_data, dtype=np.int16)
        result = samples.astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "WAV file read, converted to numpy array, and normalized.")
        return result

def benchmark(input_file):
    converter = AudioConverter(input_file)

    print("\nPydub Backend:")
    pydub_array = converter.convert_pydub()

    print("\nAV Backend:")
    av_array = converter.convert_av()

    print("\nFFmpeg Backend:")
    ffmpeg_array = converter.convert_ffmpeg()

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\test1.flac"
    benchmark(input_file)
```