Open UsernamesLame opened 2 months ago
Looked into the numpy array saving: https://numpy.org/doc/stable/reference/generated/numpy.save.html
We can save the converted audio files to disk before feeding them to the model. This way we technically bypass the need for PyDub and ffmpeg at transcription time. It also means we aren't launching background processes (PyDub driving ffmpeg) to manipulate audio just so it's ready for the model to ingest.
@UsernamesLame, Thanks for the ideas!
This is what I was trying to explain:
sound = AudioSegment.from_file(media_file_path)
sound = sound.set_frame_rate(constants.WHISPER_SAMPLE_RATE).set_channels(1)
samples = sound.get_array_of_samples()
arr = np.array(samples).T.astype(np.float32)
arr /= np.iinfo(samples.typecode).max
with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)
I haven't tested it yet, but the idea is to do all the operations we need on the numpy array ahead of time, and then later we can just do something like:
array = np.load("file.npy")
_transcribe(array)
This way we can mass process our audio files before we load them into memory for whisper to process.
@abdeladim-s hopefully this makes sense now! The idea is to batch process hundreds if not thousands of audio files ahead of time in parallel (I can write a script to do this for us) and save them in a format we can just load into the model and get transcriptions back from.
Yes, I know numpy should be fast, but every context switch we can avoid, the better.
PyDub and ffmpeg are actually there for the conversion to numpy arrays!
If we have numpy arrays, why would we need to save them to disk?
Pre-processing. Every context switch we can avoid, the better! Imagine transcribing thousands of files.
The current solution looks like this:
pywhispercpp -> PyDub -> ffmpeg -> PyDub -> pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp
With my proposal it would look more like this:
pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp
The goal isn't to make this a full replacement for the existing solution, but tomorrow I'll write a demo showing an alternative way to load data into the model, cutting out as many context switches as possible to gain some performance.
Okay, so the idea is to process a large number of files?
But I think it's the same, if not worse, taking into consideration the overhead of saving and loading the files to/from disk. And you will need to wait for the conversion in any case.
IO operations are worse than using memory.
IO operations are generally cheaper than context switches. I'll test this unless you want to.
I can also read the files into memory and store them in a BytesIO object and read from it like a filesystem object too. There are a lot of ways this can be taken. But I genuinely believe that avoiding context switches > IO
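For instance, a minimal sketch of the BytesIO variant (standalone, with a dummy array standing in for real audio):

import io
import numpy as np

buf = io.BytesIO()
np.save(buf, np.zeros(16000, dtype=np.float32), allow_pickle=False)
buf.seek(0)  # rewind before reading, like any file object
audio = np.load(buf)  # round-trip entirely in memory, no disk involved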
RE: deepcopy
You can create completely independent objects that are clones of existing objects. Think: instead of myModel = Model(...), we do myModel = existingModel.deepclone(). So we don't read the model weights from disk again, but instead do an in-memory copy.
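To illustrate the idea (a sketch only: deepclone is hypothetical, and whether copy.deepcopy works here depends on the binding exposing a __deepcopy__ for the underlying whisper context, which pybind11 objects often don't):

import copy
from pywhispercpp.model import Model

base_model = Model('base')  # weights read from disk once

try:
    clone = copy.deepcopy(base_model)  # in-memory copy, if the binding supports it
except TypeError:
    clone = Model('base')  # fall back to a second disk load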
Yes please, go ahead and test! Experiments and numbers will save us a lot of talk :) Looking forward to it!
So I'm testing with a sample 33mb mp3 and the results are promising. Pre-processing into a numpy array and saving to disk shrinks it to 5.4mb, so we can definitely have an impact on memory footprint with a helper script! Let me test transcription performance.
I have numbers for you @abdeladim-s!
Here's the script:
Here are the results!
********************
using raw numpy array finished in 2.6472320556640625
********************
********************
using mp3 file finished in 26.56456184387207
********************
On an M1 Pro MBP with 16gb of RAM, not using the Metal backend, using the base whisper model.
Told ya pre-processing the audio files into numpy arrays would improve processing time!
This computer has a memory bandwidth of 200GB/s, and disk bandwidth of around 4GB/s. Context switching costs more than just loading raw data into memory :)
I am going to chase every optimization I can like a dog chases its tail.
from pydub import AudioSegment
import numpy as np
sound = AudioSegment.from_file("audio.mp3")
sound = sound.set_frame_rate(1600).set_channels(1)
arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)
with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)
This is the pre-conversion script. I'm going to update WhisperWav to output numpy arrays that can be fed directly into the model.
So I tried pre-converting a few files. Most work, but at random numpy will completely mangle the conversion to an ndarray and the save, leading to UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data when calling numpy.load.
If anyone has any idea why it's randomly mangling things, I'd love help here.
Edit:
Yea I'm at a complete dead end as to why numpy insists on butchering audio files at random when saving. When it works, the speedups are insane. When it doesn't work, the errors are absolutely useless.
Edit 2:
I decided to see if Copilot could help. It suggested:
with open("audio.npy", "rb") as f:
    audio_data = np.fromfile(f, dtype=np.float32)
And so far it seems to be working?
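(A caveat worth noting with np.fromfile: it doesn't parse the .npy header that np.save writes, so the first few values come back as header bytes reinterpreted as float32; np.load is the reader that matches np.save. A quick standalone way to see it:)

import numpy as np

arr = np.arange(8, dtype=np.float32)
np.save("demo.npy", arr)

loaded = np.load("demo.npy")                     # parses the .npy header
raw = np.fromfile("demo.npy", dtype=np.float32)  # header bytes come back as garbage floats

print(loaded.size, raw.size)  # raw is longer by the header's worth of values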
Ok so final comment for now. A 42m audio file at 101 mb, once crushed to mono with the sample rate set to 1600 Hz, becomes a ~17mb npy file.
Processing the npy file takes around 10 seconds. Processing the raw wav file takes around 63 seconds.
This doesn't seem like an error, or even unreasonable. Can someone else please try to reproduce? Are we literally spending that much time prepping the file?!
I still don't get what you are trying to achieve, but if I understand it correctly, it's basically the same as what I did, except that you are trying to dump and load the npy array, and you've made a deadly bug! lol
Also, when you did the experiment, why didn't you calculate the time needed to convert the files to npy? People are not moving around with dumped npy arrays of their media files :sweat_smile:
Here is what I think this should be:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pywhispercpp.model import Model
import numpy as np
import time
from pydub import AudioSegment

def usenumpy():
    # This part from your script should be included as well! ##########
    sound = AudioSegment.from_file("audio.mp3")
    # Here 16Khz not 1600 !!!! That's what you were doing wrong !!!
    sound = sound.set_frame_rate(16000).set_channels(1)
    arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    arr /= np.iinfo(np.int16).max  # Normalization is important! otherwise you will get 'utf-8' codec can't decode bytes
    # dump array to npy file
    with open("file.npy", "wb") as file:
        np.save(file, arr, allow_pickle=False)
    ####################
    # load model
    model = Model('base')
    # load array from npy file
    audio_data = np.load("file.npy")
    segments = model.transcribe(audio_data)
    for segment in segments:
        print(segment)

def useaudiofile():
    model = Model('base')
    segments = model.transcribe("audio.mp3")
    for segment in segments:
        print(segment)

begin = time.time()
usenumpy()
end = time.time()
print("*" * 20)
print(f"using raw numpy array finished in {end - begin}")
print("*" * 20)

begin = time.time()
useaudiofile()
end = time.time()
print("*" * 20)
print(f"using mp3 file finished in {end - begin}")
print("*" * 20)
I used this file from my other project, here are the results:
[2024-08-30 17:34:17,168] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:33,929] {model.py:133} INFO - Inference time: 16.761 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using raw numpy array finished in 17.416718244552612
********************
[2024-08-30 17:34:34,516] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:50,128] {model.py:133} INFO - Inference time: 15.612 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using mp3 file finished in 16.196656465530396
********************
This is not a real experiment per se, but as you can see, they are almost the same. There is no need to dump and load the numpy array!
Lmk what you think?
I caught the deadly bug and fixed it locally.
As for performance, it's odd that I'm getting better results and you aren't.
I'm guessing it has something to do with the memory bandwidth of the M1 Pro vs x86 chips?
But yea, you're understanding now. I haven't tested it on x86. Also, I didn't include the conversion to numpy arrays because the idea is to mass transform first, then transcribe.
At least one benefit is the numpy arrays are generally smaller in my experience.
What are your system specs btw? And Python version? I'm using 3.12 and getting good results.
If I can't increase performance I can at least lower memory usage I guess. 😅
My idea is to let the model be long lived and keep feeding it fresh array dumps as it transcribes them one after another. This way, in a different process (I'm going to edit the example to show this), we can spawn sub-processes to mass convert media files to numpy arrays.
The idea is that the model is the limiting factor, as in most people don't have the CPU / RAM to load 2 - 4 models, so if we can pre-process the files so the model can transcribe faster with less memory, it's still a (small) win!
I have access to a 128-core ARM box that is piss slow at transcribing but can quickly spit out these numpy arrays.
It's not gonna benefit everyone, but it's worth exploring the thought. It's also possible to store all the numpy arrays in a single database that clients running the models pull from to transcribe, creating a transcription cluster. The big benefit being that the clients can be small, like a Raspberry Pi, and still get considerably faster transcriptions.
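A minimal sketch of what a worker in such a cluster could look like, assuming a shared directory of .npy files stands in for the database (the directory name and claiming scheme here are hypothetical):

import os
import time
from glob import glob

import numpy as np
from pywhispercpp.model import Model

QUEUE_DIR = "npy_queue"  # hypothetical shared drop folder of pre-converted arrays

model = Model('base')  # long-lived model, loaded once per worker

while True:
    pending = sorted(glob(os.path.join(QUEUE_DIR, "*.npy")))
    if not pending:
        time.sleep(1)
        continue
    path = pending[0]
    audio = np.load(path)
    for segment in model.transcribe(audio):
        print(path, segment)
    os.remove(path)  # naive claim/ack; a real cluster would need atomic claiming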
I'm running a few more tests, including ensuring the numpy arrays produce the same results as the mp3, mostly because I can't believe that after crushing the frame rate, the response frequency, and the channels, I can go from 100mb to 70mb.
As of now, numpy is getting me 17 seconds while mp3 is 69 seconds. Timing the conversion to a numpy array gets me 5 seconds. So 21 seconds vs 69.
The performance gap has shrunk, but it's not gone. It's still ~3x faster to pre-process numpy arrays and then load them. I'm not saying everyone should, but it would make a fun example!
Edit:
I forgot to mention, I got us added to Whisper.cpp's README.md :)
https://github.com/ggerganov/whisper.cpp/pull/2396
Merged already. I felt like we were ready for more visibility.
I have an i7 8c/16t with 32 GB DDR4, running Python 3.10. When I tested the code provided with the 1600 sample rate, I got results similar to yours, which is obvious because it's like 10x down-sampling, but when I fixed it, it's almost the same. It's the same algorithm running under the hood anyway!
I can see the benefits of batch pre-processing, and this is exactly why I made the transcribe function accept audio files as well as numpy arrays. If you want something quick, you can throw whatever file at it and the library will convert it for you; if you are a power user and you know what you are doing, you can use numpy arrays directly, in which case the pre-processing step will be skipped. I think, from a library point of view, this gives more flexibility to the users!
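In other words, both of these work (a sketch; "audio.mp3" is a placeholder, and the array must be 16 kHz, mono, float32, normalized, as in the scripts above):

from pydub import AudioSegment
import numpy as np
from pywhispercpp.model import Model

model = Model('base')

# Quick path: hand the library a media file and let it do the conversion.
segments = model.transcribe("audio.mp3")

# Power-user path: pass a pre-converted array; pre-processing is skipped.
sound = AudioSegment.from_file("audio.mp3").set_frame_rate(16000).set_channels(1)
audio = np.array(sound.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max
segments = model.transcribe(audio)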
You can't tell from one example! You have to test multiple times and average the results. It's the same algorithm I used, so you should get basically the same results, unless there is some magic in dumping and loading the npy files.
Oh, I just noticed you made a PR for this. You really think we are ready?! It's a small project, it doesn't deserve that visibility :sweat_smile: But thanks anyway!
Re testing: I know one test isn't enough, but still it's promising!
Re pywhispercpp: It 100% deserves the visibility!
Also I double checked I'm using 16000 locally, and:
********************
using raw numpy array finished in 10.000927925109863
********************
That's still a pretty drastic difference. Also, when I accidentally did it with 1600, there was no real drop in accuracy on simpler audio files.
Let's put my numpy theories to the test. I'm going to crush around 6h of audio into numpy arrays and transcribe it.
- I think it should not be a drastic difference, in my opinion, as long as you are using the same algorithm as _load_audio.
- If you have numpy arrays, you can pass them through the transcribe function without any problem; as I said, the pre-processing step won't be executed!
- Or maybe I am wrong and I missed something, and I need to make an optimization somewhere!
It really comes down to batch processing and pre-normalizing the numpy arrays making a very big difference on ARM (M1 Pro). I'm going to test feeding around 7.5h of audio into it and post the results.
Edit:
Just over 6gb of files converted into numpy arrays in 33 seconds. Time to transcribe!
Edit 2:
Whisper just spat out some debug logs: 174 seconds to transcribe 1h of audio with normalized numpy arrays!
Extrapolating this, it should take 17 minutes to transcribe >6h of audio. Let's see what actually happens, as whisper spat out another debug log saying it finished in 147 seconds.
********************
using raw numpy array finished in 1105.0404160022736
********************
neat!
Edit:
We have an initial number for the 1gb wav files: 187s. Extrapolating again, 1122 seconds, or 18 minutes.
So far the speed up isn't that promising, but the next check should be memory usage!
https://github.com/EtienneAb3d/WhisperHallu?tab=readme-ov-file
I found this, a project about optimizing for whisper!
Interesting result!
Sounds great, I'll take a look
********************
using wav finished in 1575.6269478797913
********************
ouch!
26 minutes for raw wav files, 17 minutes with numpy arrays.
I think we have a winner? Opinion?
Next test will be memory usage I guess.
Interesting! I think it's because of the parallel pre-conversion of the files to numpy. For a small number of files, this won't have a huge effect! But I have an idea: if you can replicate the same on Colab, that will give us a clear view of what's really happening in a fresh environment!
I've never used Colab before, so here's the code.
from pywhispercpp.model import Model
import numpy as np
import time
import os
from glob import glob

model = Model('base')

def usenumpy():
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".npy")]
    for file in files:
        with open(f"{file}", "rb") as f:
            audio_data = np.fromfile(f, dtype=np.float32)
        numpy_segments = model.transcribe(audio_data)

def usewav():
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".wav")]
    for file in files:
        wav_segments = model.transcribe(file)

begin = time.time()
usenumpy()
end = time.time()
print("*" * 20)
print(f"using raw numpy array finished in {end - begin}")
print("*" * 20)

begin = time.time()
usewav()
end = time.time()
print("*" * 20)
print(f"using wav finished in {end - begin}")
print("*" * 20)
I used cobalt.tools to download a 1.5h video's audio from YouTube as a WAV, then converted it with this:
from pydub import AudioSegment
import numpy as np
from glob import glob
import os
import time

begin = time.time()
files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith((".npy", ".md", ".txt", ".py", ".cfg"))]
for file in files:
    sound = AudioSegment.from_file(file)
    sound = sound.set_frame_rate(16000).set_channels(1)
    numpy_array = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    numpy_array /= np.iinfo(np.int16).max
    with open(f"{file}.npy", "wb") as f:
        np.save(f, numpy_array, allow_pickle=False)
end = time.time()
print(f"{end - begin} seconds elapsed")
I feel like it should be OK to feed it the same audio file 6 times to get a general idea, as it seems like whisper performs worse with each pass, not better.
If you want to make a Colab / Jupyter notebook, I'll gladly poke around with you. My theory is that the audio files being massive is causing the issue. The numpy arrays I save to disk are much smaller by comparison: the .wav is around 1gb, the .npy is around 393mb.
Anyways, for now I must say goodnight my friend! Don't let the geese bite!
So, the large files are causing the issue?! Probably! But I am still confused: why is convert -> save -> load -> transcribe faster than convert -> transcribe?
Anyways, good luck with your exploration. Let me know if you find any optimizations we can add to the repo. Goodnight :)
The conversion ahead of time is faster because we're just converting? I'm not sure, to be honest.
Probably! I am confused to be honest.
Same here, to be completely honest.
It's not like the files are small even after conversion. I guess the context switches I mentioned really are that bad.
Okay, let's leave this open for now. Hopefully, we will get opinions and experiments from others as well!
@abdeladim-s and @UsernamesLame
This conversation caught my attention for some ungodly reason... Anyways, here's my contribution: try using the av library instead of pydub. Try a script like this and let's see the speed up of the conversion to numpy compared to pydub 👍
import av
import numpy as np
from glob import glob
import os
import time

def convert_to_numpy(file):
    container = av.open(file)
    audio = container.streams.audio[0]
    resampler = av.audio.resampler.AudioResampler(
        format='s16',
        layout='mono',
        rate=16000
    )
    audio_frames = []
    for frame in container.decode(audio):
        resampled_frames = resampler.resample(frame)
        for resampled_frame in resampled_frames:
            audio_frames.append(resampled_frame)
    if not audio_frames:
        return np.array([])
    numpy_array = np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])
    numpy_array = numpy_array.astype(np.float32)
    numpy_array /= np.iinfo(np.int16).max
    return numpy_array

begin = time.time()
files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith((".npy", ".md", ".txt", ".py", ".cfg"))]
for file in files:
    numpy_array = convert_to_numpy(file)
    with open(f"{file}.npy", "wb") as f:
        np.save(f, numpy_array, allow_pickle=False)
end = time.time()
print(f"{end - begin} seconds elapsed")
@BBC-Esq, Yes, PyAV is a great library too, but like Pydub, it uses ffmpeg under the hood. Therefore, I believe both will offer similar execution times for our use case.
I benched both libraries, and initially pydub was faster at 18 seconds and av was slower, but then upon reviewing the documentation I found a way to get av down to 9 seconds. Although they both use ffmpeg, I'm guessing it's because of the different pipelines and usage that each one offers. I love pydub for ease of use, but it hasn't been updated since 2021, while av is massively maintained, albeit more complicated.
Test them out and let me know!
BTW, this was converting the Sam Altman .flac file into a numpy file. It's approximately two hours long... but I'm sure there are ways one could batch multiple files as well.
Okay, so I've tested them out and found that sometimes Pydub is faster, while other times the execution times are similar. You can find the code in this gist I made.
Let me know how you made PyAV faster!
Sure, here's the benchmarking script that I used, you'd just add your own custom path to an audio file at the bottom:
import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())

        samples = np_array_conversion()

        @timeit
        def np_float_conversion():
            return samples.astype(np.float32)

        audio_array = np_float_conversion()

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)
        output_file = f"{self.base_name}_pydub.npy"

        @timeit
        def np_save(arr, file):
            np.save(file, arr)

        np_save(audio_array, output_file)
        end_time = time.perf_counter()
        return end_time - start_time

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)
            if not audio_frames:
                return np.array([])
            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])

        audio_array = get_array_of_samples()

        @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)

        audio_array = np_float_conversion(audio_array)

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)
        output_file = f"{self.base_name}_av.npy"

        @timeit
        def np_save(arr, file):
            np.save(file, arr)

        np_save(audio_array, output_file)
        end_time = time.perf_counter()
        return end_time - start_time

def benchmark(input_file):
    converter = AudioConverter(input_file)
    pydub_time = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    av_time = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\test_flac.flac"
    benchmark(input_file)
Can you test your script on the same audio file I did...the sam altman interview?
https://huggingface.co/datasets/reach-vb/random-audios/blob/main/sam_altman_lex_podcast_367.flac
In my above script, these are the results processing the altman audio:
np_array_conversion took 0.083757 seconds
np_float_conversion took 0.102801 seconds
np_normalization took 0.100766 seconds
np_save took 0.271963 seconds
Pydub conversion took 18.740992 seconds
get_array_of_samples took 8.872952 seconds
np_float_conversion took 0.081948 seconds
np_normalization took 0.102745 seconds
np_save took 0.274487 seconds
AV conversion took 9.364995 seconds
I "borrowed" from your script merging a couple steps, which was a cool idea, and there was a small speed increase for pydub:
pydub_to_numpy took 0.312771 seconds
np_save took 0.306345 seconds
Pydub conversion took 17.790903 seconds
av_to_numpy took 8.956160 seconds
np_save took 0.292697 seconds
AV conversion took 9.253701 seconds
As an ancillary matter, I spent a fair amount of time testing ffmpeg directly, using os in Python to run it on the command line... It was about 8% faster than av no matter how fast I could get av to run, which makes sense considering that av merely wraps ffmpeg and there must be some overhead... And obviously this just pertains to the audio handling, not creating the numpy array/file... With that being said, the benefit of av (or pydub for that matter) is that users don't have to separately install ffmpeg and add it to PATH, which the average non-programmer doesn't know how to do...
Anyways, just thought it was an interesting conversation and wanted to experiment with it.
I also benched cupy, which allows GPU acceleration for a lot of numpy's operations (straight CUDA and ROCm btw). I'm holding back that script until I perfect it though... It's awesome, but I need to get the batch processing optimized. Hehe...
Let me know if my script gives you different results than I got for some reason...
If you want to see how similar the arrays are you can use something like this as well...
import numpy as np
import cupy as cp

def compare_npy_files(file1, file2, file3):
    arr1 = np.load(file1)
    arr2 = np.load(file2)
    arr3 = cp.asnumpy(cp.load(file3))
    # Compare shapes and adjust if necessary
    min_length = min(arr1.size, arr2.size, arr3.size)
    arr1 = arr1[:min_length]
    arr2 = arr2[:min_length]
    arr3 = arr3[:min_length]
    diff_12 = arr1 - arr2
    diff_13 = arr1 - arr3
    diff_23 = arr2 - arr3
    abs_diff_12 = np.abs(diff_12)
    abs_diff_13 = np.abs(diff_13)
    abs_diff_23 = np.abs(diff_23)
    # Calculate and print histogram of differences
    print("\nHistogram of absolute differences:")
    for diff, label in [(abs_diff_12, "Pydub vs AV"),
                        (abs_diff_13, "Pydub vs AV CuPy"),
                        (abs_diff_23, "AV vs AV CuPy")]:
        hist, bin_edges = np.histogram(diff, bins=10)
        print(f"\n{label}:")
        for i, (start, end) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
            print(f"{start:.2e} to {end:.2e}: {hist[i]} samples")
Okay, so I tested the script you provided.
First off, we don't need to dump the array to .npy, so I'll comment that part out. Actually, I'm not interested in anything other than the actual conversion to NumPy, since the remaining parts are just NumPy operations and should be basically the same. Surprisingly, in your results, np.save shows a huge difference between the two implementations, which indicates that something might be wrong.
Second, the timeit wrapper isn't a good measure for benchmarking, because you can't draw conclusions from just one execution. That's why Python has the timeit utility.
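Something along these lines would give a more trustworthy number (a sketch with a stand-in workload; swap in the real conversion function under test):

import timeit

import numpy as np

def convert():
    # stand-in for the pydub/av conversion being measured
    return np.arange(16000, dtype=np.float32) / np.iinfo(np.int16).max

# run the whole conversion 5 times and report the best run
best = min(timeit.repeat(convert, number=1, repeat=5))
print(f"best of 5 runs: {best:.6f} seconds")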
But anyways, let's proceed with the script.
import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())

        samples = np_array_conversion()

        @timeit
        def np_float_conversion():
            return samples.astype(np.float32)

        audio_array = np_float_conversion()

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)
        # output_file = f"{self.base_name}_pydub.npy"
        # @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)
        # np_save(audio_array, output_file)
        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)
            if not audio_frames:
                return np.array([])
            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])

        audio_array = get_array_of_samples()

        @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)

        audio_array = np_float_conversion(audio_array)

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)
        # output_file = f"{self.base_name}_av.npy"
        # @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)
        # np_save(audio_array, output_file)
        end_time = time.perf_counter()
        return end_time - start_time, audio_array

def benchmark(input_file):
    converter = AudioConverter(input_file)
    pydub_time, pydub_array = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    print(pydub_array.shape)
    av_time, av_array = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")
    print(av_array.shape)
    assert np.array_equal(pydub_array, av_array) is True

if __name__ == "__main__":
    # input_file = "audio.mp3"
    input_file = "/content/sam_altman_lex_podcast_367.flac"
    benchmark(input_file)
And here are the Colab results:
np_array_conversion took 0.567625 seconds
np_float_conversion took 0.081097 seconds
np_normalization took 0.215539 seconds
Pydub conversion took 30.776541 seconds
(138181951,)
get_array_of_samples took 29.561768 seconds
np_float_conversion took 0.181955 seconds
np_normalization took 0.287005 seconds
AV conversion took 30.038430 seconds
(138181934,)
As you can see, the conversion time is almost the same, but more importantly, the arrays are not equal; there are some missing samples in your AV implementation. Unless both implementations produce the same arrays, the comparison doesn't make much sense!
How's the difference between Flac, mp3, wav, etc?
Yea dumping to numpy was my idea for preprocessing thousands of files ahead of time to distribute across multiple whisper inference nodes.
@BBC-Esq, Yes, Pydub is just for someone who wants to quickly test things out without having to convert their media files beforehand, and that's why I made the transcribe function accept NumPy arrays as well!
Cupy is awesome, let us know how it goes.
Can you test your script on the same audio file I did... the Sam Altman interview? huggingface.co/datasets/reach-vb/random-audios/blob/main/sam_altman_lex_podcast_367.flac
What's the difference between FLAC, mp3, wav, etc.?
@UsernamesLame, Each format is encoded in a certain way, so I suppose there might be some difference.
Okay, so I tested the script you provided.
First off, we don't need to dump the array to .npy, so I'll comment that part out. Actually, I'm not interested in anything except the actual conversion to NumPy, since the remaining steps are just NumPy operations and should be basically the same in both cases. Surprisingly, in your results, np.save shows a huge difference between the two implementations, which indicates that something might be wrong.
Second, the timeit wrapper isn't a good measure for benchmarking, because you can't draw conclusions from a single execution. That's why Python has the timeit utility.
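For instance, something like this (a sketch; converter refers to the AudioConverter instance from the script below) gives a best-of-five timing instead of a single sample:

import timeit

# Repeat the whole conversion five times and keep the fastest run;
# one-off timings are easily skewed by caching and background load.
best = min(timeit.repeat(lambda: converter.convert_pydub(),
                         repeat=5, number=1))
print(f"pydub best of 5: {best:.6f} s")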
But anyways, let's proceed with the script.
import numpy as np
import time
import os
from pydub import AudioSegment
import av


def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper


class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()

        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())

        samples = np_array_conversion()

        @timeit
        def np_float_conversion():
            return samples.astype(np.float32)

        audio_array = np_float_conversion()

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        # output_file = f"{self.base_name}_pydub.npy"
        #
        # @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)
        #
        # np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_av(self):
        start_time = time.perf_counter()

        container = av.open(self.input_file)
        audio = container.streams.audio[0]

        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)
            if not audio_frames:
                return np.array([])
            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])

        audio_array = get_array_of_samples()

        @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)

        audio_array = np_float_conversion(audio_array)

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        # output_file = f"{self.base_name}_av.npy"
        #
        # @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)
        #
        # np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array


def benchmark(input_file):
    converter = AudioConverter(input_file)

    pydub_time, pydub_array = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    print(pydub_array.shape)

    av_time, av_array = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")
    print(av_array.shape)

    assert np.array_equal(pydub_array, av_array) is True


if __name__ == "__main__":
    # input_file = "audio.mp3"
    input_file = "/content/sam_altman_lex_podcast_367.flac"
    benchmark(input_file)
And here are the colab results
np_array_conversion took 0.567625 seconds
np_float_conversion took 0.081097 seconds
np_normalization took 0.215539 seconds
Pydub conversion took 30.776541 seconds
(138181951,)
get_array_of_samples took 29.561768 seconds
np_float_conversion took 0.181955 seconds
np_normalization took 0.287005 seconds
AV conversion took 30.038430 seconds
(138181934,)
As you can see, the conversion time is almost the same, but more importantly, the arrays are not equal: the AV array comes out 17 samples shorter, so some samples are missing in your AV implementation. Unless both implementations produce the same arrays, the comparison doesn't make much sense!
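A quick sketch to pin down the mismatch, using the two arrays the benchmark already returns:

import numpy as np

# The shapes differ by 138181951 - 138181934 = 17 samples, so some
# frames were dropped somewhere along the way.
print(pydub_array.shape, av_array.shape)
# Compare the overlapping prefix to see whether the contents agree.
n = min(len(pydub_array), len(av_array))
print(np.array_equal(pydub_array[:n], av_array[:n]))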
I wouldn't recommend benchmarking on Colab but rather on one's own computer. :-) Anyways, when I ran the script you gave me verbatim, except for changing the file path, I received this error:
File "D:\Scripts\bench_cupy\convert_to_numpy_abet.py", line 117, in benchmark
assert np.array_equal(pydub_array, av_array) is True
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
When I ran a modified version of my first script, albeit with the creation of the numpy file removed, I received these results:
pydub_to_numpy took 0.279033 seconds
Pydub conversion took 17.316214 seconds
av_to_numpy took 8.862365 seconds
AV conversion took 8.865208 seconds
Here is the modified script:
Also, the differences are so minuscule that they don't matter as a practical matter. And who is to say that pydub is correct and not av, or vice versa? Moreover, I haven't verified whether both use the same samplers and other sub-libraries, so... perhaps they're both "correct" and the very, very minor differences are due to the different sub-libraries and/or versions of them.
If you use the compare_npy_files function, it'll show the minuscule differences. Here's an example:
0.00e+00 to 3.34e-02: 137870152 samples
3.34e-02 to 6.69e-02: 285089 samples
6.69e-02 to 1.00e-01: 23120 samples
1.00e-01 to 1.34e-01: 2989 samples
1.34e-01 to 1.67e-01: 477 samples
1.67e-01 to 2.01e-01: 91 samples
2.01e-01 to 2.34e-01: 10 samples
2.34e-01 to 2.67e-01: 3 samples
2.67e-01 to 3.01e-01: 1 samples
3.01e-01 to 3.34e-01: 2 samples
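The compare_npy_files helper itself isn't shown in the thread; a minimal sketch that would print a histogram like the one above (the .npy file names are hypothetical) might be:

import numpy as np

def compare_npy_files(file_a, file_b, bins=10):
    a, b = np.load(file_a), np.load(file_b)
    # Truncate to the common length, since the two arrays
    # differ by a handful of samples.
    n = min(len(a), len(b))
    diff = np.abs(a[:n] - b[:n])
    # Bucket the absolute differences into equal-width bins.
    counts, edges = np.histogram(diff, bins=bins)
    for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
        print(f"{lo:.2e} to {hi:.2e}: {count} samples")

compare_npy_files("sam_altman_lex_podcast_367_pydub.npy",
                  "sam_altman_lex_podcast_367_av.npy")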
Another possible source of difference could be the libraries installed on my machine versus whatever is installed on Google's cloud servers... Perhaps that accounts for the divergent results in part, and for the fact that I received an error while the Colab run worked...
As promised, here's the thread I'm making for this.
RE: pre-processing

In pywhispercpp/model.py we have transcribe, and it can take a numpy ndarray. What I was thinking is: rather than loading the audio, crushing it to mono, and setting it to 16 kHz at transcription time, why not pre-process all of that and generate binary blob files, containing just the numpy ndarray, that we can feed straight in? A rough sketch of that batch step is below. It's not a big performance increase, but anything we can do outside of Python land ahead of time will give us a win. And I'm OK chasing micro-optimizations in Python land; I'm useless in C++ land.
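A minimal sketch of that batch step (my own rough version, using pydub as in the benchmark above and assuming a media/ directory of mp3s):

import numpy as np
from multiprocessing import Pool
from pathlib import Path
from pydub import AudioSegment

WHISPER_SAMPLE_RATE = 16000  # whisper's expected input rate

def preprocess(path):
    # Crush to mono, resample to 16 kHz, normalize to float32 in [-1, 1],
    # and save the ready-to-ingest array next to the source file.
    sound = AudioSegment.from_file(path)
    sound = sound.set_frame_rate(WHISPER_SAMPLE_RATE).set_channels(1)
    arr = np.array(sound.get_array_of_samples()).astype(np.float32)
    arr /= np.iinfo(np.int16).max
    np.save(path.with_suffix(".npy"), arr, allow_pickle=False)

if __name__ == "__main__":
    files = list(Path("media").glob("*.mp3"))
    with Pool() as pool:  # one worker per core
        pool.map(preprocess, files)

Inference nodes would then just np.load() the blobs and hand them to transcribe, with no audio stack installed.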
Also, let's put all logging behind a flag so it can be disabled. If possible, let's add a flag to disable whisper.cpp's incessant logging to stderr. I know it has no impact on the transcription itself, but it should be controllable; a possible workaround in the meantime is sketched below.
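For example (a sketch; this silences everything written to the process-level stderr, which is where the C++ logging goes, not just Python's sys.stderr):

import os
from contextlib import contextmanager

@contextmanager
def suppress_stderr():
    # whisper.cpp logs from C++, so redirect the real fd 2
    # rather than swapping the Python-level sys.stderr object.
    saved = os.dup(2)
    with open(os.devnull, "wb") as devnull:
        os.dup2(devnull.fileno(), 2)
    try:
        yield
    finally:
        os.dup2(saved, 2)  # restore the original stderr
        os.close(saved)

Wrapping model construction and transcribe calls in with suppress_stderr(): would hide the chatter until a proper flag lands.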
RE: copy.deepcopy

We need to drop @staticmethod everywhere and implement the deep-copy methods on the C++ side. This is a minor request from me; it would just let us initialize the model in memory and create a deep copy that we can treat as a completely independent instance.
The other option is that I write a helper class using BytesIO to hold the model file in memory, and we feed that to the Model class, I guess? It would still be better than re-initializing the model from disk to create a sterile instance. Roughly what I mean by the deep-copy behavior is sketched below.
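A purely illustrative sketch (it assumes Model keeps its model path around, and it fakes the deep copy by re-initializing from the same file rather than copying the C++ context):

import copy

class Model:
    def __init__(self, model_path):
        self.model_path = model_path
        # ... the real class would load the whisper.cpp context here ...

    def __deepcopy__(self, memo):
        # The C++ context can't be safely duplicated from Python, so
        # hand back a fresh, fully independent instance instead.
        return Model(self.model_path)

clone = copy.deepcopy(Model("ggml-base.en.bin"))  # hypothetical model file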
RE: micro-optimizations

Under _get_segments we have:

assert end <= n, f"{end} > {n}: `End` index must be less or equal than the total number of segments"

but I have to ask: is it even possible to end up in a situation where this assert would ever fire?

RE: features
Let's make the model usable as a context manager so we can do quick and dirty things like the sketch below:
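Something along these lines (assuming Model gains __enter__/__exit__ that release the whisper.cpp context on exit; the model and file names are placeholders):

from pywhispercpp.model import Model

# Hypothetical usage once Model supports the context-manager protocol;
# leaving the block would free the underlying whisper.cpp context.
with Model("base.en") as model:
    for segment in model.transcribe("interview.mp3"):
        print(segment.text)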
Not really necessary; it just gives a more pleasant way of interacting with the model class.