Closed · UsernamesLame closed this 2 months ago
@abdeladim-s Please review this. In the meantime I'm going to write a helper script to auto convert various types of files into something whisper can accept.
@UsernamesLame, Approved 👍
Looking forward to your helper script :)
https://github.com/UsernamesLame/WhisperWav :)
It's not remotely done, but it works. I'm thinking of making it spit out pickled numpy arrays of the audio for faster transcription. As of now, it spits out wav files that whisper.cpp can handle.
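A hedged sketch of what the pickled-array idea could look like; the float32 dtype and 16 kHz length are my assumptions, not necessarily what the script actually emits:

```python
# Sketch: decode audio once, then cache the raw samples as a pickled
# numpy array so later transcription runs can skip the decode step.
# The float32 dtype and 16 kHz length here are assumptions.
import os
import pickle
import tempfile
import numpy as np

def cache_samples(samples: np.ndarray, path: str) -> None:
    """Pickle a sample array to disk for fast reloading."""
    with open(path, "wb") as f:
        pickle.dump(samples, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_samples(path: str) -> np.ndarray:
    """Reload a previously pickled sample array."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip one second of silence at 16 kHz.
audio = np.zeros(16000, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "clip.pkl")
cache_samples(audio, path)
restored = load_samples(path)
assert np.array_equal(audio, restored)
```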
The script looks nice, but if I understand it correctly, it's similar to what I am already doing! There is no need to convert the files to wav, as we are calling the cpp functions directly, not calling `whisper.cpp` using the CLI. A numpy array of the audio file is what we need.
No, I know. But my understanding is that you're doing some pre-processing before transcription.
I want to do all the pre-processing on large batches of data while transcription is actively happening, to remove one more task from the processing pipeline.
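A rough sketch of that overlap: preprocess the next batch of files while the current batch is being transcribed, so conversion never blocks transcription. `preprocess` and `transcribe` below are placeholders, not the project's real functions:

```python
# Sketch: overlap preprocessing of the next file with transcription of
# the current one using a worker thread. The two functions are
# stand-ins for the real decode/convert and whisper calls.
from concurrent.futures import ThreadPoolExecutor

def preprocess(path: str) -> str:
    return f"samples:{path}"        # stand-in for decode/convert

def transcribe(samples: str) -> str:
    return f"text<{samples}>"       # stand-in for the whisper call

files = ["a.mp3", "b.mp3", "c.mp3"]
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    next_batch = pool.submit(preprocess, files[0])
    for upcoming in files[1:] + [None]:
        samples = next_batch.result()
        if upcoming is not None:
            # Kick off the next conversion before transcribing.
            next_batch = pool.submit(preprocess, upcoming)
        results.append(transcribe(samples))
```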
Also, if I'm understanding correctly, you're not actually converting the audio to 16-bit, and that may harm transcription.
I'll look over it again, but I also want to remove the need to convert the wav files to numpy arrays before transcribing.
That should help with performance, at least according to my CPython internals book, which recommends avoiding a lot of these patterns.
I'm aiming to make loading the model into memory the slowest part. And even then, I want to explore making it `deepcopy`-able so we can load the model, clone it in memory, and change its settings from the defaults as needed using `copy.deepcopy`.
This way multiple instances can be loaded 100% independent of each other vs the current situation with static methods.
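One way the deepcopy idea could be sketched (the `Transcriber` and `Settings` classes here are hypothetical, not the project's actual API): share the expensive model handle while deep-copying only the per-instance settings.

```python
# Sketch: keep the (expensive) model handle shared across clones, but
# make the per-instance settings deep-copyable so each transcriber can
# diverge from the defaults independently. Names are hypothetical.
import copy
from dataclasses import dataclass

@dataclass
class Settings:
    language: str = "en"
    n_threads: int = 4

class Transcriber:
    def __init__(self, model_handle, settings=None):
        self.model = model_handle          # shared, never copied
        self.settings = settings or Settings()

    def __deepcopy__(self, memo):
        # Share the model pointer; deep-copy only the settings.
        return Transcriber(self.model, copy.deepcopy(self.settings, memo))

base = Transcriber(model_handle=object())
spanish = copy.deepcopy(base)
spanish.settings.language = "es"
assert base.settings.language == "en"      # defaults untouched
assert spanish.model is base.model         # weights shared
```

Defining `__deepcopy__` explicitly is also one way to sidestep objects that pybind11 can't copy, since the C++ handle is never cloned.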
I explored making everything not static and deep copy friendly but ran into some issues with PyBind11.
It's late here, so I might have to follow up on that front tomorrow.
Yes, pre-processing is necessary to make every type of media compatible with `whisper.cpp`.
The files need to be converted to 16 kHz mono, similar to what you did in the script; it's actually here.
If you remove the conversion to a numpy array, how are you going to pass the data to C++?
Other than that, good luck with your exploration, the code is yours .. looking forward to it :smile:
I can merge this for now if you want?
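For reference, the 16 kHz mono requirement can be sketched like this. Real code would decode with pydub/ffmpeg; the naive linear-interpolation resampler below is only there to illustrate the shape of the transform, not a production choice:

```python
# Sketch of the preprocessing whisper.cpp needs: downmix to mono and
# resample to 16 kHz. The linear-interpolation resampler is a naive
# illustration; a real pipeline would use ffmpeg or a proper filter.
import numpy as np

TARGET_RATE = 16000

def to_mono(samples: np.ndarray) -> np.ndarray:
    """Average channels: (n, channels) -> (n,)."""
    if samples.ndim == 2:
        return samples.mean(axis=1)
    return samples

def resample(samples: np.ndarray, src_rate: int) -> np.ndarray:
    """Naive linear-interpolation resample to TARGET_RATE."""
    if src_rate == TARGET_RATE:
        return samples.astype(np.float32)
    duration = len(samples) / src_rate
    n_out = int(duration * TARGET_RATE)
    x_old = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    x_new = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, samples).astype(np.float32)

stereo_44k = np.ones((44100, 2), dtype=np.float32)  # 1 s of stereo
mono_16k = resample(to_mono(stereo_44k), 44100)
assert mono_16k.shape == (16000,)
assert mono_16k.dtype == np.float32
```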
Yea please do!
As for how we would load them, we could re-initialize the `numpy.ndarray`s, I guess, and pass them along to the C++ interface? I'm struggling to explain it; I need to write some code to show it. Also, I was confused why the code wasn't updated, then just saw you merge it.
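A tentative sketch of that hand-off, assuming the binding wants what whisper.cpp itself expects (contiguous float32 PCM in [-1.0, 1.0]); the rescaling from int16 is the part a decoder's raw samples would typically need:

```python
# Sketch: "re-initialize" the array before handing it to the C++
# side. whisper.cpp consumes float32 PCM in [-1.0, 1.0], so int16
# samples from a wav decoder need rescaling and a contiguity check.
import numpy as np

def prepare_for_cpp(int16_samples: np.ndarray) -> np.ndarray:
    """int16 PCM -> contiguous float32 in [-1.0, 1.0]."""
    floats = int16_samples.astype(np.float32) / 32768.0
    return np.ascontiguousarray(floats)

pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
ready = prepare_for_cpp(pcm)
assert ready.dtype == np.float32
assert ready.flags["C_CONTIGUOUS"]
assert ready.min() >= -1.0 and ready.max() <= 1.0
```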
Edit:
My PR isn't merged(?) Says 2 workflows awaiting approval.
Also, yeah, I noticed the channel crushing, sorry. I'm super tired right now, so I shouldn't comment too much. I'm going to see what else I can work on tomorrow.
Go get some rest .. Take your time! Looking forward to your future PRs. I'll merge this one for now!
Thanks for the contribution :smile:
Calling `set_channels(1)` already converts the audio to one channel, so there was no need to call `split_to_mono`; instead we can just call `get_array_of_samples`. This should have a negligible positive impact on performance. Can someone test this quickly for me? I'm not in a position to test at the moment.
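The claim above is that downmixing directly and splitting-then-mixing land on the same samples. A small numpy imitation of the two paths (pydub's actual `set_channels(1)` downmixes by averaging the channels, which this mimics):

```python
# Sketch: compare downmixing directly (what set_channels(1)
# effectively does) against splitting the channels first and then
# averaging (the redundant split_to_mono route).
import numpy as np

stereo = np.array([[100, 300], [200, 400]], dtype=np.int16)  # (n, 2)

# Path 1: downmix directly by averaging the two channels.
downmixed = stereo.mean(axis=1).astype(np.int16)

# Path 2: split the channels first, then average them.
left, right = stereo[:, 0], stereo[:, 1]
split_then_mix = ((left.astype(np.int32) + right) // 2).astype(np.int16)

assert np.array_equal(downmixed, split_then_mix)  # same result
```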