Performance Improvement ideas / feature requests

UsernamesLame commented 2 months ago

As promised, here's the thread I'm making for this.

RE: pre-processing:

In pywhispercpp/model.py we have transcribe and it can take a numpy ndarray. What I was thinking is, rather than load in audio, crush it to mono, set it to 16khz, why not pre-process all that and generate binary blob files that we can feed in that just contain the numpy ndarray?

It's not a big performance increase, but anything we can do outside of Python land ahead of time will give us a win. And I'm ok chasing micro-optimizations in Python land. I'm useless in C++ land.

Also let's put all logging behind a flag to disable it. If possible, let's add a flag to disable whisper.cpp's incessant logging to stderr. I know it has no impact on the transcription itself, but it should be controllable.
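
A minimal sketch of what such a flag could look like on the Python side, using only the standard `logging` module. The logger name `pywhispercpp` and the helper `set_logging` are assumptions made for illustration, not the library's actual API. Silencing whisper.cpp's own stderr output is a separate matter and is not covered here.

```python
import logging

# Assumed logger name; the real package may use a different one.
LOGGER_NAME = "pywhispercpp"


def set_logging(enabled: bool) -> logging.Logger:
    """Turn the package's Python-side log output on or off (hypothetical helper)."""
    logger = logging.getLogger(LOGGER_NAME)
    # A disabled logger drops every record without touching its handlers.
    logger.disabled = not enabled
    return logger


logger = set_logging(False)
logger.info("Transcribing ...")  # dropped while logging is disabled
```

A single boolean keeps the switch cheap to check and easy to expose as a constructor argument later.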

RE: copy.deepcopy

We need to drop @staticmethod everywhere, and implement the deep copy methods on the C++ side. This is a minor request from me; it would just let us initialize the model in memory and create a deep copy that we can treat as a completely independent instance.

The other option is I can write a helper class using BytesIO to hold the model in memory and we can feed that to the Model class I guess? It would still be better than re-initializing the model to create a sterile instance.

RE: micro-optimizations

Under _get_segments we have assert end <= n, f"{end} > {n}: `End` index must be less or equal than the total number of segments" but I have to ask, is it even possible to end up in a situation where this assert would actually fire?

RE: features

Let's make the model usable in a context manager so we can do quick and dirty things like:


with Model("base.en", n_threads=6) as model:
    # transcribe returns the list of segments for the file
    for segment in model.transcribe("file.mp3"):
        print(segment)

Not really necessary, just gives a more pleasant way of interacting with the model class.
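
A rough sketch of how `with` support could be bolted on without touching the model class itself. `ManagedModel` and `FakeModel` are hypothetical names introduced here; `FakeModel` only exists so the example runs without whisper.cpp installed, and the real `Model` would take its place as the factory.

```python
class ManagedModel:
    """Hypothetical wrapper that adds `with` support around a model object."""

    def __init__(self, factory, *args, **kwargs):
        # `factory` stands in for the real Model class; it is called on entry.
        self._factory = factory
        self._args = args
        self._kwargs = kwargs
        self.model = None

    def __enter__(self):
        self.model = self._factory(*self._args, **self._kwargs)
        return self.model

    def __exit__(self, exc_type, exc, tb):
        # Drop the reference so the underlying resources can be freed.
        self.model = None
        return False  # never swallow exceptions


class FakeModel:
    """Stand-in used only to make this sketch runnable."""

    def __init__(self, name, n_threads=1):
        self.name = name
        self.n_threads = n_threads

    def transcribe(self, path):
        return [f"segment from {path}"]


managed = ManagedModel(FakeModel, "base.en", n_threads=6)
with managed as model:
    segments = model.transcribe("file.mp3")
```

Implementing `__enter__` and `__exit__` directly on the model class would give the same usage with one less object; the wrapper is just the least invasive way to try the idea.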

UsernamesLame commented 2 months ago

Looked into the numpy array saving: https://numpy.org/doc/stable/reference/generated/numpy.save.html

We can save the converted audio files to disk before feeding them to the model. This way we technically bypass the need for PyDub and ffmpeg. It also means no launching background processes (PyDub with ffmpeg) to manipulate audio so it's ready for the model to ingest.

abdeladim-s commented 2 months ago

@UsernamesLame, Thanks for the ideas!

I don't think I understand the first point correctly, maybe some code will make it clear.
About logging: Yes it's annoying that logs are written to stderr, it's possible to add the flag, but needs some tweaks.
copy.deepcopy: What's that for ? You can create as many instances as you want! Maybe some code will be useful in here as well.
_get_segments: Yes might happen, if you want to get segments more than what whispercpp actually generated.
The context: Good feature, I actually started it at that time but I don't remember what happened why it's not there :sweat_smile:
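
On the `_get_segments` point, a small illustration of the case described: asking for more segments than were generated. `get_segments` below is a hypothetical stand-in written for this example, not the library's implementation.

```python
def get_segments(all_segments, start=0, end=None):
    """Hypothetical stand-in: return all_segments[start:end] with a bounds check."""
    n = len(all_segments)
    end = n if end is None else end
    assert end <= n, f"{end} > {n}: `end` must not exceed the total number of segments"
    return all_segments[start:end]


generated = ["seg0", "seg1", "seg2"]  # pretend only 3 segments were produced

first_two = get_segments(generated, 0, 2)

try:
    get_segments(generated, 0, 5)  # asks for more segments than exist
    failed = False
except AssertionError:
    failed = True
```

Note that `assert` statements are skipped when Python runs with `-O`, so a check that must always hold is usually better raised as an explicit exception.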

UsernamesLame commented 2 months ago

This is what I was trying to explain:


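from pydub import AudioSegment
import numpy as np
import pywhispercpp.constants as constants

media_file_path = "audio.mp3"  # placeholder path
sound = AudioSegment.from_file(media_file_path)
sound = sound.set_frame_rate(constants.WHISPER_SAMPLE_RATE).set_channels(1)

samples = sound.get_array_of_samples()
arr = np.array(samples).T.astype(np.float32)
# Scale integer PCM to [-1, 1] using the max value of the sample type
arr /= np.iinfo(samples.typecode).max

with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)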

I haven't tested it yet, but the idea is do all the operations we need on the numpy array ahead of time and then later we can just do something like:

array = np.load("file.npy")

_transcribe(array)

This way we can mass process our audio files before we load them into memory for whisper to process.

@abdeladim-s hopefully this makes sense now! The idea is to batch process hundreds if not thousands of audio files ahead of time in parallel (I can write a script to do this for us) and save them in a format we can just load into the model and get transcriptions back from.

Yes I know numpy should be fast but every context switch we can avoid the better.
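
For reference, the normalise-then-dump round trip can be checked in isolation. This is a sketch on synthetic data, assuming 16-bit PCM input; `pcm16_to_float32` is a name made up for the example and no audio decoding is involved.

```python
import os
import tempfile

import numpy as np


def pcm16_to_float32(samples: np.ndarray) -> np.ndarray:
    """Scale 16-bit PCM samples to float32 in roughly [-1.0, 1.0]."""
    return samples.astype(np.float32) / np.iinfo(np.int16).max


# One second of synthetic 16 kHz mono audio instead of a decoded media file.
pcm = np.array([0, 16384, -16384, 32767] * 4000, dtype=np.int16)
audio = pcm16_to_float32(pcm)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.npy")
    np.save(path, audio, allow_pickle=False)
    restored = np.load(path, allow_pickle=False)
```

The array that comes back has the same dtype, shape and values as the one that was saved, so whatever the model sees is decided entirely by the conversion step, not by the dump.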

abdeladim-s commented 2 months ago

PyDub and ffmpeg are actually there for the conversion to numpy arrays!
If we have numpy arrays, why would we need to save them to disk?

UsernamesLame commented 2 months ago

Pre-processing. Every context switch we can avoid the better! Imagine transcribing thousands of files.

The current solution looks like this:

pywhispercpp -> PyDub -> ffmpeg -> PyDub -> pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp

With my proposal it would look more like this:

pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp

The goal isn't to make this a full replacement for the existing solution, but tomorrow I'll write a demo showing an alternative way to load data into the model, cutting out as many context switches as possible to gain some performance.

abdeladim-s commented 2 months ago

Okay, so the idea is to process a large number of files? But I think it's the same, if not worse, taking into consideration the overhead of saving and loading the files to/from disk. And you will need to wait for the conversion either way.
IO operations are slower than working in memory.

UsernamesLame commented 2 months ago

IO operations are generally cheaper than context switches. I'll test this unless you want to.

I can also read the files into memory and store them in a BytesIO object and read from it like a filesystem object too. There's a lot of ways this can be taken. But I genuinely believe that avoiding context switches > IO

RE: deepcopy

You can create completely independent objects that are clones of existing objects. Think instead of myModel = Model, we do myModel = existingModel.deepclone(). So we don't read the model weights from disk again, but instead do an in memory copy.
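
To make the clone idea concrete, here is a plain-Python illustration of the semantics. `InMemoryModel` is a hypothetical stand-in; the real model wraps a C++ context, so an equivalent would need support on the C++ side as mentioned earlier in the thread.

```python
import copy


class InMemoryModel:
    """Hypothetical stand-in holding 'weights' and per-instance state."""

    def __init__(self, weights: bytes):
        self.weights = weights         # immutable blob, as if read from disk once
        self.state = {"segments": []}  # mutable, must not be shared between clones

    def __deepcopy__(self, memo):
        # Bytes are immutable, so the blob can be shared safely;
        # only the mutable state is duplicated.
        clone = InMemoryModel(self.weights)
        clone.state = copy.deepcopy(self.state, memo)
        return clone


existing = InMemoryModel(b"\x00" * 1024)
my_model = copy.deepcopy(existing)
my_model.state["segments"].append("hello")
```

In this sketch the clone never goes back to disk and its state is independent of the original, which is the behaviour being asked for.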

abdeladim-s commented 2 months ago

Yes please, go ahead and test! Experiments and numbers will save us a lot of talk :) Looking forward to it!

UsernamesLame commented 2 months ago

So I'm testing with a sample 33mb mp3 and the results are promising. Pre-processing into a numpy array and saving to disk shrinks it to 5.4mb so we can definitely have an impact on memory footprint with a helper script! Let me test transcription performance.

UsernamesLame commented 2 months ago

I have numbers for you @abdeladim-s!

Here's the script:

```py
from pywhispercpp.model import Model
import numpy as np
import time


def usenumpy():
    model = Model('base')
    audio_data = np.load("file.npy")
    segments = model.transcribe(audio_data)
    for segment in segments:
        print(segment)


def useaudiofile():
    model = Model('base')
    segments = model.transcribe("audio.mp3")
    for segment in segments:
        print(segment)


begin = time.time()
usenumpy()
end = time.time()
print("*" * 20)
print(f"using raw numpy array finished in {end - begin}")
print("*" * 20)

begin = time.time()
useaudiofile()
end = time.time()
print("*" * 20)
print(f"using mp3 file inished in {end - begin}")
print("*" * 20)
```

Here's the results!

********************
using raw numpy array finished in 2.6472320556640625
********************

********************
using mp3 file inished in 26.56456184387207
********************

On a M1 Pro MBP with 16gb of ram, not using the Metal backend, using the base whisper model.

Told ya it would have an improvement on processing time to pre-process the audio files into Numpy arrays!

This computer has a memory bandwidth of 200GB/s, and disk bandwidth of around 4GB/s. Context switching costs more than just loading raw data into memory :)

I am going to chase every optimization I can like a dog chases its tail.

UsernamesLame commented 2 months ago


from pydub import AudioSegment
import numpy as np

sound = AudioSegment.from_file("audio.mp3")

sound = sound.set_frame_rate(1600).set_channels(1)
arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)

with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)

This is the pre-conversion script. I'm going to update WhisperWav to output numpy arrays that can be fed directly into the model.

UsernamesLame commented 2 months ago

So I tried pre-converting a few files. Most work, but at random NumPy completely mangles the conversion to an ndarray and the save, leading to UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data from numpy.load.

If anyone has any idea why it's randomly mangling things I'd love help here.

Edit:

Yea I'm at a complete dead end as to why numpy insists on butchering audio files at random when saving. When it works, the speedups are insane. When it doesn't work, the errors are absolutely useless.

Edit 2:

I decided to see if Copilot could help. It suggested:


with open("audio.npy", "rb") as f:
    audio_data = np.fromfile(f, dtype=np.float32)

And so far it seems to be working?
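
A note of caution on that suggestion, as far as NumPy's file handling goes: `np.save` writes a `.npy` file that starts with a header, while `np.fromfile` reads raw bytes and does no header parsing. The check below, on synthetic data rather than real audio, shows the difference between the two ways of reading the same file.

```python
import os
import tempfile

import numpy as np

samples = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "audio.npy")
    np.save(path, samples, allow_pickle=False)

    loaded = np.load(path)                     # parses the .npy header
    raw = np.fromfile(path, dtype=np.float32)  # treats the whole file as float32

# Header bytes get misread as extra "samples" at the start of the array.
extra = raw.size - loaded.size
```

So `np.fromfile` on a `.npy` file may appear to work while prepending garbage values to the audio. It pairs with `arr.tofile(...)`, which writes headerless data; `np.load` pairs with `np.save`.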

UsernamesLame commented 2 months ago

Ok so final comment for now. A 42m audio file at 101 mb once crushed to mono and audio bitrate set to 1600khz becomes a 17mb~ npy file.

Processing the npy file takes around 10 seconds. Processing the raw wav file takes around 63 seconds.

This doesn't seem like an error or unreasonable. Can someone else please try and reproduce? Are we literally spending that much time prepping the file?!

abdeladim-s commented 2 months ago

I still don't get what you are trying to achieve, but if I understand it correctly, it's basically the same as what I did, except that you are trying to dump and load the npy array, and you've made a deadly bug! lol

Also, when you did the experiment, why didn't you include the time needed to convert the files to npy? People are not moving around with dumped npy arrays of their media files :sweat_smile:

Here is what I think this should be:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from pywhispercpp.model import Model
import numpy as np
import time
from pydub import AudioSegment

def usenumpy():
    # This part from your script should be included as well! ##########
    sound = AudioSegment.from_file("audio.mp3")
    # Here 16Khz not 1600 !!!! That's what you were doing wrong !!! 
    sound = sound.set_frame_rate(16000).set_channels(1)
    arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    arr /= np.iinfo(np.int16).max # Normalization is important! otherwise you will get 'utf-8' codec can't decode bytes
    # dump array to npy file
    with open("file.npy", "wb") as file:
        np.save(file, arr, allow_pickle=False)
    #################### 
    # load model
    model = Model('base')
    # load array from npy file
    audio_data = np.load("file.npy")
    segments = model.transcribe(audio_data)
    for segment in segments:
        print(segment)

def useaudiofile():
    model = Model('base')
    segments = model.transcribe("audio.mp3")
    for segment in segments:
        print(segment)

begin = time.time()
usenumpy()
end = time.time()
print("*" * 20)
print(f"using raw numpy array finished in {end - begin}")
print("*" * 20)

begin = time.time()
useaudiofile()
end = time.time()
print("*" * 20)
print(f"using mp3 file finished in {end - begin}")
print("*" * 20)

I used this file from my other project, here are the results:

[2024-08-30 17:34:17,168] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:33,929] {model.py:133} INFO - Inference time: 16.761 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using raw numpy array finished in 17.416718244552612
********************
[2024-08-30 17:34:34,516] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:50,128] {model.py:133} INFO - Inference time: 15.612 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using mp3 file finished in 16.196656465530396
********************

This is not a real experiment per se, but as you can see, they are almost the same. There is no need to dump and load the numpy array!

Lmk what you think?

UsernamesLame commented 2 months ago

I caught the deadly bug locally and fixed it locally.

As for performance, it's odd you aren't getting better results and I am.

I'm guessing it has something to do with the memory bandwidth of M1 Pro vs x86 chips?

But yea you're understanding now. I haven't tested it on x86. Also I didn't include the converting to numpy arrays because the idea is to mass transform it then transcribe.

At least one benefit is the numpy arrays are generally smaller in my experience.

What are your system specs btw? And Python version? I'm using 3.12 and getting good results.

If I can't increase performance I can at least lower memory usage I guess. 😅

My idea is to let the model be long lived and keep feeding it fresh array dumps as it transcribes them one after another. This way in a different process (I'm going to edit e action to show this) we can spawn subprocesses to mass convert media files to numpy arrays.

The idea is that the model is the limiting factor, as in most people don't have the CPU / RAM to load 2 - 4 models, so if we can pre-process the files so the model can transcribe faster with less memory, it's still a (small) win!

I have access to a 128-core ARM box that is piss slow at transcribing but can quickly spit out these numpy arrays.

It's not gonna benefit everyone, but it's worth exploring the thought. It's also possible to store all the numpy arrays in a single database that clients running the models pull from to transcribe, creating a transcription cluster. The big benefit being that the clients can be small, like a Raspberry Pi, and still get considerably faster transcriptions.

UsernamesLame commented 2 months ago

I'm running a few more tests, including ensuring the numpy arrays produce the same results as the mp3, mostly because I can't believe that after crushing the frame rate, the response frequency, and the channels, I can go from 100mb to 70mb.

As of now, numpy is getting me 17 seconds while mp3 is 69 seconds. Timing the conversion to a numpy array gets me 5 seconds. So 21 seconds vs 69.

The performance gap has shrunk, but it's not gone. It's still ~3x faster to pre-process numpy arrays and then load them. I'm not saying everyone should, but it would make a fun example!

Edit:

I forgot to mention, I got us added to Whisper.cpp's README.md :)

https://github.com/ggerganov/whisper.cpp/pull/2396

Merged already. I felt like we were ready for more visibility.

abdeladim-s commented 2 months ago

I have an i7 (8c/16t) with 32 GB DDR4, running Python 3.10. When I tested the code you provided with the 1600 sample rate, I got results similar to yours, which is expected because that's a 10x down-sampling. But once I fixed it to 16000, the timings are almost the same. It's the same algorithm running under the hood anyway!
I can see the benefits of batch pre-processing, and this is exactly why I made the transcribe function accept both an audio file and a numpy array. If you want something quick, you can throw in whatever file and the library will convert it for you; if you are a power user and know what you are doing, you can use numpy arrays directly, in which case the pre-processing step is skipped! I think from a library point of view this gives users more flexibility!
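To illustrate the "file or array" behaviour described above, here is a small sketch of that kind of dispatch. It is not the library's actual code: `prepare_audio` and the `load_audio` callable are hypothetical names used only for this example.

```python
import numpy as np


def prepare_audio(media, load_audio):
    """Return float32 samples, converting only when `media` is not an array.

    `load_audio` is a hypothetical callable (path -> np.ndarray) standing in
    for whatever file decoder a library would use.
    """
    if isinstance(media, np.ndarray):
        # Already decoded: the pre-processing step is skipped entirely.
        return media.astype(np.float32, copy=False)
    # Anything else is treated as a path and decoded on the fly.
    return load_audio(media)
```

With this shape, a power user pays no conversion cost at transcription time, while a casual user can still pass a file name.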

abdeladim-s commented 2 months ago

I'm running a few more tests, including ensuring the numpy arrays produce the same results as the mp3, mostly because I can't believe that after crushing the frame rate, the response frequency, and the channels, I can go from 100mb to 7mb.

As of now, numpy is getting me 17 seconds while mp3 is 69 seconds. Timing the conversion to a numpy array gets me 5 seconds. So 21 seconds vs 69.

The performance gap has shrunk, but it's not gone. It's still ~3x faster to pre-process numpy arrays and then load them. I'm not saying everyone should, but it would make a fun example!

Edit:

I forgot to mention, I got us added to Whisper.cpp's README.md :)

ggerganov/whisper.cpp#2396

Merged already. I felt like we were ready for more visibility.

You can't tell from one example! You have to run the test multiple times and average the results. It's the same algorithm I used, so you should get basically the same results, unless there is some magic in dumping and loading the .npy files.
Oh, I just noticed you made a PR for this. You really think we are ready?! It's a small project that doesn't deserve that visibility 😅 But thanks anyway!

UsernamesLame commented 2 months ago

I'm running a few more tests, including ensuring the numpy arrays produce the same results as the mp3, mostly because I can't believe that after crushing the frame rate, the response frequency, and the channels, I can go from 100mb to 7mb. As of now, numpy is getting me 17 seconds while mp3 is 69 seconds. Timing the conversion to a numpy array gets me 5 seconds. So 21 seconds vs 69. The performance gap has shrunk, but it's not gone. It's still ~3x faster to pre-process numpy arrays and then load them. I'm not saying everyone should, but it would make a fun example! Edit: I forgot to mention, I go us added to Whisper.cpp's README.md :) ggerganov/whisper.cpp#2396 Merged already. I felt like we were ready for more visibility.

You can't tell from one example! You have to test multiple times and average the results, It's the same algorithm I used, so you should get basically the same results, unless there is some magic in dumping and loading the npy files

Oh, I just noticed you made a PR for this, you really think we are ready?! It's a small project, does not deserve that visibility 😅 But Thanks anyways!

Re testing: I know one test isn't enough, but still it's promising!

Re pywhispercpp: It 100% deserves the visibility!

Also I double checked I'm using 16000 locally, and:

 ********************
using raw numpy array finished in 10.000927925109863
********************

That's still a pretty drastic difference. Also, when I accidentally did it with 1600, there was no real drop in accuracy on simpler audio files.
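Since the 1600 vs 16000 mix-up changes how much audio data the model actually sees (an array resampled to 1600 Hz holds a tenth of the samples), a quick sanity check on array length can catch it. This is a sketch with hypothetical helper names, assuming the clip's real duration is known from elsewhere.

```python
def implied_duration_seconds(n_samples, sample_rate):
    """Duration that `n_samples` mono samples represent at `sample_rate` Hz."""
    return n_samples / sample_rate


def check_sample_rate(n_samples, known_duration_s, sample_rate=16000, tolerance=0.05):
    """Return True if the array length matches the known duration.

    `tolerance` is a relative margin (5% by default), an arbitrary choice
    made for this example.
    """
    implied = implied_duration_seconds(n_samples, sample_rate)
    return abs(implied - known_duration_s) <= tolerance * known_duration_s
```

For a one-hour clip, an array produced at 16000 Hz passes the check, while one accidentally produced at 1600 Hz implies only six minutes of audio and fails it.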

abdeladim-s commented 2 months ago

I don't think there should be a drastic difference, as long as you are using the same algorithm as _load_audio.
If you have numpy arrays, you can pass them to the transcribe function without any problem; as I said, the pre-processing step won't be executed!
Or maybe I'm wrong and I missed something, and I need to make an optimization somewhere!

UsernamesLame commented 2 months ago

Let's put my numpy theories to the test. I'm going to crush around 6h of audio into numpy arrays and transcribe it.

I think It should not be a drastic difference in my opinion, as long as you are using the same algorithm as _load_audio.

If you have numpy arrays you can pass them through the transcribe function without any problem, as I said, the pre-processing step won't be executed!

Or maybe I am wrong and I missed something! and I need to make an optimization somewhere!

It's really down to batch processing and pre-normalizing the numpy arrays making a very big difference on ARM (M1 Pro). I'm going to test feeding around 7.5h of audio into it and post the results.

Edit:

Just over 6gb of files converted into numpy arrays in 33 seconds. Time to transcribe!

Edit 2:

Whisper just spat out some debug logs. 174 seconds to transcribe 1h of audio with normalized numpy arrays!

Extrapolating this, it should take 17 minutes to transcribe >6h of audio. Let's see what actually happens, as whisper spat out another debug log saying it finished in 147 seconds.
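The extrapolation above is a simple linear estimate. A tiny sketch of the arithmetic, with hypothetical helper names and the assumption that every hour of audio takes about as long as the first:

```python
def extrapolate_seconds(seconds_per_unit, total_units):
    """Linear estimate: time for one unit of audio times the number of units."""
    return seconds_per_unit * total_units


def as_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60
```

With 174 seconds per hour of audio and 6 hours in total, `extrapolate_seconds(174, 6)` gives 1044 seconds, which `as_minutes` turns into 17.4 minutes. The second log line (147 seconds) shows the per-hour cost varies, so this is only a rough guide.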

UsernamesLame commented 2 months ago

********************
using raw numpy array finished in 1105.0404160022736
********************

neat!

Edit:

We have an initial number for the 1gb wav files: 187s. Extrapolating again, 1122 seconds or about 18.7 minutes.

So far the speed up isn't that promising, but the next check should be memory usage!

UsernamesLame commented 2 months ago

https://github.com/EtienneAb3d/WhisperHallu?tab=readme-ov-file

I found this, a project about optimizing for whisper!

abdeladim-s commented 2 months ago

********************
using raw numpy array finished in 1105.0404160022736
********************
neat!

Edit:

We have a initial number for the 1gb wav files. 187s. Extrapolating again, 1122 seconds or 18 minutes.

So far the speed up isn't that promising, but the next check should be memory usage!

Interesting result!

abdeladim-s commented 2 months ago

EtienneAb3d/WhisperHallu

I found this, a project about optimizing for whisper!

Sounds great, I'll take a look

UsernamesLame commented 2 months ago

********************
using wav finished in 1575.6269478797913
********************

ouch!

UsernamesLame commented 2 months ago

********************
using raw numpy array finished in 1105.0404160022736
********************
neat! Edit: We have a initial number for the 1gb wav files. 187s. Extrapolating again, 1122 seconds or 18 minutes. So far the speed up isn't that promising, but the next check should be memory usage!
Interesting result!

26 minutes for raw wav files, about 18 minutes with numpy arrays.

I think we have a winner? Opinion?

Next test will be memory usage I guess.
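To put the two timings side by side, here is a short sketch that computes the speed-up from the numbers reported above. The helper names are made up for this example; the two constants are copied from the logged output.

```python
WAV_SECONDS = 1575.6269478797913    # "using wav finished in ..."
NUMPY_SECONDS = 1105.0404160022736  # "using raw numpy array finished in ..."


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def time_saved_fraction(baseline_seconds, candidate_seconds):
    """Fraction of the baseline run time that the candidate saves."""
    return 1 - candidate_seconds / baseline_seconds
```

On these figures the numpy run is roughly 1.43x faster, saving about 30% of the wall time. If the 33 seconds of up-front conversion reported earlier is added to the numpy side, the ratio drops to about 1.38x, so it is a single-run result that still needs repeating.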

abdeladim-s commented 2 months ago

Interesting! I think it's because of the parallel pre-conversion of the files to numpy. For a small number of files, this won't have a huge effect! But I have an idea: if you can replicate the same on Colab, that will give us a clear view of what's really happening in a fresh environment!

UsernamesLame commented 2 months ago

Interesting! I think it's because of the parallel pre-conversion of the files to numpy. For a small number of files, this won't have a huge effect! But I have an idea: if you can replicate the same on Colab, that will give us a clear view of what's really happening in a fresh environment!

I've never used colab before, so here's the code.


from pywhispercpp.model import Model
import numpy as np
import time
import os
from glob import glob

model = Model('base')

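def usenumpy():
    # Transcribe every pre-converted array written by np.save (*.npy).
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".npy")]
    for file in files:
        # np.load parses the .npy header; np.fromfile would read the header
        # bytes as if they were audio samples.
        audio_data = np.load(file, allow_pickle=False)
        numpy_segments = model.transcribe(audio_data)

def usewav():
    # Transcribe the original .wav files and let the library convert them.
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".wav")]
    for file in files:
        wav_segments = model.transcribe(file)

# Swap usewav() for usenumpy() below to time the pre-converted arrays.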

begin = time.time()
usewav()
end = time.time()
print("*" * 20)
print(f"using wav finished in {end - begin}")
print("*" * 20)

I used cobalt.tools to download a 1.5h video's audio from YouTube as a WAV, then converted it with this:

from pydub import AudioSegment
import numpy as np
from glob import glob
import os
import time

begin = time.time()

files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith((".npy", ".md", ".txt", ".py", ".cfg"))]

for file in files:
    sound = AudioSegment.from_file(file)

    sound = sound.set_frame_rate(16000).set_channels(1)
    numpy_array = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    numpy_array /= np.iinfo(np.int16).max

    with open(f"{file}.npy", "wb") as f:
        np.save(f, numpy_array, allow_pickle=False)

end = time.time()
print(f"{end - begin} seconds elapsed")

I feel like it should be ok to feed it the same audio file 6 times to get a general idea as it seems like whisper performs worse with each pass, not better.

If you want to make a Colab / Jupyter notebook, I'll gladly poke around with you. My theory is that the audio files being massive is causing the issue. The numpy arrays I save to disk are much smaller by comparison. The .wav is around 1gb, the .npy is around 393mb.
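For reference, a mono float32 array at 16 kHz takes 16000 x 4 = 64,000 bytes per second of audio plus a small .npy header, so the on-disk size follows directly from the duration. The sketch below (hypothetical `roundtrip` helper, temporary file name chosen for the example) shows a save/load round trip and why `np.load` is the matching reader for files written by `np.save`.

```python
import os
import tempfile

import numpy as np


def roundtrip(samples):
    """Save `samples` with np.save and read them back two different ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "audio.npy")
        np.save(path, samples, allow_pickle=False)
        # np.load understands the .npy header and returns the original array.
        loaded = np.load(path, allow_pickle=False)
        # np.fromfile treats the whole file as raw floats, header included.
        raw = np.fromfile(path, dtype=np.float32)
    return loaded, raw
```

The array returned by `np.load` matches the input exactly, while the `np.fromfile` result is slightly longer because the header bytes are misread as samples at the start of the audio.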

Anyways, for now I must say goodnight my friend! Don't let the geese bite!

abdeladim-s commented 2 months ago

So, the large files are causing the issue?! Probably! But I am still confused about why convert -> save -> load -> transcribe is faster than convert -> transcribe.

Anyway, good luck with your exploration. Let me know if you find any optimizations we can add to the repo. Goodnight :)

UsernamesLame commented 2 months ago

So, the large files are causing the issue ?! Probably!

But I am still confused, why, convert -> save -> load -> transcribe is faster than convert -> transcribe.

Anyways, good luck with your exploration, let me know if find any optimizations we can add to the repo,

Goodnight :)

The conversion ahead is faster because we're just converting?

I'm not sure to be honest.

abdeladim-s commented 2 months ago

Probably! I am confused to be honest.

UsernamesLame commented 2 months ago

Probably!

I am confused to be honest.

Same here to be completely honest.

It's not like the files are small even after conversion. I guess it's just the context switches I mentioned are really that bad.

abdeladim-s commented 2 months ago

Okay, let's leave this open for now. Hopefully, we will get opinions and experiments from others as well!

BBC-Esq commented 2 months ago

********************
using raw numpy array finished in 1105.0404160022736
********************
neat! Edit: We have a initial number for the 1gb wav files. 187s. Extrapolating again, 1122 seconds or 18 minutes. So far the speed up isn't that promising, but the next check should be memory usage!
Interesting result!
26 minutes for raw wav files, 17 minutes with numpy arrays.

I think we have a winner? Opinion?

Next test will be memory usage I guess.

@abdeladim-s and @UsernamesLame

This conversation caught my attention for some ungodly reason...Anyways, here's my contribution...try using the av library instead of pydub.

Try a script like this and let's see the speed up of the conversion to numpy compared to pydub 👍

import av
import numpy as np
from glob import glob
import os
import time

def convert_to_numpy(file):
    container = av.open(file)
    audio = container.streams.audio[0]

    resampler = av.audio.resampler.AudioResampler(
        format='s16',
        layout='mono',
        rate=16000
    )

    audio_frames = []
    for frame in container.decode(audio):
        resampled_frames = resampler.resample(frame)
        for resampled_frame in resampled_frames:
            audio_frames.append(resampled_frame)

    if not audio_frames:
        return np.array([])

    numpy_array = np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])
    numpy_array = numpy_array.astype(np.float32)
    numpy_array /= np.iinfo(np.int16).max

    return numpy_array

begin = time.time()
files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith((".npy", ".md", ".txt", ".py", ".cfg"))]

for file in files:
    numpy_array = convert_to_numpy(file)
    with open(f"{file}.npy", "wb") as f:
        np.save(f, numpy_array, allow_pickle=False)

end = time.time()
print(f"{end - begin} seconds elapsed")

abdeladim-s commented 2 months ago

@BBC-Esq, Yes, PyAV is a great library too, but like Pydub, it uses ffmpeg under the hood. Therefore, I believe both will offer similar execution times for our use case.

BBC-Esq commented 2 months ago

@BBC-Esq, Yes, PyAV is a great library too, but like Pydub, it uses ffmpeg under the hood. Therefore, I believe both will offer similar execution times for our use case.

I benched both libraries and initially pydub was faster at 18 seconds and av was slower, but then upon reviewing the documentation I found a way to get av down to 9 seconds. Although they both use ffmpeg, I'm guessing it's because of the different pipelines and usage that each one offers. I love pydub for ease of use, but it hasn't been updated since 2021 and av is actively maintained, albeit more complicated.

Test them out and let me know!

BTW, this was converting the Sam Altman .flac file into a numpy file. It's approximately two hours long...but I'm sure there's ways one could batch multiple files as well.

abdeladim-s commented 2 months ago

@BBC-Esq, Yes, PyAV is a great library too, but like Pydub, it uses ffmpeg under the hood. Therefore, I believe both will offer similar execution times for our use case.

I benched both libraries and initially pydub was faster at 18 seconds and av was slower, but then upon reviewing the documentation I found a way to get av down to 9 seconds. Although they both use ffmpg I'm guessing it's because of the different pipelines and usage that each one offers. I love pydub for ease of use, but it hasn't been updated since 2021 and av is massively maintained, albeit it's more complicated.

Test them out and let me know!

BTW, this was converting the Sam Altman .flac file into a numpy file. It's approximately two hours long...but I'm sure there's ways one could batch multiple files as well.

Okay, so I've tested them out and found that sometimes Pydub is faster, while other times the execution times are similar. You can find the code in this gist I made.

Let me know how you made PyAV faster!

BBC-Esq commented 2 months ago

@BBC-Esq, Yes, PyAV is a great library too, but like Pydub, it uses ffmpeg under the hood. Therefore, I believe both will offer similar execution times for our use case.

I benched both libraries and initially pydub was faster at 18 seconds and av was slower, but then upon reviewing the documentation I found a way to get av down to 9 seconds. Although they both use ffmpg I'm guessing it's because of the different pipelines and usage that each one offers. I love pydub for ease of use, but it hasn't been updated since 2021 and av is massively maintained, albeit it's more complicated. Test them out and let me know! BTW, this was converting the Sam Altman .flac file into a numpy file. It's approximately two hours long...but I'm sure there's ways one could batch multiple files as well.

Okay, so I've tested them out and found that sometimes Pydub is faster, while other times the execution times are similar. You can find the code in this gist I made.

Let me know how you made PyAV faster!

Sure, here's the benchmarking script that I used, you'd just add your own custom path to an audio file at the bottom:

import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())

        samples = np_array_conversion()

        @timeit
        def np_float_conversion():
            return samples.astype(np.float32)

        audio_array = np_float_conversion()

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        output_file = f"{self.base_name}_pydub.npy"

        @timeit
        def np_save(arr, file):
            np.save(file, arr)

        np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]

        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)

            if not audio_frames:
                return np.array([])

            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])

        audio_array = get_array_of_samples()

        @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)

        audio_array = np_float_conversion(audio_array)

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        output_file = f"{self.base_name}_av.npy"

        @timeit
        def np_save(arr, file):
            np.save(file, arr)

        np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time

def benchmark(input_file):
    converter = AudioConverter(input_file)

    pydub_time = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")

    av_time = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\test_flac.flac"
    benchmark(input_file)

BBC-Esq commented 2 months ago

Can you test your script on the same audio file I did...the sam altman interview?

https://huggingface.co/datasets/reach-vb/random-audios/blob/main/sam_altman_lex_podcast_367.flac

BBC-Esq commented 2 months ago

In my above script, these are the results processing the altman audio:

np_array_conversion took 0.083757 seconds
np_float_conversion took 0.102801 seconds
np_normalization took 0.100766 seconds
np_save took 0.271963 seconds
Pydub conversion took 18.740992 seconds
get_array_of_samples took 8.872952 seconds
np_float_conversion took 0.081948 seconds
np_normalization took 0.102745 seconds
np_save took 0.274487 seconds
AV conversion took 9.364995 seconds

I "borrowed" from your script merging a couple steps, which was a cool idea, and there was a small speed increase for pydub:

pydub_to_numpy took 0.312771 seconds
np_save took 0.306345 seconds
Pydub conversion took 17.790903 seconds
av_to_numpy took 8.956160 seconds
np_save took 0.292697 seconds
AV conversion took 9.253701 seconds

BBC-Esq commented 2 months ago

As an ancillary matter, I spent a fair amount of time testing ffmpeg directly, using os in Python to run it on the command line... It was about 8% faster than av no matter how fast I could get av to run, which makes sense considering that av merely wraps ffmpeg and there must be some overhead... And obviously this just pertains to the audio handling and not creating the numpy array/file... With that being said, the benefit of av (or pydub for that matter) is that users don't have to separately install ffmpeg and add it to PATH, which the average non-programmer doesn't know how to do...
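For anyone who wants to try the direct-ffmpeg route, here is a minimal sketch. It uses `subprocess` rather than `os` (a substitution made for this example, not the script that was benchmarked), assumes an `ffmpeg` binary is on PATH, and the function names are hypothetical. The normalization mirrors the scripts in this thread.

```python
import subprocess

import numpy as np


def build_ffmpeg_cmd(path, sample_rate=16000):
    """Command that decodes `path` to mono 16-bit little-endian PCM on stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,
        "-f", "s16le", "-acodec", "pcm_s16le",
        "-ac", "1", "-ar", str(sample_rate),
        "-",
    ]


def pcm16_to_float32(raw_bytes):
    """Turn raw s16le bytes into float32 samples in roughly [-1, 1]."""
    samples = np.frombuffer(raw_bytes, dtype="<i2").astype(np.float32)
    return samples / np.iinfo(np.int16).max


def decode_with_ffmpeg(path, sample_rate=16000):
    """Run ffmpeg and return the decoded audio as a float32 array."""
    result = subprocess.run(
        build_ffmpeg_cmd(path, sample_rate), capture_output=True, check=True
    )
    return pcm16_to_float32(result.stdout)
```

Piping PCM through stdout avoids writing an intermediate WAV file; whether that accounts for the reported 8% is not something this sketch can confirm.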

Anyways, just thought it was an interesting conversation and wanted to experiment with it.

I also benched CuPy, which allows GPU acceleration for a lot of numpy's operations (straight CUDA and ROCm, btw). I'm holding back that script until I perfect it though... It's awesome but I need to get the batch processing optimized. Hehe...

Let me know if my script gives you different results than I got for some reason...

BBC-Esq commented 2 months ago

If you want to see how similar the arrays are you can use something like this as well...

import numpy as np
import cupy as cp  # only needed to load the CuPy-saved file

def compare_npy_files(file1, file2, file3):
    arr1 = np.load(file1)
    arr2 = np.load(file2)
    arr3 = cp.asnumpy(cp.load(file3))

    # Compare shapes and adjust if necessary
    min_length = min(arr1.size, arr2.size, arr3.size)
    arr1 = arr1[:min_length]
    arr2 = arr2[:min_length]
    arr3 = arr3[:min_length]

    diff_12 = arr1 - arr2
    diff_13 = arr1 - arr3
    diff_23 = arr2 - arr3

    abs_diff_12 = np.abs(diff_12)
    abs_diff_13 = np.abs(diff_13)
    abs_diff_23 = np.abs(diff_23)

    # Calculate and print histogram of differences
    print("\nHistogram of absolute differences:")
    for diff, label in [(abs_diff_12, "Pydub vs AV"), 
                        (abs_diff_13, "Pydub vs AV CuPy"), 
                        (abs_diff_23, "AV vs AV CuPy")]:
        hist, bin_edges = np.histogram(diff, bins=10)
        print(f"\n{label}:")
        for i, (start, end) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
            print(f"{start:.2e} to {end:.2e}: {hist[i]} samples")

abdeladim-s commented 2 months ago

Okay, so I tested the script you provided.

First off, we don't need to dump the array to .npy, so I'll comment that part out. Actually, I'm not interested in the other parts except for the actual conversion to NumPy, since the other parts are just NumPy operations and should be basically the same. Surprisingly, in your results, np.save shows a huge difference between the two implementations, which indicates that something might be wrong.

Second, the wrapper timeit isn't a good measure for benchmarking because you can't draw conclusions from just one execution. That's why Python has the timeit utility.
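As a sketch of what repeated measurement could look like with the standard library's `timeit` module: the `bench` helper below is a hypothetical name, and the workload in the usage note is a placeholder rather than the real conversion.

```python
import statistics
import timeit


def bench(func, repeat=5, number=1):
    """Time `func` over several trials and return per-call statistics.

    Each trial calls `func` `number` times; `repeat` controls how many
    trials are run.
    """
    trials = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in trials]
    return {
        "min": min(per_call),
        "mean": statistics.mean(per_call),
        "stdev": statistics.stdev(per_call) if repeat > 1 else 0.0,
    }
```

For example, `bench(lambda: converter.convert_av(), repeat=5)` would give a mean and spread for the AV path, which makes a one-off 9 vs 18 second gap much easier to trust or dismiss.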

But anyways, let's proceed with the script.

import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())

        samples = np_array_conversion()

        @timeit
        def np_float_conversion():
            return samples.astype(np.float32)

        audio_array = np_float_conversion()

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        # output_file = f"{self.base_name}_pydub.npy"

        ## @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)

        # np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]

        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)

            if not audio_frames:
                return np.array([])

            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])

        audio_array = get_array_of_samples()

        @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)

        audio_array = np_float_conversion(audio_array)

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        # output_file = f"{self.base_name}_av.npy"

        ## @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)

        # # np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

def benchmark(input_file):
    converter = AudioConverter(input_file)

    pydub_time, pydub_array = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    print(pydub_array.shape)

    av_time, av_array = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")

    print(av_array.shape)
    assert np.array_equal(pydub_array, av_array) is True

if __name__ == "__main__":
    # input_file = "audio.mp3"
    input_file = "/content/sam_altman_lex_podcast_367.flac"
    benchmark(input_file)

And here are the colab results

np_array_conversion took 0.567625 seconds
np_float_conversion took 0.081097 seconds
np_normalization took 0.215539 seconds
Pydub conversion took 30.776541 seconds
(138181951,)
get_array_of_samples took 29.561768 seconds
np_float_conversion took 0.181955 seconds
np_normalization took 0.287005 seconds
AV conversion took 30.038430 seconds
(138181934,)

As you can see, the conversion time is almost the same, but more importantly, the arrays are not equal: your AV implementation produces 17 fewer samples (138181934 vs 138181951). Unless both implementations produce the same arrays, the comparison doesn't make much sense!
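To quantify a mismatch like this instead of only asserting equality, a comparison over the common prefix can help. This is a sketch with a hypothetical `compare_prefix` helper; the 16 kHz default matches the scripts above.

```python
import numpy as np


def compare_prefix(a, b, sample_rate=16000):
    """Compare two sample arrays over their common length.

    Returns how many samples one array has beyond the other, what that is
    in milliseconds, and the largest absolute difference in the overlap.
    """
    n = min(a.size, b.size)
    extra = abs(a.size - b.size)
    max_abs_diff = float(np.max(np.abs(a[:n] - b[:n]))) if n else 0.0
    return {
        "extra_samples": extra,
        "extra_ms": 1000.0 * extra / sample_rate,
        "max_abs_diff": max_abs_diff,
    }
```

For the shapes reported above, the gap is 17 samples, or about 1.06 ms at 16 kHz. One unverified possibility is that the resampler still held a few buffered samples at the end of the stream; as far as I know, PyAV documents passing `None` to `resample()` to flush it, but that has not been tested here.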

UsernamesLame commented 2 months ago

Can you test your script on the same audio file I did...the sam altman interview?

https://huggingface.co/datasets/reach-vb/random-audios/blob/main/sam_altman_lex_podcast_367.flac

What's the difference between FLAC, MP3, WAV, etc.?

UsernamesLame commented 2 months ago

Okay, so I tested the script you provided.

First off, we don't need to dump the array to .npy, so I'll comment that part out. Actually, I'm not interested in the other parts except for the actual conversion to NumPy, since the other parts are just NumPy operations and should be basically the same. Surprisingly, in your results, np.save shows a huge difference between the two implementations, which indicates that something might be wrong.

Second, the wrapper timeit isn't a good measure for benchmarking because you can't draw conclusions from just one execution. That's why Python has the timeit utility.

Yea dumping to numpy was my idea for preprocessing thousands of files ahead of time to distribute across multiple whisper inference nodes.
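
For the preprocessing idea, a rough sketch of dumping an already converted array to a .npy file and reading it back on another node. The helper names `dump_audio` and `load_audio` and the file name are hypothetical, and the zero array only stands in for real 16 kHz mono audio.

```python
import os
import tempfile

import numpy as np


def dump_audio(arr, path):
    # Store as float32 so the file can be loaded and used without further conversion
    np.save(path, np.asarray(arr, dtype=np.float32))


def load_audio(path, mmap=False):
    # mmap_mode="r" avoids reading the whole file into memory up front
    return np.load(path, mmap_mode="r" if mmap else None)


if __name__ == "__main__":
    audio = np.zeros(16000, dtype=np.float32)  # placeholder: one second of silence
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.npy")
        dump_audio(audio, path)
        restored = load_audio(path)
        print(restored.shape, restored.dtype)
```

Whether this beats decoding on the fly depends on disk and network speed, since raw float32 files are much larger than compressed audio.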

abdeladim-s commented 2 months ago

As an ancillary matter, I spent a fair amount of time testing ffmpeg directly, using os in Python to run it from the command line... It was about 8% faster than av no matter how fast I could get av to run, which makes sense considering that av merely wraps ffmpeg and there must be some overhead... And obviously this just pertains to the audio handling and not creating the numpy array/file... With that being said, the benefit of av (or pydub for that matter) is that users don't have to separately install ffmpeg and add it to PATH, which the average non-programmer doesn't know how to do...
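
A minimal sketch of what calling ffmpeg directly could look like. It uses subprocess rather than os, and the function names are made up for illustration; it is not the script that was benchmarked. It assumes an ffmpeg binary is available on PATH.

```python
import subprocess

import numpy as np


def build_ffmpeg_cmd(path, sample_rate=16000):
    # Decode any input to raw 16-bit little-endian mono PCM written to stdout
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,
        "-f", "s16le", "-acodec", "pcm_s16le",
        "-ac", "1", "-ar", str(sample_rate),
        "-",
    ]


def pcm_to_float32(raw):
    # Interpret the bytes as little-endian int16 samples, then normalize
    # the same way the benchmark scripts in this thread do
    samples = np.frombuffer(raw, dtype="<i2")
    return samples.astype(np.float32) / np.iinfo(np.int16).max


def load_with_ffmpeg(path, sample_rate=16000):
    # Raises CalledProcessError if ffmpeg exits with a non-zero status
    result = subprocess.run(
        build_ffmpeg_cmd(path, sample_rate), capture_output=True, check=True
    )
    return pcm_to_float32(result.stdout)
```

Splitting the command construction from the byte conversion keeps the two pure parts easy to check without ffmpeg installed.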

Anyways, just thought it was an interesting conversation and wanted to experiment with it.

I also benched cupy, which allows GPU acceleration for a lot of numpy's operations (straight CUDA and ROCm, btw). I'm holding back that script until I perfect it though... It's awesome, but I need to get the batch processing optimized. Hehe...

Let me know if my script gives you different results than I got for some reason...

@BBC-Esq,

Yes, Pydub is just for someone who wants to quickly test things out without having to convert their media files beforehand, and that's why I made the transcribe function accept NumPy arrays as well!
Cupy is awesome. Let us know how it goes.

abdeladim-s commented 2 months ago

Can you test your script on the same audio file I did...the sam altman interview? huggingface.co/datasets/reach-vb/random-audios/blob/main/sam_altman_lex_podcast_367.flac

What's the difference between FLAC, MP3, WAV, etc.?

@UsernamesLame, Each format is encoded in a certain way so I suppose there might be some difference.

BBC-Esq commented 2 months ago

Okay, so I tested the script you provided.

First off, we don't need to dump the array to .npy, so I'll comment that part out. Actually, I'm not interested in the other parts except for the actual conversion to NumPy, since the other parts are just NumPy operations and should be basically the same. Surprisingly, in your results, np.save shows a huge difference between the two implementations, which indicates that something might be wrong.

Second, the wrapper timeit isn't a good measure for benchmarking because you can't draw conclusions from just one execution. That's why Python has the timeit utility.
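
A minimal sketch of what repeated measurement with the standard-library timeit module could look like. The `normalize` and `bench` functions and the synthetic input are placeholders chosen for illustration; they are not part of the benchmark script.

```python
import timeit

import numpy as np


def normalize(arr):
    # Same normalization step as in the script: int16 samples scaled to float32
    return arr.astype(np.float32) / np.iinfo(np.int16).max


def bench(func, *args, repeat=5, number=3):
    """Return the best per-call time in seconds over several repeated runs."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    # Each entry is the total for `number` calls; the minimum is the least noisy
    return min(timings) / number


if __name__ == "__main__":
    # Synthetic one-second 16 kHz signal (placeholder data)
    samples = np.random.randint(-32768, 32767, size=16000, dtype=np.int16)
    print(f"normalize: {bench(normalize, samples):.6f} s per call")
```

Taking the minimum over several repeats reduces the influence of one-off noise such as cache warm-up or other processes on the machine.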

But anyways, let's proceed with the script.

import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())

        samples = np_array_conversion()

        @timeit
        def np_float_conversion():
            return samples.astype(np.float32)

        audio_array = np_float_conversion()

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        # output_file = f"{self.base_name}_pydub.npy"

        ## @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)

        # np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]

        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)

            if not audio_frames:
                return np.array([])

            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])

        audio_array = get_array_of_samples()

        @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)

        audio_array = np_float_conversion(audio_array)

        @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        # output_file = f"{self.base_name}_av.npy"

        ## @timeit
        # def np_save(arr, file):
        #     np.save(file, arr)

        # # np_save(audio_array, output_file)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

def benchmark(input_file):
    converter = AudioConverter(input_file)

    pydub_time, pydub_array = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    print(pydub_array.shape)

    av_time, av_array = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")

    print(av_array.shape)
    assert np.array_equal(pydub_array, av_array) is True

if __name__ == "__main__":
    # input_file = "audio.mp3"
    input_file = "/content/sam_altman_lex_podcast_367.flac"
    benchmark(input_file)

And here are the colab results

np_array_conversion took 0.567625 seconds
np_float_conversion took 0.081097 seconds
np_normalization took 0.215539 seconds
Pydub conversion took 30.776541 seconds
(138181951,)
get_array_of_samples took 29.561768 seconds
np_float_conversion took 0.181955 seconds
np_normalization took 0.287005 seconds
AV conversion took 30.038430 seconds
(138181934,)

As you can see, the conversion time is almost the same, but more importantly, the arrays are not equal—there are some missing numbers in your AV implementation. Unless both implementations produce the same arrays, the comparison doesn't make much sense!

I wouldn't recommend benchmarking on colab but rather on one's own computer. :-) Anyways, when I ran the script you gave me verbatim except changing the file path, I received this error:

  File "D:\Scripts\bench_cupy\convert_to_numpy_abet.py", line 117, in benchmark
    assert np.array_equal(pydub_array, av_array) is True
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

When I ran this modified script from my first version, albeit removing the creation of the numpy file, I received these results:

pydub_to_numpy took 0.279033 seconds
Pydub conversion took 17.316214 seconds
av_to_numpy took 8.862365 seconds
AV conversion took 8.865208 seconds

Here is the modified script:

SCRIPT JUST NOT SAVING NUMPY FILE

```
import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def pydub_to_numpy():
            return np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max

        audio_array = pydub_to_numpy()

        end_time = time.perf_counter()
        return end_time - start_time

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]

        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def av_to_numpy():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)

            if not audio_frames:
                return np.array([])

            # Concatenate all frames into a single numpy array, convert to float32, and normalize
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames]).astype(np.float32) / np.iinfo(np.int16).max

        audio_array = av_to_numpy()

        end_time = time.perf_counter()
        return end_time - start_time

def benchmark(input_file):
    converter = AudioConverter(input_file)

    pydub_time = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")

    av_time = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\sam_altman_lex_podcast_367.flac"
    benchmark(input_file)
```

Also, the differences are so minuscule that they don't matter as a practical matter. And who is to say that pydub is correct and not av, or vice versa? Moreover, I haven't verified whether both use the same samplers and other sub-libraries, so... perhaps they're both "correct" in that the very, very minor differences are due to the different sub-libraries and/or versions of them.

If you use the compare_npy_files function, it'll show the minuscule difference.

Here's an example:

0.00e+00 to 3.34e-02: 137870152 samples
3.34e-02 to 6.69e-02: 285089 samples
6.69e-02 to 1.00e-01: 23120 samples
1.00e-01 to 1.34e-01: 2989 samples
1.34e-01 to 1.67e-01: 477 samples
1.67e-01 to 2.01e-01: 91 samples
2.01e-01 to 2.34e-01: 10 samples
2.34e-01 to 2.67e-01: 3 samples
2.67e-01 to 3.01e-01: 1 samples
3.01e-01 to 3.34e-01: 2 samples
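
A breakdown like the one above can be produced by bucketing the absolute per-sample differences. The sketch below is one possible way to do it with `np.histogram`; `diff_histogram` and `print_histogram` are hypothetical names and this is not necessarily how compare_npy_files works. It assumes the two arrays may differ in length by a few samples and compares only the overlapping part.

```python
import numpy as np


def diff_histogram(a, b, bins=10):
    """Bucket the absolute sample differences of two arrays into equal-width bins."""
    # Compare only the overlap, since the arrays may not have the same length
    n = min(len(a), len(b))
    diff = np.abs(
        np.asarray(a[:n], dtype=np.float64) - np.asarray(b[:n], dtype=np.float64)
    )
    counts, edges = np.histogram(diff, bins=bins)
    return counts, edges


def print_histogram(counts, edges):
    # One line per bin: lower edge, upper edge, number of samples in the bin
    for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
        print(f"{lo:.2e} to {hi:.2e}: {count} samples")
```

Note that truncating to the shorter length only lines the arrays up if the missing samples are at the end; if they are dropped elsewhere, the two signals would be shifted against each other and the differences would look larger than they are.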

BBC-Esq commented 2 months ago

Another possible difference could be the libraries installed on my machine versus whatever is installed on Google's servers... Perhaps that accounts for the divergent results in part... and for the fact that I received an error while the Colab run worked...

abdeladim-s / pywhispercpp

Performance Improvement ideas / feature requests #49