Const-me / Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Mozilla Public License 2.0

How to send PCM Data when using whisper.dll as a library #155

Closed endink closed 1 year ago

endink commented 1 year ago

The current implementation requires a microphone, a file, or file content as input. How can it be used with a pure PCM array (float32[] or int16[], with no file header)?

If we construct the WAV file header ourselves, that requires an extra memory copy.

Const-me commented 1 year ago

@endink To send PCM data to the library, implement the iAudioBuffer COM interface. It only has 4 methods; here's a link to the C# language projection, which has documentation: https://github.com/Const-me/Whisper/blob/master/WhisperNet/API/iAudioBuffer.cs

Input data must be 16kHz sample rate, and the samples need to be FP32 numbers in [-1.0 .. +1.0] interval.
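For readers who start from int16 PCM, the format requirement above can be met with a conversion like the one below. This is only a minimal sketch: the function name `pcm16ToFloat` is hypothetical (it is not part of the library), and it assumes the input is already mono at 16 kHz, so no resampling or channel mixing is done.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Whisper library.
// Converts signed 16-bit PCM samples to FP32 values.
// Dividing by 32768 maps INT16_MIN to exactly -1.0 and INT16_MAX to just under +1.0,
// so every output lies inside the [-1.0 .. +1.0] interval.
std::vector<float> pcm16ToFloat( const int16_t* src, size_t count )
{
    assert( src != nullptr || count == 0 );
    std::vector<float> dst( count );
    for( size_t i = 0; i < count; i++ )
        dst[ i ] = static_cast<float>( src[ i ] ) / 32768.0f;
    return dst;
}
```

The resulting vector owns its memory, so a buffer object that keeps it alive can hand out `dst.data()` for as long as the native code needs it.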

If your inputs are in C# arrays allocated on the managed heap, you're going to need GCHandle to make sure the GC doesn’t relocate the data while the C++ code is reading from it.

endink commented 1 year ago

Thanks for your reply.

I am very familiar with C++ and C#, but not with Windows COM, and in this case I am using C++. It seems that I need to learn COM?

endink commented 1 year ago

In fact, the main problem is the lack of a load-from-bytes function that returns an iAudioBuffer; the only function available loads from a file path.

I really don't want to modify the source code, since that would break upgrading from upstream.

I think what we actually need is an API exposed as a pure C++ interface, so that we can easily implement what we want in C++.

Const-me commented 1 year ago

@endink If you write C++ and you want to supply that PCM data without modifying the library too much, implement that COM interface in some C++ class. COM is an ABI, but it’s pretty close to pure C++ interfaces. The only difference is that COM interfaces need to inherit from IUnknown, which provides reference counting (and therefore lifetime management: once the counter reaches zero, the object should destroy itself and deallocate its memory) and coarse-grained reflection through the QueryInterface method.

If you’re on Windows, you can use either my ComLight or Microsoft’s ATL to make that class; they do about the same thing. You can also skip both of them, in which case you’ll have to implement the 3 critical IUnknown methods (QueryInterface, AddRef, Release) yourself.
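To illustrate what implementing those methods by hand involves, here is a portable sketch of the pattern. Everything in it is a stand-in: `UnknownLike`, `AudioBufferLike`, `PcmBuffer`, the integer interface IDs and the return codes are hypothetical simplifications so the code compiles without Windows headers. The real IUnknown uses HRESULT, REFIID, ULONG and __stdcall, and the real iAudioBuffer method list should be taken from the library's headers.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for IUnknown (assumption: simplified signatures, not the real COM ABI)
struct UnknownLike
{
    virtual long queryInterface( int iid, void** pp ) = 0;
    virtual uint32_t addRef() = 0;
    virtual uint32_t release() = 0;
protected:
    virtual ~UnknownLike() = default;   // objects are destroyed only through release()
};

// Hypothetical audio-buffer interface; the names are illustrative only
struct AudioBufferLike : UnknownLike
{
    static constexpr int kIid = 1;
    virtual uint32_t countSamples() const = 0;
    virtual const float* getPcmMono() const = 0;
};

class PcmBuffer final : public AudioBufferLike
{
    std::atomic<uint32_t> refs{ 1 };    // the creator holds the first reference
    std::vector<float> pcm;             // FP32 samples owned by this object

    ~PcmBuffer() override { liveCount()--; }
public:
    // Number of PcmBuffer objects currently alive; only here to make the lifetime observable
    static std::atomic<int>& liveCount() { static std::atomic<int> n{ 0 }; return n; }

    explicit PcmBuffer( std::vector<float> samples ) : pcm( std::move( samples ) ) { liveCount()++; }

    long queryInterface( int iid, void** pp ) override
    {
        assert( pp != nullptr );
        if( iid == AudioBufferLike::kIid )
        {
            *pp = static_cast<AudioBufferLike*>( this );
            addRef();                   // every successful query hands out a new reference
            return 0;
        }
        *pp = nullptr;
        return -1;                      // stand-in for E_NOINTERFACE
    }
    uint32_t addRef() override { return ++refs; }
    uint32_t release() override
    {
        const uint32_t left = --refs;
        if( left == 0 )
            delete this;                // last reference gone: the object destroys itself
        return left;
    }

    uint32_t countSamples() const override { return static_cast<uint32_t>( pcm.size() ); }
    const float* getPcmMono() const override { return pcm.data(); }
};
```

The object must be created with `new` and never deleted directly: each owner calls `release()` when done, and the last call frees the sample memory. ComLight and ATL generate essentially this boilerplate for you.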

endink commented 1 year ago

Thanks for the advice. I read your code: if I implement a class that inherits from ComLight::ObjectRoot<iAudioBuffer>, it can be passed to your API, am I right?

endink commented 1 year ago

I think I can do it:

class A : public ComLight::ObjectRoot<iAudioBuffer>
{
    ....
};

ComLight::CComPtr<ComLight::Object<A>> obj;
CHECK( ComLight::Object<A>::create( obj ) );
obj.detach( pp );

So I can get an A instance, and call Release when I don't need it anymore.

endink commented 1 year ago

Sorry, this is not as simple as I thought. For now I still use the WAV header to avoid modifying the source code, because implementing this interface requires interacting with interfaces such as IMFSample and being familiar with Windows Media Foundation...


I hope you can export an API to C++ that lets us pass plain byte arrays or raw audio data (float32 or int16 samples, channel count, sample rate, etc.) without implementing COM interfaces ourselves, which would be a huge win for all C++ programmers.

BTW, this library is cool. Even though I copy the array to build a WAV file in memory every time (which is what I'm doing now), it is still much faster than whisper.cpp, and it doesn't need CUDA, which is the most magical part.

I'm guessing this magic could also be applied to llama.cpp, since they all have GGML behind them.

endink commented 1 year ago

BTW, here is my implementation for anyone who needs it later:


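#include <cstdint>

// Canonical 44-byte header of a PCM WAV file.
// Every field is naturally aligned, so the compiler inserts no padding.
struct WavPCMFileHeader
{
    struct RIFF {
        const char riff[4] = { 'R', 'I', 'F', 'F' };
        uint32_t fileLength;        // total file size minus the 8 bytes of tag and length
        const char wave[4] = { 'W', 'A', 'V', 'E' };
    } riff;
    struct Format
    {
        const char fmt[4] = { 'f', 'm', 't', ' ' };
        uint32_t blockSize = 16;    // size of the remaining format fields
        uint16_t formatTag;         // 1 = integer PCM
        uint16_t channels;
        uint32_t samplesPerSec;
        uint32_t avgBytesPerSec;
        uint16_t blockAlign;        // bytes per sample frame, all channels
        uint16_t bitsPerSample;
    } format;
    struct Data
    {
        const char data[4] = { 'd', 'a', 't', 'a' };
        uint32_t dataLength;        // size of the PCM payload in bytes
    } data;

    WavPCMFileHeader() : riff(), format(), data()
    {
    }

    // dataSize is the size of the PCM payload in bytes
    WavPCMFileHeader(int nCh, int nSampleRate, int bitsPerSample, int dataSize) {
        riff.fileLength = 36 + static_cast<uint32_t>(dataSize);
        format.formatTag = 1;
        format.channels = static_cast<uint16_t>(nCh);
        format.samplesPerSec = static_cast<uint32_t>(nSampleRate);
        format.avgBytesPerSec = static_cast<uint32_t>(nSampleRate * nCh * bitsPerSample / 8);
        format.blockAlign = static_cast<uint16_t>(nCh * bitsPerSample / 8);
        format.bitsPerSample = static_cast<uint16_t>(bitsPerSample);
        data.dataLength = static_cast<uint32_t>(dataSize);
    }
};

// The memcpy-based usage relies on this exact layout
static_assert(sizeof(WavPCMFileHeader) == 44, "unexpected padding in WAV header");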

usage:

int16_t* samples = nullptr;
int num_samples = 0;

// ... get the samples and the number of samples here

// Payload size in bytes: multiply the sample count (not the pointer) by the sample size
const size_t dataBytes = static_cast<size_t>(num_samples) * sizeof(int16_t);
WavPCMFileHeader header(1, 16000, 8 * 2, static_cast<int>(dataBytes));
const size_t size = sizeof(header) + dataBytes;
// _buffer is a byte vector kept between calls, so the allocation is reused
if (_buffer.size() != size)
{
    _buffer.resize(size, 0);
}
memcpy(_buffer.data(), &header, sizeof(header));
memcpy(_buffer.data() + sizeof(header), samples, dataBytes);

iAudioReader* reader = nullptr;
CHECK_HRESULT_OK(_mf->loadAudioFileData(_buffer.data(), _buffer.size(), false, &reader), false)

// So, we get the iAudioReader interface
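The struct-and-memcpy approach above relies on the compiler adding no padding and on a little-endian host, both of which hold for typical Windows builds. As an alternative sketch that avoids those assumptions, the same 44-byte header can be written byte by byte. The names `makeWav16`, `putU16`, `putU32` and `putTag` are hypothetical helpers, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 16-bit value in little-endian order, as the WAV format requires
static void putU16( std::vector<uint8_t>& out, uint16_t v )
{
    out.push_back( static_cast<uint8_t>( v & 0xFF ) );
    out.push_back( static_cast<uint8_t>( v >> 8 ) );
}

// Append a 32-bit value in little-endian order
static void putU32( std::vector<uint8_t>& out, uint32_t v )
{
    for( int shift = 0; shift < 32; shift += 8 )
        out.push_back( static_cast<uint8_t>( ( v >> shift ) & 0xFF ) );
}

// Append a 4-character chunk tag such as "RIFF"
static void putTag( std::vector<uint8_t>& out, const char* tag )
{
    out.insert( out.end(), tag, tag + 4 );
}

// Build a complete in-memory WAV file from 16-bit PCM.
// count is the total number of int16 values, interleaved across channels.
std::vector<uint8_t> makeWav16( const int16_t* samples, size_t count, uint32_t sampleRate, uint16_t channels )
{
    assert( samples != nullptr || count == 0 );
    const uint16_t bitsPerSample = 16;
    const uint16_t blockAlign = static_cast<uint16_t>( channels * bitsPerSample / 8 );
    const uint32_t dataBytes = static_cast<uint32_t>( count * sizeof( int16_t ) );

    std::vector<uint8_t> out;
    out.reserve( 44 + dataBytes );
    putTag( out, "RIFF" );
    putU32( out, 36 + dataBytes );          // file size minus the first 8 bytes
    putTag( out, "WAVE" );
    putTag( out, "fmt " );
    putU32( out, 16 );                      // size of the format fields that follow
    putU16( out, 1 );                       // format tag 1 = integer PCM
    putU16( out, channels );
    putU32( out, sampleRate );
    putU32( out, sampleRate * blockAlign ); // average bytes per second
    putU16( out, blockAlign );
    putU16( out, bitsPerSample );
    putTag( out, "data" );
    putU32( out, dataBytes );
    for( size_t i = 0; i < count; i++ )
        putU16( out, static_cast<uint16_t>( samples[ i ] ) );
    return out;
}
```

The returned vector's `data()` and `size()` could then be passed to loadAudioFileData in the same way as `_buffer` above. This version still copies the samples once, so it trades a little speed for portability rather than removing the copy.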