Const-me / Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Mozilla Public License 2.0
8.05k stars · 681 forks

STREAM_AUDIO and multiple files #73

Open checksummaster opened 1 year ago

checksummaster commented 1 year ago

When I run `main.exe -m ggml-base.bin -f 1.wav 2.wav 3.wav 1.wav 2.wav`, where 1.wav contains me counting "1, 2, 3" (recorded as a 16-bit WAV file), 2.wav contains "4, 5, 6", and 3.wav contains "7, 8, 9":

It succeeds on 1.wav, but the results for the other files are random (not the same each time I run it). It often gives me [Music], even though I speak in every WAV file.


```
Created source reader from the file "1.wav"
[00:00:00.000 --> 00:00:04.000]  One, two, three.
Created source reader from the file "2.wav"
[00:00:00.000 --> 00:00:04.000]  4, 5, 6
Created source reader from the file "3.wav"
[00:00:00.000 --> 00:00:04.000]  [Music]
Created source reader from the file "1.wav"
[00:00:00.000 --> 00:00:05.000]  [Music]
Created source reader from the file "2.wav"
[00:00:00.000 --> 00:00:04.000]  [Music]
```

If I set STREAM_AUDIO=0, it works every time.


```
Loaded audio file from "1.wav": 77824 samples, 4.864 seconds
[00:00:00.000 --> 00:00:04.000]  One, two, three.
Loaded audio file from "2.wav": 69632 samples, 4.352 seconds
[00:00:00.000 --> 00:00:04.000]  4, 5, 6
Loaded audio file from "3.wav": 73728 samples, 4.608 seconds
[00:00:00.000 --> 00:00:04.000]  7, 8, 9
Loaded audio file from "1.wav": 77824 samples, 4.864 seconds
[00:00:00.000 --> 00:00:04.000]  One, two, three.
Loaded audio file from "2.wav": 69632 samples, 4.352 seconds
[00:00:00.000 --> 00:00:04.000]  4, 5, 6
```

I know that setting STREAM_AUDIO=0 looks like a fix (a bad one, but it still looks like a fix).

My problem with this bug is that I want to use a memory buffer holding raw data from special hardware. I prefix the buffer with a WAV header, then use loadAudioFileData and runStreamed, which is the same code path as STREAM_AUDIO=1.

I don't know yet how to do the same thing using runFull (or how to fix whatever is wrong with STREAM_AUDIO=1).
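The "prefix my buffer with a WAV header" step mentioned above can be sketched as follows. This is a hypothetical helper, not part of the library: it builds a minimal 44-byte RIFF/WAVE header for raw integer PCM, which can be prepended to a capture buffer so that APIs expecting a complete WAV file in memory (such as loadAudioFileData) accept it. It assumes a little-endian host, which matches the WAV byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: build a minimal 44-byte RIFF/WAVE header for a raw
// integer-PCM buffer of dataBytes bytes. Assumes a little-endian host.
std::vector<uint8_t> makeWavHeader(uint32_t dataBytes, uint32_t sampleRate,
                                   uint16_t channels, uint16_t bitsPerSample)
{
    std::vector<uint8_t> h(44);
    auto put32 = [&](size_t off, uint32_t v) { std::memcpy(&h[off], &v, 4); };
    auto put16 = [&](size_t off, uint16_t v) { std::memcpy(&h[off], &v, 2); };

    std::memcpy(&h[0], "RIFF", 4);
    put32(4, 36 + dataBytes);             // RIFF chunk size = file size - 8
    std::memcpy(&h[8], "WAVE", 4);
    std::memcpy(&h[12], "fmt ", 4);
    put32(16, 16);                        // PCM "fmt " chunk is 16 bytes
    put16(20, 1);                         // format tag 1 = integer PCM
    put16(22, channels);
    put32(24, sampleRate);
    const uint32_t blockAlign = channels * bitsPerSample / 8u;
    put32(28, sampleRate * blockAlign);   // average byte rate
    put16(32, (uint16_t)blockAlign);
    put16(34, bitsPerSample);
    std::memcpy(&h[36], "data", 4);
    put32(40, dataBytes);                 // payload size in bytes
    return h;
}
```

For the use case in this thread, one would call it with 16000 Hz, 1 channel, 16 bits, then append the raw samples after the returned 44 bytes.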

checksummaster commented 1 year ago

Snippet for testing with a memory buffer:

```cpp
std::atomic_bool is_aborted = false;
{
    wparams.encoder_begin_callback = &beginSegmentCallback;
    wparams.encoder_begin_callback_user_data = &is_aborted;
}

if (true) {
    printf("----- USING BUFFER-------");
    std::ifstream file(fname, std::ios::binary);
    file.seekg(0, std::ios::end);
    size_t fileSize = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<std::byte> data(fileSize);
    file.read(reinterpret_cast<char*>(data.data()), fileSize);
    ComLight::CComPtr<iAudioReader> reader;
    CHECK(mf->loadAudioFileData(data.data(), data.size(), false, &reader));
    sProgressSink progressSink{ nullptr, nullptr };
    hr = context->runStreamed(wparams, progressSink, reader);
}
else {

    if (STREAM_AUDIO && !wparams.flag(eFullParamsFlags::TokenTimestamps))
    {
        ComLight::CComPtr<iAudioReader> reader;
        CHECK(mf->openAudioFile(fname.c_str(), params.diarize, &reader));
        // … (snipped)
    }
}
```
emcodem commented 1 year ago

I don't like these kinds of issues; there is a ton of software out there that can concat files, record, filter, extract and transform audio. From my perspective it would be best to provide a read-from-stdin interface, so we can easily interface with ffmpeg and the like.

checksummaster commented 1 year ago

The bug I reported is... As is, it does not work with multiple files. Using ffmpeg with a pipe or anything else will not change that.

The bug seems related to streaming vs. using a buffer. By default (without changing code and recompiling), it uses streaming.

I'm sorry, I should not have put the buffer example in the description; it causes confusion.

Until the streaming problem is solved, I will make a pull request that works with a buffer, and with multiple buffers.

The reason I'm doing this is that I want to initialize the Whisper engine once, then just send a new buffer whenever I need to (reinitializing takes something like half of the total time... but that's not confirmed yet).

I could just recompile Whisper.dll with STREAM_AUDIO=0, save the WAV to disk, and call it with the reader function, but reading/writing to disk... again, that takes time.
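The initialize-once, feed-many-buffers pattern described above can be sketched with the same calls that already appear in the snippet earlier in this thread (loadAudioFileData, runStreamed). Whether the context can safely be reused across buffers this way is exactly what the pull request is meant to establish, so treat this as a hypothetical sketch, not confirmed behavior:

```cpp
// Hypothetical sketch: load the model and create the context once, then
// feed each new in-memory WAV buffer through loadAudioFileData + runStreamed.
// Identifiers (mf, context, wparams, CHECK) are taken from the snippet above;
// incomingBuffers stands in for whatever delivers the hardware data.
for( const std::vector<std::byte>& wavBuffer : incomingBuffers )
{
    ComLight::CComPtr<iAudioReader> reader;
    CHECK( mf->loadAudioFileData( wavBuffer.data(), wavBuffer.size(), false, &reader ) );
    sProgressSink progressSink{ nullptr, nullptr };
    CHECK( context->runStreamed( wparams, progressSink, reader ) );
    // The context stays loaded; only the reader is recreated per buffer.
}
```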

checksummaster commented 1 year ago

When I said buffer, I meant a WAV file in memory (versus a WAV file on disk).

checksummaster commented 1 year ago

But I agree with emcodem (it's not related to this bug, but having something like that would be very cool):

```
ffmpeg -i mystuff.mp3 -f s16le -acodec pcm_s16le -ac 1 -ar 16000 pipe:1 | whisper.exe -f s16le --channel 1 --samplerate 16000 > mystuff.txt
```

And let's say my MP3 is a recording that is 7 days long, and it does not crash, because it never runs out of memory ;)

emcodem commented 1 year ago

Sorry for not reading and understanding your stuff. As far as I can see, setting STREAM_AUDIO=0 forces it to run on the CPU, and your PR does that too. This makes me wonder whether any of this belongs in this library rather than in the original cpp version. Anyway, I tried reproducing your issue, feeding 2 different short WAV files repeatedly (GPU/default version), but it always worked for me. For practice, I'd like to see if I can find out what the issue with the GPU version is for you; if you like, please provide your input files.

checksummaster commented 1 year ago

Just make sure no other options are on the command line; I think setting the output to WTS has the same effect as setting STREAM_AUDIO=0... Also, when STREAM_AUDIO is 0, the function is still very fast, as if the GPU is still being used... I'm not near my PC, but I will check on Monday.