ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Question about no_context's default value in the streaming use case #432

Open debasish-mihup opened 1 year ago

debasish-mihup commented 1 year ago

From what I can see, no_context = true by default, and unless we explicitly set it to false via the "-kc" or "--keep-context" option, the context will not be maintained between subsequent ASR transcribe calls. So my question is: to improve the ASR output quality, shouldn't this value be false by default in the streaming example code?

ggerganov commented 1 year ago

You are correct that having more context in general improves ASR quality. However, more text context makes the decoder slower, so I thought it was OK to sacrifice some quality in order to gain efficiency by default.

There is also another factor: since the audio is currently chunked in a naive way, we can get partial words at the ends and starts of the audio chunks. These can be transcribed into something invalid, and from there, by keeping the context, the error can propagate further.

I think when we implement #426, --keep-context should become more robust, and maybe it could then be enabled by default.

debasish-mihup commented 1 year ago

@ggerganov I have one more question. I am trying to run the stream inference from Python. There are primarily two levels of retaining previous information: keeping a 200 ms slice of the last audio chunk's data, and the optional context data. The audio data is straightforward from the Python side. For the context, I have obtained the token IDs as a list using the code below. Now the question is: how do I pass this Python list to the C++ side during the next audio chunk iteration?

        from typing import List

        # Collect the token IDs of every segment from the last transcribe call
        tokens: List[int] = []
        n_segments = whisper_cpp.whisper_full_n_segments(self.ctx)
        for i in range(n_segments):
            token_count = whisper_cpp.whisper_full_n_tokens(self.ctx, i)
            for j in range(token_count):
                tokens.append(whisper_cpp.whisper_full_get_token_id(self.ctx, i, j))
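One way this could be done (a sketch, not tested against whisper.cpp itself; it assumes the params struct exposes `prompt_tokens` and `prompt_n_tokens` fields as declared in whisper.h) is to build a ctypes array directly from the Python list:

```python
import ctypes

def make_prompt_tokens(token_ids):
    """Convert a Python list of token IDs into a ctypes int32 array.

    The returned array owns its memory, so keep a reference to it for
    as long as the C side may read from it.
    """
    return (ctypes.c_int32 * len(token_ids))(*token_ids)

# Usage sketch (field names assumed from whisper.h):
# self._prompt_buf = make_prompt_tokens(tokens)  # keep a reference alive
# self.params.prompt_tokens = ctypes.cast(
#     self._prompt_buf, ctypes.POINTER(ctypes.c_int32))
# self.params.prompt_n_tokens = len(tokens)
```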
debasish-mihup commented 1 year ago

@ggerganov OK. I have been able to pass the pointer to the C++ code from Python by converting the list to a numpy array and then casting its data pointer to a ctypes pointer. The code below gives a memory access violation if the length of dummy_context is less than or equal to 2, and works otherwise. I am not sure whether this approach is actually correct. Can you check it, or am I missing something? Why would there be an access violation?

        dummy_context = [4236, 211]
        nd_arr = np.array(dummy_context, dtype=np.int32)
        self.params.prompt_n_tokens = nd_arr.size
        INTP = ctypes.POINTER(ctypes.c_int32)
        nd_arr_pointer = ctypes.cast(nd_arr.ctypes.data, INTP)
        self.params.prompt_tokens = nd_arr_pointer # Memory access violation if dummy_context length is <= 2

        result = whisper_cpp.whisper_full(
    OSError: exception: access violation reading 0x00000389A087FAA0
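One possible cause (an assumption, not confirmed in this thread): `nd_arr.ctypes.data` returns a raw address without keeping the numpy array alive, so if `nd_arr` is garbage-collected before `whisper_full` runs, the pointer dangles. Keeping an explicit reference to the buffer on `self` avoids this; a minimal sketch using plain ctypes (the `PromptHolder` class and its field names are illustrative, not part of the whisper.cpp API):

```python
import ctypes

class PromptHolder:
    """Sketch: keep the token buffer alive alongside the pointer to it."""

    def __init__(self, token_ids):
        # The ctypes array owns its memory; storing it on self ensures it
        # is not garbage-collected while the C code may still read it.
        self._buf = (ctypes.c_int32 * len(token_ids))(*token_ids)
        self.prompt_tokens = ctypes.cast(
            self._buf, ctypes.POINTER(ctypes.c_int32)
        )
        self.prompt_n_tokens = len(token_ids)

holder = PromptHolder([4236, 211])
# holder.prompt_tokens could then be assigned to params.prompt_tokens,
# and holder.prompt_n_tokens to params.prompt_n_tokens.
```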
aehlke commented 11 months ago

Now that it's been a while, how has your experience been with no_context = false for streaming?