Ughuuu closed this 9 months ago
@aiaimimi0920 Is the hallucination on a silent mic better now?
No. Now, when you are not talking, it automatically adds "Thank you" or "you", as in this issue: https://github.com/ggerganov/whisper.cpp/issues/1592
Maybe I can use the method mentioned in that issue to solve it.
Is this fixed by the latest changes?
Not resolved.
I just added a blocklist of phrases like "thanks" and "thanks you". But when using other languages, there is still a high probability of hallucinated text, such as “xx字幕” ("xx subtitles") or “谢谢你” ("thank you").
This is because the silence check in "add buffer" only tests whether the audio energy is below a certain value. If your environment is very quiet, it won't generate hallucinated text. But if your environment is somewhat noisy, for example with a fan running, the audio is still judged to be valid, enters the inference stage, and then produces hallucinated text.
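The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the `ENERGY_THRESHOLD` value, and the `"thank you"` blocklist entry are assumptions made for the example, and RMS is only one plausible way to measure "energy".

```python
import math
import re

# Hypothetical values for illustration; not taken from the repository.
ENERGY_THRESHOLD = 0.01  # RMS below this is treated as silence
# "thanks" and "thanks you" come from the thread; "thank you" is added here.
BLOCKED_PHRASES = {"thanks", "thanks you", "thank you"}


def rms_energy(samples):
    """Root-mean-square energy of float samples in the range [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples, threshold=ENERGY_THRESHOLD):
    """Energy gate: True when the buffer is quiet enough to skip inference."""
    return rms_energy(samples) < threshold


def is_blocked(text, blocked=BLOCKED_PHRASES):
    """Blocklist: True when the whole transcript is a known hallucination."""
    # Strip punctuation and case so "Thank you." matches "thank you".
    normalized = re.sub(r"[^\w\s]", "", text).strip().lower()
    return normalized in blocked
```

A steady low hum (say, a sine wave of amplitude 0.05) has an RMS of about 0.035, so it passes this gate even though nobody is speaking. That matches the fan case above: an energy threshold cannot tell noise from speech, and a blocklist only catches the phrases someone has already listed.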
Possible solutions:
I've seen solutions that put a voice activity detector such as https://github.com/snakers4/silero-vad in front of the model, but it seems like more work to use than ggml-whisper.
It is a PyTorch model, so we may need a C++ implementation, or could we use iree.gd?
I can't estimate how long it will take to combine silero-vad with Whisper, but it should be feasible inside iree.gd.
Exposed the VAD threshold option. Also, if it's hallucinating, it now takes at most 4 tokens and no more (as they could be legitimate characters). Anything else should fall outside this repo (e.g. processing of text with iree.gd). We can open another issue for that after iree.gd is released.
It's not very bad as is; maybe we just expose the VAD threshold option and play with it.
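One possible reading of the 4-token cap, as a hedged sketch: when a segment is suspected to be a hallucination, keep only its first few tokens instead of dropping it, since those tokens could still be real speech. The function name and the `suspected_hallucination` flag are hypothetical, and how the repository decides that a segment is suspect is not shown in this thread.

```python
def cap_tokens(tokens, suspected_hallucination, max_tokens=4):
    """Keep at most max_tokens tokens when the segment looks hallucinated."""
    if suspected_hallucination:
        # Truncate rather than discard: short output may still be legitimate.
        return list(tokens[:max_tokens])
    return list(tokens)
```

Segments that are not flagged pass through unchanged, so the cap only limits how much suspect text can reach the transcript.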