V-Sekai / godot-whisper

A GDExtension addon for the Godot Engine that enables realtime audio transcription, supports OpenCL for most platforms and Metal for Apple devices, and runs on a separate thread.
MIT License

Repetition in recordings #72

Open gudatr opened 1 month ago

gudatr commented 1 month ago

So far everything has been working out of the box, so thank you for this great plugin!

Issue: I'm having problems with repetition. Recognition is good, but the same sentence is repeated over and over.

What I have tried: From what I can see in the Whisper documentation, the entropy threshold should fix this. But changing the value seems to have no effect.

entropy 2.8 (default): [screenshot]

entropy 5: [screenshot]

entropy 0: [screenshot]

If anything, higher values make recognition less precise.

Is this related to the other problem regarding Voice Activity Detection? I have tried changing the VAD threshold as well, but that seems to do nothing.

I have also tried using a larger Whisper model, but that yields the same results, only slower.

gudatr commented 1 month ago

So I replaced the Capture Effect of the audio bus with a Record Effect. I used linear interpolation to resample the data I got from GetRecording() from 48000 Hz to 16000 Hz. This works with an astounding accuracy of ~95% (I am not a native English speaker). No repetition, and it even recognizes names correctly.

While this approach works for me, I just couldn't get the sample capture implementation to work.
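For readers who want to try the same workaround, here is a minimal sketch of linear-interpolation resampling. It is not the code from this thread or from godot-whisper: `resample_linear` is a hypothetical helper, and it assumes the recording has already been converted to mono float samples. It also applies no low-pass filter before downsampling, so some aliasing is possible.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of godot-whisper): resamples mono float PCM
// by interpolating linearly between the two nearest input samples.
std::vector<float> resample_linear(const std::vector<float> &input,
                                   int src_rate, int dst_rate) {
    assert(src_rate > 0 && dst_rate > 0);
    if (input.empty()) {
        return {};
    }
    // Number of input samples advanced per output sample (3.0 for 48000 -> 16000).
    const double step = static_cast<double>(src_rate) / dst_rate;
    // Only emit output samples whose source position lies inside the input.
    const std::size_t out_len =
        static_cast<std::size_t>(std::floor((input.size() - 1) / step)) + 1;

    std::vector<float> output(out_len);
    for (std::size_t i = 0; i < out_len; ++i) {
        const double pos = i * step;                          // fractional input index
        const std::size_t i0 = static_cast<std::size_t>(pos); // left neighbour
        const std::size_t i1 = std::min(i0 + 1, input.size() - 1); // right neighbour, clamped
        const float frac = static_cast<float>(pos - i0);      // weight of the right neighbour
        output[i] = input[i0] + (input[i1] - input[i0]) * frac;
    }
    return output;
}
```

Calling `resample_linear(samples, 48000, 16000)` yields roughly one third as many samples, which matches the 16000 Hz input rate mentioned above.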

Ughuuu commented 1 month ago

Interesting, this sounds like it could be an issue with how I am doing the interpolation. This plugin currently uses libsamplerate for that, as seen here: https://github.com/V-Sekai/godot-whisper/blob/main/src/speech_to_text.cpp#L32

The resample function also exposes an InterpolatorType:

    enum InterpolatorType {
        SRC_SINC_BEST_QUALITY = 0,
        SRC_SINC_MEDIUM_QUALITY = 1,
        SRC_SINC_FASTEST = 2,
        SRC_ZERO_ORDER_HOLD = 3,
        SRC_LINEAR = 4,
    };

By default it's set to SRC_SINC_FASTEST: https://github.com/V-Sekai/godot-whisper/blob/c3682d7350454c208809584849806d7303a9be5d/bin/addons/godot_whisper/capture_stream_to_text.gd#L66

You could also try setting it to SRC_SINC_BEST_QUALITY to see if anything changes. If not, your approach is pretty good as well. If you want, you can make a new scene with it and open a PR for others to try (if not, I might if I get some time).

gudatr commented 1 month ago

@Ughuuu I have implemented this in C# here: https://github.com/gudatr/godot-ai-rpg/blob/main/scripts/SpeechRecognizer.cs but it differs greatly from the project's examples. I tried writing the code in GDScript, but I must admit that I am too inexperienced with it, especially if the implementation needs to stay close to the samples, and I currently have no motivation to learn it, sorry.

Ughuuu commented 1 month ago

No worries, thanks for this, it's great! If anything, it's a sample people can look at if they want to do the sampling manually. I'm also busy, but I might take a stab at it in the future.