Open gudatr opened 1 month ago
So I replaced the Capture Effect of the audio bus with a Record Effect. I used linear interpolation to resample the data I got from GetRecording() from 48000 Hz to 16000 Hz. This works with an astounding accuracy of ~95% (I am not a native English speaker). No repetition, and it even recognizes names correctly.
While this approach works for me, I just couldn't get the sample capture implementation to work.
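For anyone who wants to try the same idea, here is a minimal sketch of linear-interpolation resampling in Python. It is not the code from this thread: the function name `resample_linear` is hypothetical, and it assumes the recording has already been converted to a mono list of float samples. It also applies no low-pass filter before downsampling, so some aliasing is possible.

```python
def resample_linear(samples, src_rate=48000, dst_rate=16000):
    """Resample a mono list of float samples using linear interpolation.

    Hypothetical helper for illustration; assumes mono float input.
    """
    if not samples:
        return []
    # How many source samples one output sample spans (3.0 for 48000 -> 16000).
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio                      # fractional position in the source
        left = int(pos)                      # sample at or before that position
        right = min(left + 1, len(samples) - 1)  # next sample, clamped at the end
        frac = pos - left                    # distance between left and right
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

With an exact 3:1 ratio every output position lands on a source sample, so the interpolation only blends neighbours when the ratio is not an integer.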
Interesting, this sounds like it could be an issue with how I am doing the interpolation. This plugin currently uses libsamplerate for that, as seen here: https://github.com/V-Sekai/godot-whisper/blob/main/src/speech_to_text.cpp#L32
The resample function also exposes an InterpolatorType:
enum InterpolatorType {
SRC_SINC_BEST_QUALITY = 0,
SRC_SINC_MEDIUM_QUALITY = 1,
SRC_SINC_FASTEST = 2,
SRC_ZERO_ORDER_HOLD = 3,
SRC_LINEAR = 4,
};
By default it's set to FASTEST https://github.com/V-Sekai/godot-whisper/blob/c3682d7350454c208809584849806d7303a9be5d/bin/addons/godot_whisper/capture_stream_to_text.gd#L66
You could also try setting it to BEST_QUALITY to see if there is a change. If not, the approach you took is pretty good as well. If you want, you can make a new scene with it and open a PR for others to try (if not, I might, if I get some time).
@Ughuuu I have implemented this in C# here: https://github.com/gudatr/godot-ai-rpg/blob/main/scripts/SpeechRecognizer.cs but it differs greatly from the examples in the project. I tried writing the code in GDScript, but I must admit that I am too inexperienced with it, especially if the implementation needs to be close to the samples, and I currently have no motivation to learn it, sorry.
No worries, thanks for this, it's great! If anything, it's a sample people can look at if they want to do the resampling manually. I'm also busy, but I might take a stab at it in the future.
So far everything has been working out of the box, so thank you for this great plugin!
Issue: I'm having problems with repetition. Recognition is good, but the same sentence is repeated over and over.
What I have tried: From what I can see in the whisper documentation, the entropy threshold should fix this. But changing the value seems to have no effect.
entropy 2.8, default
entropy 5
entropy 0
If anything, higher values make recognition less precise.
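To make the entropy idea concrete, here is a rough Python sketch of how such a threshold could flag repetition. It assumes the value is compared against the Shannon entropy of token frequencies over a recent window of decoded tokens; that is an assumption about how whisper handles it, not something confirmed in this thread. The function name `token_entropy` and the window size of 32 are hypothetical.

```python
import math
from collections import Counter


def token_entropy(tokens, window=32):
    """Shannon entropy (natural log) of token frequencies in the last `window` tokens.

    Hypothetical helper; the window size is an assumption for illustration.
    """
    recent = tokens[-window:]
    if not recent:
        return 0.0
    counts = Counter(recent)
    n = len(recent)
    # Few distinct tokens (a repeated phrase) give low entropy;
    # many distinct tokens give high entropy.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def looks_repetitive(tokens, entropy_threshold=2.8):
    """Flag a token sequence whose recent entropy falls below the threshold."""
    return token_entropy(tokens) < entropy_threshold
```

Under this assumption, a threshold of 0 would never flag anything, because entropy cannot go below zero, while a higher threshold flags more output as repetitive.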
Is this related to the other problem regarding Voice Activity Detection? I have tried changing the VAD threshold as well, but that seems to do nothing.
I have also tried using a larger whisper model but that yields the same results, only slower.