V-Sekai / godot-whisper

A GDExtension addon for the Godot Engine that enables realtime audio transcription. It supports OpenCL on most platforms and Metal on Apple devices, and runs on a separate thread.
MIT License

On silence, the mic hallucinates #68

Open aiaimimi0920 opened 2 months ago

aiaimimi0920 commented 2 months ago

The current version (bd9c18a4ce614b511216757d5962e934b56b2d09) still produces a large amount of output when the microphone is silent: https://github.com/V-Sekai/godot-whisper/assets/153103332/13ce75ed-2c6f-4224-bdd7-6bc0b118caa2

I remember that the previous version (about three months ago?) did not seem to hallucinate this much on a silent microphone: https://github.com/V-Sekai/godot-whisper/assets/153103332/aad89a0c-f965-4349-9b4c-6d0233161b79

If possible, it would be great to get this problem fixed.

Ughuuu commented 2 months ago

That's true. In the new version I decoupled the logic as much as possible, so that it can be called from GDScript independently. It's true that hallucination is worse. I'll try to look into combining it with iree.gd for hallucination, now that that's done. @fire? Ideas?

fire commented 2 months ago

People have mentioned combining silence detection with whisper as a first thought, but I am concerned about the total latency of the voice transcription.

Ughuuu commented 2 months ago

I see. I'll look into the vad_detection logic; most likely I didn't port it correctly when I migrated. I'll compare it against the old version and see what is different.

fire commented 2 months ago

AI-based VAD is also a thing, and that was my approach for iree and whisper-jax.

Ughuuu commented 2 months ago

The silence detection may already work, but these project settings need to be configured:

- audio/input/transcribe/vad_treshold
- audio/input/transcribe/freq_treshold

For now I am increasing vad_treshold to 2, as that seems to give good results in my case. Increasing it to 5 is even better in terms of silence detection.
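
As a rough illustration of why a higher threshold suppresses output on silence, here is a minimal Python sketch of an energy-based gate. It is not the addon's actual code: the helper names (`mean_abs`, `is_speech`, `make_tone`), the noise-floor comparison, and the synthetic test signals are all assumptions made for the example. The idea is simply that a chunk is only passed on for transcription when its average amplitude exceeds the threshold times a measured background level.

```python
import math


def mean_abs(samples):
    """Average absolute amplitude of a chunk of PCM samples in [-1, 1]."""
    return sum(abs(s) for s in samples) / len(samples) if samples else 0.0


def is_speech(chunk, noise_floor, vad_threshold):
    """Hypothetical gate: True when the chunk is louder than
    vad_threshold times the background noise level."""
    return mean_abs(chunk) > vad_threshold * noise_floor


def make_tone(amplitude, n=1600, freq=220.0, rate=16000):
    """Synthetic stand-in for microphone input: a sine wave of the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


# Assumed background level, measured from a "silent" room.
noise_floor = mean_abs(make_tone(0.01))

quiet_hum = make_tone(0.015)   # 1.5x the background level
speech_like = make_tone(0.3)   # 30x the background level

for threshold in (1.0, 2.0, 5.0):
    print(
        threshold,
        is_speech(quiet_hum, noise_floor, threshold),
        is_speech(speech_like, noise_floor, threshold),
    )
```

In this sketch the quiet hum passes the gate at a threshold of 1 but is rejected at 2 and 5, while the louder signal passes at all three. If the real setting behaves similarly, raising it means fewer near-silent chunks reach the model, at the cost of possibly dropping very quiet speech.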

Ughuuu commented 2 months ago

@aiaimimi0920, let me know if you get a chance to try it.