bnosac / audio.whisper

Transcribe audio files using the "Whisper" Automatic Speech Recognition model from R

Notes on repetitions #38

Open jwijffels opened 5 months ago

jwijffels commented 5 months ago

Strategies to reduce repetitions / hallucinations

- Use 5 beams.
- Increase the entropy threshold from the default 2.4 to, for example, 2.8. A higher threshold rejects repetitive text and falls back to sampling with a higher temperature.
- Reduce the maximum context size (--max-context). By default it is 224. Setting it to 64 or 32 can reduce repetitions significantly. Setting it to 0 will most likely eliminate all repetitions, but transcription quality can suffer because the context from the previous transcript is lost.
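As a rough illustration of the settings above, here is a minimal Python sketch that assembles the corresponding arguments for the whisper.cpp command-line example. The helper name `whisper_cpp_args` and the file paths are hypothetical. The flags `--beam-size`, `--entropy-thold` and `--max-context` are the whisper.cpp CLI options as I understand them, so check them against the `--help` output of your build.

```python
def whisper_cpp_args(model_path, audio_path, beam_size=5,
                     entropy_thold=2.8, max_context=64):
    """Build the argument list for the whisper.cpp CLI (hypothetical helper).

    Defaults follow the suggestions in this thread: 5 beams, an entropy
    threshold raised from 2.4 to 2.8, and a reduced context of 64 tokens.
    """
    return [
        "-m", str(model_path),                    # path to the ggml model file
        "-f", str(audio_path),                    # path to the 16 kHz WAV file
        "--beam-size", str(beam_size),
        "--entropy-thold", str(entropy_thold),
        "--max-context", str(max_context),
    ]
```

The list can be passed to `subprocess.run` together with the path of the whisper.cpp binary, which depends on how and when it was built and is therefore left out here.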

Related to timestamps: see https://github.com/ggerganov/whisper.cpp/issues/1724

jwijffels commented 5 months ago

TODO: add an R function to detect repetitions and the location in the audio/transcription where they start and after which the model does not recover, so that the transcription can be relaunched from that point with other settings or a better model.
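One possible shape for such a detector, sketched in Python rather than R and not part of audio.whisper: it assumes the transcript is available as a list of segments with `start`, `end` (in seconds) and `text`, and it flags runs of consecutive segments whose normalized text is identical. The names `normalize`, `find_repetition_runs` and `min_repeats` are hypothetical, and exact-match runs are only one simple heuristic; they will not catch near-duplicates or looping phrases inside a single segment.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def find_repetition_runs(segments, min_repeats=3):
    """Find runs of consecutive segments with identical normalized text.

    segments: list of dicts with keys "start", "end" (seconds) and "text".
    Returns one dict per run of at least `min_repeats` segments.
    """
    runs = []
    i = 0
    while i < len(segments):
        key = normalize(segments[i]["text"])
        j = i + 1
        # Extend the run while the next segment repeats the same non-empty text.
        while j < len(segments) and key and normalize(segments[j]["text"]) == key:
            j += 1
        if j - i >= min_repeats:
            runs.append({
                "first_index": i,                  # index of the first repeated segment
                "start": segments[i]["start"],     # where the repetition begins
                "end": segments[j - 1]["end"],     # where the run ends
                "repeats": j - i,
                "text": key,
            })
        i = j
    return runs
```

The `start` of the first run is a candidate point from which to relaunch the transcription with different settings or another model.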

jmgirard commented 3 months ago

I've been running into this issue a lot with large-v3. Makes it basically unusable for my purposes. Sounds like v2 may be better?

jwijffels commented 3 months ago

Yes, use large-v2 or medium, and remove silences. The best model for silence removal is Silero; webrtc is a lot faster but less accurate.

Next, plug the detected non-silence periods into the predict function: either use the argument sections (which will create a new audio file based on these voiced sections) or the arguments offset/duration (which will also look a bit around the cutoff timepoints). Both are available since audio.whisper 0.4.
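To make the step from detected voiced periods to a sections table concrete, here is a small Python sketch. It assumes the voice activity detector returns `(start, end)` pairs in seconds and that the sections table wants `start` and `duration` in milliseconds; both the units and the merging of short gaps via `max_gap` are assumptions for illustration, so check the package documentation for the exact format expected by predict.

```python
def voiced_to_sections(voiced, max_gap=0.5):
    """Convert voiced (start, end) pairs in seconds to start/duration rows.

    Voiced periods separated by at most `max_gap` seconds are merged so that
    short pauses inside an utterance do not split it into separate sections.
    Returned values are in milliseconds (an assumption, see the text above).
    """
    merged = []
    for start, end in sorted(voiced):
        if merged and start - merged[-1][1] <= max_gap:
            # Small gap: extend the previous section instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [
        {"start": round(s * 1000), "duration": round((e - s) * 1000)}
        for s, e in merged
    ]
```

For example, voiced periods at 0.0 to 1.2 s, 1.5 to 3.0 s and 10.0 to 12.5 s collapse into two sections, because the 0.3 s pause is below the default `max_gap`.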

Besides that, I hope https://github.com/ggerganov/whisper.cpp/pull/1768 will also bring improvements once it is incorporated in whisper.cpp and in audio.whisper.

jmgirard commented 3 months ago

large-v2 seems to be doing better (even without removing the silences). Interestingly, it is also running a lot faster than v3, presumably because it is not wasting as much time hallucinating. Trying audio.vadsilero now...

Moved discussion over to #62