elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Explore silence detection in speech-to-text #379

Open · jonatanklosko opened this issue 4 months ago

jonatanklosko commented 4 months ago

Whisper may hallucinate text when an audio chunk is silence or noise (see https://github.com/elixir-nx/bumblebee/issues/377#issuecomment-2208521942). The openai-whisper implementation has no_speech_threshold and logprob_threshold options that may be relevant here. A quick search turns up a few discussions around Whisper hallucinations, so it may be worth experimenting to see if there is something we can incorporate into the current algorithm.
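For reference, here is a minimal sketch of the kind of per-chunk rule those two options drive, assuming we had the <|nospeech|> probability and the per-token logprobs for a chunk. The module name is illustrative (not an existing Bumblebee API), and the threshold values are taken from openai-whisper's defaults:

```elixir
# Illustrative only, not an existing Bumblebee API. Given the probability of
# the <|nospeech|> token and the log-probabilities of the generated tokens for
# one chunk, decide whether the chunk should be treated as silence/noise.
defmodule SilenceHeuristic do
  # Default values used by openai-whisper for the corresponding options.
  @no_speech_threshold 0.6
  @logprob_threshold -1.0

  def skip_chunk?(no_speech_prob, token_logprobs) do
    avg_logprob = Enum.sum(token_logprobs) / length(token_logprobs)

    # Skip only when the model both signals "no speech" and is not confident
    # about the text it generated for the chunk.
    no_speech_prob > @no_speech_threshold and avg_logprob < @logprob_threshold
  end
end
```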

noozo commented 2 months ago

Any progress on this? Compared to Python, the transcripts produced by Bumblebee are pretty bad: lots of repeated sentences, missing text, etc. We are on the verge of giving up and moving to a SaaS for this, unfortunately :(

josevalim commented 2 months ago

PRs are definitely welcome.

tubedude commented 1 month ago

Jonatan, Valim, I was looking at this issue and thought of implementing a "silence_processor" as part of the logits_processors.

Specifically, I was thinking of changing these two files:

I still need to review and test all the logic, but do you think this would be the right place to implement such a processor?

jonatanklosko commented 1 month ago

@tubedude unfortunately it doesn't fit into the usual logits processing approach. We generate the transcription token by token, and logits processing applies some transformation to the logits at each iteration. My understanding is that the <|nospeech|> token acts as a (somewhat hacky) voice activity detector for the whole input chunk. What openai-whisper does is track the <|nospeech|> probability only from the last iteration (the last token prediction) and then use it, combined with the average logprob, to decide whether the whole chunk should be skipped.

While looking around, I noticed that huggingface/transformers made significant changes to long-form transcription within the last year. They added support for sequential transcription of long inputs, similar to openai-whisper, for improved transcription quality. The implementation involves several techniques, including the nospeech detection. They do use a logits processor as part of this, however not to alter the logits, but rather to accumulate information in the processor's state and extract it later, when deciding whether a chunk is silence (the authors themselves consider it hacky, but that's what they did to match the openai-whisper implementation, ref). This hack doesn't really fit into our functional implementation, and regardless, it is only applicable within the new long-form implementation. The two main PRs with the new changes are https://github.com/huggingface/transformers/pull/27492 and https://github.com/huggingface/transformers/pull/27658.
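To make the mismatch concrete, here is a rough Elixir translation of the pattern described above, using an Agent purely for illustration. All module, function, and context field names are hypothetical, not Bumblebee's actual generation interface: the "processor" returns the logits untouched and merely stashes a statistic for a later decision, and that kind of side effect is exactly what a defn-compiled, functional generation loop cannot accommodate.

```elixir
# Hypothetical sketch (not Bumblebee's API): a "logits processor" that records
# the <|nospeech|> probability instead of transforming the logits. Because it
# overwrites the value on every step, only the last iteration's value remains,
# matching the behaviour described above.
defmodule NoSpeechRecorder do
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> nil end, name: __MODULE__)

  # Assumes `logits` is a flat, vocab-sized Nx tensor and `nospeech_id` is the
  # id of the <|nospeech|> token in the vocabulary.
  def process(logits, %{nospeech_id: nospeech_id}) do
    prob = logits |> softmax() |> Nx.to_flat_list() |> Enum.at(nospeech_id)
    Agent.update(__MODULE__, fn _ -> prob end)
    # The logits are returned unchanged; the processor exists only for its side effect.
    logits
  end

  # Read the accumulated statistic once generation for the chunk is done.
  def no_speech_prob, do: Agent.get(__MODULE__, & &1)

  defp softmax(logits) do
    exp = Nx.exp(Nx.subtract(logits, Nx.reduce_max(logits)))
    Nx.divide(exp, Nx.sum(exp))
  end
end
```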

So taking a step back, huggingface/transformers now has two separate approaches to long-form transcription: (a) "sequential" long-input generation (which does the nospeech detection, among other techniques) and (b) chunked generation with output merging. Our current implementation does (b). Maintaining both, especially with streaming, is most likely too much. Implementing (a) is a lot of work and I think there are challenges related to serving and streaming, because the input slice points are not known upfront (the offsets are adjusted on each iteration).
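For context, approach (b) is what the existing Whisper serving already exposes, roughly along the lines of the sketch below (option names as in the current Bumblebee docs; the checkpoint and values are just an example and may differ across versions):

```elixir
# Approach (b): long audio is split into overlapping chunks, each chunk is
# transcribed independently, and the chunk outputs are merged afterwards.
{:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config,
    chunk_num_seconds: 30,
    context_num_seconds: 5,
    timestamps: :segments,
    defn_options: [compiler: EXLA]
  )

Nx.Serving.run(serving, {:file, "/path/to/audio.wav"})
```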

All that said, I think it may be worth looking at the PRs and the paper mentioned in them, and considering a different implementation for long-form transcription. Given the complexity, I can't really point to anything directly actionable, and it's not something we can prioritize at the moment.