Chunking is the next planned feature. Right now it clips audio to roughly the first 30 seconds for the encoder, but the decoder sequence length isn't limited, so it will overflow if it doesn't detect the end by the 30-second mark.
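To make the current behavior concrete, here is a minimal Python sketch (the report contains no code, so the language and all names such as `clip_to_window`, `decode_with_limit`, and `MAX_DECODE_TOKENS` are illustrative, not the project's actual API). It shows the 30-second clip on the encoder side and one way to cap decoder steps so the sequence cannot grow unbounded:

```python
import numpy as np

SAMPLE_RATE = 16_000           # Whisper operates on 16 kHz mono audio
CLIP_SECONDS = 30              # one encoder window covers 30 seconds
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS
MAX_DECODE_TOKENS = 448        # hypothetical hard cap on decoded tokens

def clip_to_window(audio: np.ndarray) -> np.ndarray:
    """Keep only the first 30 seconds and zero-pad shorter input."""
    clipped = audio[:CLIP_SAMPLES]
    if len(clipped) < CLIP_SAMPLES:
        clipped = np.pad(clipped, (0, CLIP_SAMPLES - len(clipped)))
    return clipped

def decode_with_limit(decoder_step, eot_token: int) -> list[int]:
    """Greedy decode loop that stops at end-of-transcript OR at a token cap,
    instead of overflowing when the end is never predicted."""
    tokens: list[int] = []
    for _ in range(MAX_DECODE_TOKENS):
        next_token = decoder_step(tokens)   # stand-in for one model call
        if next_token == eot_token:
            break
        tokens.append(next_token)
    return tokens
```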
Rudimentary chunking is now implemented. Your long audio files should now work, although there is some minor transcription inaccuracy around the chunk edges. I tried incorporating the last few tokens from the previous chunk into Whisper to remedy the chunk-edge issues, but then Whisper repeated itself severely and stopped predicting the end of chunks, so I had to revert that change. Any ideas why Whisper is so finicky when exposed to tokens from the previous chunk?
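For readers following along, a rough Python sketch of what "rudimentary chunking" could look like, under the same assumptions as above (non-overlapping 30-second windows, results concatenated). `transcribe_clip` is a stand-in for a single encoder/decoder pass, and the commented-out line marks the reverted previous-chunk-token experiment described here:

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = SAMPLE_RATE * 30   # one 30-second encoder window per chunk

def transcribe_long(audio: np.ndarray, transcribe_clip) -> str:
    """Split long audio into fixed 30-second chunks and transcribe each one.

    Words cut mid-chunk at the boundaries match the minor inaccuracy
    reported around chunk edges.
    """
    pieces = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        # Reverted experiment: seeding the decoder with the last few tokens
        # of the previous chunk as a prompt, which made the model repeat
        # itself and stop predicting the end of chunks.
        # text = transcribe_clip(chunk, prompt=previous_tokens[-32:])
        pieces.append(transcribe_clip(chunk))
    return " ".join(pieces)
```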
Mint
"whisper is so finicky when exposed to tokens from the previous chunk"

Sounds like Whisper hallucination; it happens in other implementations as well. I would have to dig into this one...
OS: macOS Ventura
Seems like with the tiny model, transcription works, but when using the medium model you get a buffer size error. Perhaps we could do chunking.
Update
Using a six-minute audio file with the tiny model produces the same issue.