Const-me / Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Mozilla Public License 2.0

Speaker identification #83

Open wuzimi opened 1 year ago

wuzimi commented 1 year ago

When I use Whisper to transcribe, the output sometimes contains a special mark "-" indicating a new speaker, which suggests your app can identify speakers. However, this is not always the case. I doubt it has anything to do with the audio file: on another occasion I used the same audio file with a different model, and I got a transcript with speaker marks using the small model, while they disappeared using the base model. I guess the speaker identification ability is actually embedded in your app but is not stable. I hope this can be solved in the future.

emcodem commented 11 months ago

Actually, this application (the Const-me inference) has nothing to do with any of that. What you see is a by-product of the Whisper models having been trained on 680,000 hours of audio with existing subtitles downloaded from the internet; the behaviour you describe is completely undefined. Speaker identification is not a Whisper feature by any means; again, if you see anything pointing in that direction, it is pure accident.

If you need defined behaviour for speaker separation, you can try the diarize feature of the main.exe example. To identify speakers, you will need a model that has been trained for that purpose; Whisper, by contrast, was trained for general-purpose speech-to-text.
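To illustrate what "defined behaviour" can look like, here is a minimal sketch of one simple diarization idea for stereo recordings where each speaker sits mostly on their own channel: compare the per-channel energy within a segment. This is only an assumption about how such a feature could work, not a description of what main.exe actually does; the function name `guess_speaker`, the `margin` value of 1.1, and the sample data are all hypothetical.

```python
from typing import Sequence


def guess_speaker(left: Sequence[float], right: Sequence[float],
                  start: int, end: int, margin: float = 1.1) -> str:
    """Guess which stereo channel dominates the samples in [start, end).

    Hypothetical helper: `margin` is an assumed threshold, not a value
    taken from any real implementation.
    """
    # Sum of absolute amplitudes as a crude per-channel energy measure.
    e_left = sum(abs(s) for s in left[start:end])
    e_right = sum(abs(s) for s in right[start:end])

    # Only decide when one channel is clearly louder than the other.
    if e_left > margin * e_right:
        return "speaker 0"
    if e_right > margin * e_left:
        return "speaker 1"
    return "unknown"


# Made-up samples: the left channel is active first, then the right one.
left = [0.5, -0.4, 0.6, 0.0, 0.01, 0.0]
right = [0.01, 0.0, 0.02, 0.5, -0.6, 0.4]

first_half = guess_speaker(left, right, 0, 3)
second_half = guess_speaker(left, right, 3, 6)
whole_clip = guess_speaker(left, right, 0, 6)
```

With these samples the first half is attributed to the left channel, the second half to the right channel, and the whole clip stays undecided because neither channel dominates. Unlike the "-" marks in the transcript, the result depends only on the audio, so it is repeatable across models, but it only works when the speakers are actually separated by channel.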