MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

Diarization is not working fine for all audios #147

Open · Asma-droid opened this issue 10 months ago

Asma-droid commented 10 months ago

Diarization is not working well for me. For some audios, all segments are identified as Speaker 0. Are there any solutions to improve diarization quality?

Yjppj commented 10 months ago

I had the same problem

GuyPaddock commented 3 months ago

I just ran into this. Do you get the same results if you try the transcription on just a few minutes of your audio at a time?

I have an audio file that's 41 minutes long and has about 16 speakers (total), and every single speaker comes out as "Speaker 0" in the transcript. However, if I extract a 5-minute chunk of the audio file and transcribe just that chunk, I get pretty good results, with two similar-sounding speakers sometimes getting categorized together (understandable, though).
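If it helps anyone reproduce this, a chunk like that can be pulled out with, e.g., pydub; this is just a sketch (ffmpeg or any other tool works just as well, and the filenames are placeholders):

```python
# Extract the first 5 minutes of an audio file for a quick diarization test.
# Requires: pip install pydub (and ffmpeg on the PATH for non-WAV inputs).
from pydub import AudioSegment

audio = AudioSegment.from_wav("input.wav")     # placeholder filename
chunk = audio[: 5 * 60 * 1000]                 # pydub slices in milliseconds
chunk.export("input_chunk.wav", format="wav")  # run diarization on this chunk
```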

I have also noticed that I get different speaker identification from the parallel vs. non-parallel diarize.py with the five-minute clip. For my clip, the parallel version looked more correct (the actual transcript was identical between the two; only the speaker identification was worse in the non-parallel version).

MahmoudAshraf97 commented 3 months ago

@GuyPaddock , can you upload the audio file to reproduce?

GuyPaddock commented 3 months ago

Sadly, it's a student project with audio taken from some commercial IP I can't share publicly, but if there's some private way to share it I could.

GuyPaddock commented 3 months ago

Also, thank you for the fast reply and for your work on this project! I am happy to help any way that I can.

MahmoudAshraf97 commented 3 months ago

You can upload it to a Drive link and share it with me: hassouna97.ma@gmail.com

GuyPaddock commented 3 months ago

Shared! Let me know if you don't receive an email from Drive.

MahmoudAshraf97 commented 3 months ago

Got it. I'm currently working on pushing some updates to this repo; I'll debug it afterwards.

GuyPaddock commented 3 months ago

Thanks!

GuyPaddock commented 3 months ago

BTW, I am not using stems because I got better transcription with this particular audio file without them. I'm also using Whisper large-v3. So my command line looks like this:

```
python diarize_parallel.py --audio ./input.wav --language en --whisper-model large-v3 --device 'cuda' --no-stem
```

GuyPaddock commented 2 months ago

@MahmoudAshraf97 Today I tried actually listening to the audio at the timestamps indicated in the SRT for the file that I shared with you, and I'm finding that the times don't line up with the lines being spoken. In many cases, the timestamps in the SRT are 5-15 seconds ahead of the text that was transcribed, and the segment might not even contain the entire line of dialogue. I wonder if that's impacting speaker attribution?
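One quick way to sanity-check that theory is to shift every SRT timestamp by a fixed offset and see whether the alignment improves; a rough sketch using the third-party srt package (the offset value and filenames are just placeholders):

```python
# Shift all subtitles in an SRT file by a fixed offset to test whether a
# constant delay explains the misalignment.
# Requires: pip install srt
import datetime
import srt

OFFSET = datetime.timedelta(seconds=10)  # hypothetical 10 s shift

with open("input.srt", encoding="utf-8") as f:
    subs = list(srt.parse(f.read()))

for sub in subs:
    sub.start += OFFSET
    sub.end += OFFSET

with open("input_shifted.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(subs))
```

If a single constant offset fixes the drift, the problem is likely a global alignment bug rather than per-segment timestamp errors.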

Chi8wah commented 1 week ago

Facing the same problem; is there any update? In my case, a 6-second recording of a single person gets identified as 2 speakers, but a 30+ minute recording with maybe 8 or more people gets identified as only 1 speaker, so every segment is labeled 'Speaker 0'.