Asma-droid opened this issue 10 months ago
I had the same problem
I just ran into this. Do you get the same results if you try transcribing just a few minutes of your audio at a time?
I have an audio file that's 41 minutes long and has about 16 speakers (total), and every single speaker comes out as "Speaker 0" in the transcript. However, if I extract a 5-minute chunk of the audio file and transcribe just that chunk, I get pretty good results, with two similar-sounding speakers sometimes getting categorized together (understandable, though).
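For reference, here is a rough sketch of how one could slice a long recording into 5-minute chunks before diarizing each one. This is just an illustration, not part of the repo; pydub is one option among many, and the file names are placeholders:

```python
# Split a long recording into 5-minute chunks so each chunk can be diarized
# separately (the workaround described above). Paths are examples only.
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

CHUNK_MS = 5 * 60 * 1000  # 5 minutes in milliseconds

audio = AudioSegment.from_file("input.wav")
for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
    chunk = audio[start:start + CHUNK_MS]          # pydub slices by milliseconds
    chunk.export(f"input_chunk_{i:02d}.wav", format="wav")
# Each input_chunk_XX.wav can then be passed to diarize.py / diarize_parallel.py.
```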
I have also noticed that I get different speaker identification in parallel vs. non-parallel diarize.py with the five-minute clip. For my clip, the parallel version looked more correct: the actual transcript was identical between the two, and only the speaker identification differed, with the non-parallel version being worse.
@GuyPaddock, can you upload the audio file so I can reproduce the issue?
Sadly, it's a student project with audio taken from some commercial IP I can't share publicly, but if there's some private way to share it I could.
Also, thank you for the fast reply and for your work on this project! I am happy to help any way that I can.
You can upload it to a Drive link and share it with me at hassouna97.ma@gmail.com
Shared! Let me know if you don't receive an email from Drive.
Got it. I'm currently working on pushing some updates to this repo; I'll debug it after that.
Thanks!
BTW, I am not using stems because I got better transcription on this particular audio file without them. I'm also using Whisper large-v3. So, my command line looks like this:
```
python diarize_parallel.py --audio ./input.wav --language en --whisper-model large-v3 --device 'cuda' --no-stem
```
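If it helps, here is a rough sketch (not from the repo) of scripting that same command over pre-cut 5-minute chunks; the chunk filenames are an assumption (e.g. produced with the pydub snippet above or with ffmpeg):

```python
# Run the same diarize_parallel.py invocation over several pre-cut chunks
# instead of the full 41-minute file. Chunk filenames are assumed.
import subprocess
from pathlib import Path

for chunk in sorted(Path(".").glob("input_chunk_*.wav")):
    subprocess.run(
        [
            "python", "diarize_parallel.py",
            "--audio", str(chunk),
            "--language", "en",
            "--whisper-model", "large-v3",
            "--device", "cuda",
            "--no-stem",
        ],
        check=True,  # stop if any chunk fails
    )
```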
@MahmoudAshraf97 Today I tried actually listening to the audio at the timestamps indicated in the SRT for the file I shared with you, and I'm finding that the timestamps don't line up with the lines being spoken. In many cases, a timestamp in the SRT is 5-15 seconds before the text that was transcribed, and the time range might not even contain the entire line of dialogue. I wonder if that's impacting speaker attribution?
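For anyone who wants to spot-check the alignment themselves, here is a rough sketch (filenames are placeholders, pydub is just one convenient tool) that cuts out the audio behind each SRT cue so you can listen and compare it against the subtitle text:

```python
# Spot-check SRT alignment: export the audio behind the first few cues as
# separate clips so they can be compared against the subtitle text by ear.
import re
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" timestamp lines in an SRT file.
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3}) --> (\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

audio = AudioSegment.from_file("input.wav")
with open("input.srt", encoding="utf-8") as f:
    cues = TS.findall(f.read())

for i, cue in enumerate(cues[:5], start=1):
    start, end = to_ms(*cue[:4]), to_ms(*cue[4:])
    audio[start:end].export(f"cue_{i:03d}.wav", format="wav")
```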
I'm facing the same problem; is there any update? In my case, a 6-second clip of a single speaker gets identified as 2 speakers, but a 30+ minute recording with maybe 8 or more people gets identified as only 1 speaker, so every segment ends up labeled 'Speaker 0'.
Diarization is not working well for me either. For some audio files, all segments are identified as Speaker 0. Are there any ways to improve diarization quality?