MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License
2.53k stars, 243 forks

Diarization issue: all dialogues are assigned to Speaker 0 only. #113

Closed: manjunath7472 closed this issue 3 months ago

manjunath7472 commented 9 months ago

Below is an audio file to reproduce the issue. Audio.

Actual output:

Speaker Name,in,out,Text

Speaker 0,00:04:41.4,00:07:32.15,You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You I will give you my feedback.

Speaker 0,00:07:36.0,00:07:36.7,Okay.

Speaker 0,00:07:36.8,00:07:36.22,"All right, dear."

Speaker 0,00:07:37.4,00:07:38.21,So let's start with today's class.

Speaker 0,00:07:39.7,00:07:54.15,"And we are going to do C six today and C five you did with some other teacher, right?"

Speaker 0,00:07:55.14,00:07:55.19,Yeah.

Speaker 0,00:07:56.13,00:07:56.22,Okay.

Speaker 0,00:07:57.10,00:08:00.22,"Yeah, because I was on well, so I canceled the class."

Speaker 0,00:08:00.23,00:08:02.7,So you did it with the other teacher.

Speaker 0,00:08:02.18,00:08:02.24,Yes.

Speaker 0,00:08:04.10,00:08:05.1,"Okay, great."

Speaker 0,00:08:05.7,00:08:06.9,So you understood that?

Speaker 0,00:08:08.12,00:08:10.11,Can you tell me you understood that?

Speaker 0,00:08:10.13,00:08:13.19,Can you tell me what concept did you learn in the last class?

Speaker 0,00:08:14.10,00:08:17.5,"Yeah, I didn't understand it."

Speaker 0,00:08:22.15,00:08:23.19,You didn't understand that?

Speaker 0,00:08:24.16,00:08:25.19,I understand it.

Speaker 0,00:08:26.7,00:08:27.21,"Okay, so what was it?"

Speaker 0,00:08:28.2,00:08:31.4,Can you tell me which game you created in that class?

Speaker 0,00:08:32.11,00:08:34.11,Chasing the mouse.

Speaker 0,00:08:34.16,00:08:38.7,"Oh, that's an interesting game."

Speaker 0,00:08:38.8,00:08:38.11,Yes.

Speaker 0,00:08:49.1,00:08:49.6,Good.

Speaker 0,00:08:49.7,00:08:49.23,Fantastic.

v-nhandt21 commented 9 months ago

Yep, I got the same error. Have you found the cause?

solucionesuno commented 8 months ago

No solution yet?

rbracco commented 8 months ago

I have the same issue and investigated. The "Speaker 0" label on every line appears to be the direct output of the underlying diarization model from the NeMo toolkit: nemo.collections.asr.models.msdd_models.NeuralDiarizer. So the bug seems to be in the NeMo toolkit, not in this library. We might all be better off trying pyannote for the diarization.
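
For anyone who wants to repeat that check, here is a minimal sketch that counts the speaker labels in the diarizer's RTTM output before any transcript alignment happens. It assumes the diarizer writes a standard RTTM file; the helper name speaker_durations and the sample rows are hypothetical and are not taken from the audio in this issue.

```python
from collections import Counter


def speaker_durations(rttm_text: str) -> Counter:
    """Sum speech duration in seconds per speaker label found in RTTM text."""
    totals = Counter()
    for line in rttm_text.splitlines():
        fields = line.split()
        # RTTM SPEAKER rows: type, file, channel, onset, duration,
        # ortho, subtype, speaker name, confidence, lookahead.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])
    return totals


# Hypothetical RTTM rows for illustration only.
sample = """\
SPEAKER mono_file 1 281.160 171.440 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER mono_file 1 456.000 0.280 <NA> <NA> speaker_0 <NA> <NA>
"""

totals = speaker_durations(sample)
# A single label means the diarizer itself collapsed everything to one speaker.
collapsed = len(totals) == 1
print(dict(totals), "collapsed to one speaker:", collapsed)
```

If the RTTM file already contains a single label, the problem sits in the diarization step rather than in the later word-to-speaker alignment.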

v-nhandt21 commented 8 months ago

I found that the problem comes from the quality of the NeMo model: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/diar_msdd_telephonic

Since this model is only trained on telephonic speech, diarization performance on other acoustic conditions might show a degraded performance compared to telephonic speech.

kalisgd0 commented 7 months ago

Had the same issue with Demucs. Disabling it (--no-stem) helped.
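
A minimal sketch of scripting that workaround, assuming the repository's diarize.py entry point and its -a audio argument (both are assumptions here; only --no-stem comes from this thread). The helper build_diarize_command and the file name are hypothetical.

```python
import sys


def build_diarize_command(audio_path: str, use_stem: bool = False) -> list[str]:
    """Build the argument list for a diarization run.

    Assumes a diarize.py script that accepts -a <audio>; --no-stem is the
    flag mentioned in this thread for skipping Demucs source separation.
    """
    cmd = [sys.executable, "diarize.py", "-a", audio_path]
    if not use_stem:
        cmd.append("--no-stem")
    return cmd


# Hypothetical file name; print the command rather than running it.
command = build_diarize_command("class_recording.wav")
print(" ".join(command))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```

Comparing the output with and without --no-stem on the same file should show whether source separation is what collapses the speakers.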