MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

The Diarization does not work #99

Closed: projects-g closed this issue 2 months ago

projects-g commented 9 months ago

I fed it two sample audio files, 3 minutes and 30 minutes long. For both inputs (.mp3), I get the same diarization result: all of the text is attributed to a single speaker.

1
00:00:04,540 --> 30:03:25,791
Speaker 0: 
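A quick way to spot this failure mode is to sanity-check the SRT output for implausibly long segments: a single segment spanning roughly 30 hours, as above, means diarization has collapsed. A minimal sketch in Python (the one-hour threshold is an arbitrary assumption, not from the repo):

```python
import re

def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:00:04,540' to seconds."""
    h, m, s_ms = ts.split(":")
    s, ms = s_ms.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def flag_suspicious_segments(srt_text: str, max_duration_s: float = 3600.0):
    """Return (start, end, duration) for segments longer than max_duration_s."""
    pattern = re.compile(r"(\d+:\d+:\d+,\d+) --> (\d+:\d+:\d+,\d+)")
    flagged = []
    for start, end in pattern.findall(srt_text):
        duration = srt_time_to_seconds(end) - srt_time_to_seconds(start)
        if duration > max_duration_s:
            flagged.append((start, end, duration))
    return flagged
```

Running this over the snippet above flags the single 30-hour segment immediately.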
manjunath7472 commented 9 months ago

I saw this issue when the input audio is cut off abruptly. I tried extending the audio and it worked, though the process has lots of glitches. What is your audio length? Did you try diarizing a longer audio file?

manjunath7472 commented 9 months ago

In the transcribe cell, we need to remove the lines below to avoid a kernel crash while transcribing. It also solves the problem where all dialogue is attributed to one speaker.

del whisper_model
torch.cuda.empty_cache()

projects-g commented 9 months ago

@manjunath7472 Yes, I have done that. Another question I have: according to the configurations provided, there are 3 types (meeting, telephonic, general), but the MSDD model for meeting and general is actually set to None. So we can't use those configurations, which means we can only predict 2 speakers. The only MSDD model with a valid path is the one for the telephonic type: "diar_msdd_telephonic".

Can you add more info on this ?

MahmoudAshraf97 commented 9 months ago

@projects-g, @manjunath7472, can you provide me with the audio file so I can reproduce this issue?

The following lines just clear the GPU memory for the subsequent steps; they have absolutely no effect on the results:

del whisper_model
torch.cuda.empty_cache()
projects-g commented 9 months ago

@MahmoudAshraf97 I understood that they don't have any effect other than on memory. I cannot share the file as it is huge. I will try to test with another ~10 min input file, as my initial one was only 2 minutes.

But could you add any info about the other part of my question, about the msdd_model being available only for the "telephonic" type and not for the other two (general, meeting)? If one were to use the "meeting" or "general" type for diarization, how would one go about it?

manjunath7472 commented 9 months ago

Below is the requested audio file with the same issue. Initially, with the default settings in transcribe(), it attributes all dialogue to a single speaker. Audio Then I added the line below to transcribe() and it transcribes fine.

vad_parameters=dict(threshold=0.4, max_speech_duration_s=15)
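For context, faster-whisper (which this repo uses for transcription) accepts these options through `WhisperModel.transcribe()` via `vad_filter`/`vad_parameters`. A hedged sketch; the model size and audio path are placeholders, not from this thread:

```python
# VAD options from the comment above: a lower threshold treats more audio as
# speech, and max_speech_duration_s caps each speech chunk at 15 seconds.
vad_parameters = dict(
    threshold=0.4,
    max_speech_duration_s=15,
)

# Usage sketch (placeholder model size and file name):
# from faster_whisper import WhisperModel
# model = WhisperModel("medium.en")
# segments, info = model.transcribe(
#     "audio.mp3",
#     vad_filter=True,              # enable Silero VAD filtering
#     vad_parameters=vad_parameters,
# )
```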

But diarization didn't cluster anything and just labelled all dialogue as Speaker 0.

A short excerpt of the result is below.

Speaker Name,in,out,Text

Speaker 0,00:04:41.4,00:07:32.15,You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You I will give you my feedback.

Speaker 0,00:07:36.0,00:07:36.7,Okay.

Speaker 0,00:07:36.8,00:07:36.22,"All right, dear."

Speaker 0,00:07:37.4,00:07:38.21,So let's start with today's class.

Speaker 0,00:07:39.7,00:07:54.15,"And we are going to do C six today and C five you did with some other teacher, right?"

Speaker 0,00:07:55.14,00:07:55.19,Yeah.

Speaker 0,00:07:56.13,00:07:56.22,Okay.

Speaker 0,00:07:57.10,00:08:00.22,"Yeah, because I was on well, so I canceled the class."

Speaker 0,00:08:00.23,00:08:02.7,So you did it with the other teacher.

Speaker 0,00:08:02.18,00:08:02.24,Yes.

Speaker 0,00:08:04.10,00:08:05.1,"Okay, great."

Speaker 0,00:08:05.7,00:08:06.9,So you understood that?

Speaker 0,00:08:08.12,00:08:10.11,Can you tell me you understood that?

Speaker 0,00:08:10.13,00:08:13.19,Can you tell me what concept did you learn in the last class?

Speaker 0,00:08:14.10,00:08:17.5,"Yeah, I didn't understand it."

Speaker 0,00:08:22.15,00:08:23.19,You didn't understand that?

Speaker 0,00:08:24.16,00:08:25.19,I understand it.
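One way to confirm this collapse programmatically is to count the distinct speaker labels in the CSV output. A minimal sketch, assuming the column header shown above ("Speaker Name,in,out,Text"):

```python
import csv
import io

def speaker_distribution(csv_text: str) -> dict:
    """Count how many segments are attributed to each speaker label."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["Speaker Name"]
        counts[name] = counts.get(name, 0) + 1
    return counts

# Two rows taken from the output above.
sample = """Speaker Name,in,out,Text
Speaker 0,00:07:36.0,00:07:36.7,Okay.
Speaker 0,00:07:37.4,00:07:38.21,So let's start with today's class.
"""
dist = speaker_distribution(sample)
if len(dist) == 1:
    print("Diarization collapsed: only", next(iter(dist)), "was found")
```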
v-nhandt21 commented 9 months ago


Since this model is only trained on telephonic speech, diarization performance on other acoustic conditions might show a degraded performance compared to telephonic speech.

I found the model from the NGC of Nvidia: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/diar_msdd_telephonic

For the general and meeting types, there is no model support.
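Putting the thread's findings together, the domain-to-model mapping looks roughly like this. This is a sketch based on the comments above, not the repo's actual config code:

```python
# MSDD model availability per domain type, per the discussion above: only the
# telephonic checkpoint is published on NVIDIA NGC; the other two domain
# configs ship without an MSDD model, so MSDD-based diarization cannot run
# for them out of the box.
MSDD_MODELS = {
    "telephonic": "diar_msdd_telephonic",  # available on NVIDIA NGC
    "meeting": None,                       # no pretrained MSDD checkpoint
    "general": None,                       # no pretrained MSDD checkpoint
}

def msdd_model_for(domain: str) -> str:
    """Return the pretrained MSDD model name for a domain, or raise."""
    model = MSDD_MODELS.get(domain)
    if model is None:
        raise ValueError(f"No pretrained MSDD model for domain '{domain}'")
    return model
```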

I tried to clone the NeMo package and print the predictions of the MSDD model, but it seems that the misprediction comes from the NVIDIA model, not from the repo author's implementation.


Anyway, I am trying to find out which audio is suitable for this model.

Training an MSDD model yourself seems to be hardcore :))

Asma-droid commented 7 months ago

I have the same issue with long files! Any ideas on how to solve the problem, please?