felixfuyihui / AISHELL-4

Apache License 2.0
112 stars · 25 forks

About speaker diarization on AISHELL-4 #7

Open · hbredin opened this issue 2 years ago

hbredin commented 2 years ago

Thanks for sharing this dataset!

I plan to train and evaluate pyannote speaker diarization pipelines on AISHELL-4.

  1. I'd like to understand the speaker diarization labels better. In particular, I'd like to know if speaker labels are global to the whole dataset or only local to each file. For instance, can we assume that speaker 001-M in file 20200705_M_R002S01C01 is the same as speaker 001-M in file 20200616_M_R001S01C01? Or, are speaker labels recycled and inconsistent across files?

  2. Are you aware of any published speaker diarization results on AISHELL-4?

felixfuyihui commented 2 years ago

Hi Hervé BREDIN,

Thank you for your question. Speaker labels are only local to each file: 001-M in session 1 is not the same person as 001-M in session 2. It is therefore unreasonable to train a speaker verification model on AISHELL-4, but diarization is fine. We do not have diarization results on AISHELL-4; we apologize for that. By the way, may I ask which company/lab you belong to? I'd like to learn more about speech labs in Europe and the USA.

Best,

Yihui Fu
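
Since labels such as 001-M are only unique within a file, anyone building training manifests needs to namespace them to avoid accidental cross-session collisions. A minimal sketch over standard RTTM files (the file paths are hypothetical):

```python
# Prefix each speaker label with its file ID so that labels that are only
# locally unique (e.g. "001-M" in several sessions) never collide globally.
# Standard RTTM field layout:
#   SPEAKER <file-id> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>

def namespace_rttm(in_path: str, out_path: str) -> None:
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fields = line.split()
            if fields and fields[0] == "SPEAKER":
                file_id, speaker = fields[1], fields[7]
                fields[7] = f"{file_id}_{speaker}"  # e.g. 20200616_M_R001S01C01_001-M
            fout.write(" ".join(fields) + "\n")

# Hypothetical paths.
namespace_rttm("aishell4_train.rttm", "aishell4_train_namespaced.rttm")
```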


hbredin commented 2 years ago

> Thank you for your question. Speaker labels are only local to each file: 001-M in session 1 is not the same person as 001-M in session 2. It is therefore unreasonable to train a speaker verification model on AISHELL-4, but diarization is fine.

Thanks for your answer. Is it possible for one speaker to speak in two different sessions? If so, did you keep the list of speakers participating in each session? This information would be very useful for trying to match speakers across sessions using multiple instance learning.
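
A minimal sketch of the matching step this information would enable, assuming one averaged x-vector per local speaker per session has already been extracted (the `embeddings` dict and the similarity threshold are hypothetical):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speakers(embeddings: dict, threshold: float = 0.7) -> list:
    """Link local speakers across sessions when their centroids are similar.

    embeddings maps session -> {local speaker label -> centroid x-vector},
    e.g. embeddings["20200616_M_R001S01C01"]["001-M"] -> np.ndarray.
    Returns (session_a, spk_a, session_b, spk_b, score) tuples above threshold.
    """
    pairs = []
    sessions = sorted(embeddings)
    for i, sess_a in enumerate(sessions):
        for sess_b in sessions[i + 1:]:
            for spk_a, emb_a in embeddings[sess_a].items():
                for spk_b, emb_b in embeddings[sess_b].items():
                    score = cosine(emb_a, emb_b)
                    if score >= threshold:
                        pairs.append((sess_a, spk_a, sess_b, spk_b, score))
    return pairs
```

A per-session participant list would turn this unconstrained search into a much easier assignment problem, since candidate matches could be restricted to sessions known to share speakers.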

> We do not have diarization results on AISHELL-4; we apologize for that. By the way, may I ask which company/lab you belong to? I'd like to learn more about speech labs in Europe and the USA.

I am an academic researcher based in France.

desh2608 commented 2 years ago

@hbredin Did you get around to running some diarization baselines on this data?

I ran VBx and spectral clustering and got the following results:

| Method   | MS (%) | FA (%) | Conf. (%) | DER (%) |
|----------|--------|--------|-----------|---------|
| VBx      | 17.06  | 1.00   | 3.97      | 22.03   |
| Spectral | 17.06  | 1.00   | 3.11      | 21.17   |
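
For context, such a MS / FA / Conf. breakdown can be computed with pyannote.metrics; a minimal sketch on toy annotations (the thread does not say which scoring tool produced the numbers above):

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference and hypothesis; in practice both would be loaded from RTTM files.
reference = Annotation()
reference[Segment(0.0, 10.0)] = "A"
reference[Segment(8.0, 15.0)] = "B"   # 2 s of overlapped speech

hypothesis = Annotation()
hypothesis[Segment(0.0, 10.0)] = "spk1"
hypothesis[Segment(10.0, 15.0)] = "spk2"

metric = DiarizationErrorRate(collar=0.25, skip_overlap=False)
components = metric(reference, hypothesis, detailed=True)

# 'missed detection', 'false alarm' and 'confusion' correspond to the
# MS / FA / Conf. columns; 'diarization error rate' combines them over total speech.
print(components)
```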

I used the pretrained sad-dihard model from Pyannote for VAD (it works very well!) and the x-vector extractor from BUT's VBx repository. As you can see, the main remaining source of error is missed speech due to overlap. I tried the ovl-dihard and ovl-ami models, but they didn't work very well (perhaps because of the language mismatch). I was wondering if you have any overlap detectors trained on Mandarin?
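
For anyone reproducing this front-end, a minimal sketch of running a pretrained pyannote VAD pipeline with the pyannote.audio 2.x API (this loads pyannote's generic pretrained VAD from Hugging Face, not necessarily the exact sad-dihard model mentioned above; the audio path is hypothetical):

```python
from pyannote.audio import Pipeline

# Pretrained voice activity detection pipeline (pyannote.audio 2.x).
# Depending on the version, a Hugging Face access token may be required
# via use_auth_token=...
vad_pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection")

# Hypothetical AISHELL-4 recording.
vad = vad_pipeline("20200616_M_R001S01C01.wav")

# The result is a pyannote.core.Annotation whose segments are speech regions.
for segment in vad.get_timeline():
    print(f"speech from {segment.start:.2f}s to {segment.end:.2f}s")
```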

hbredin commented 2 years ago

Short answer: those things you propose are on my TODO list. Long answer: I opened a discussion on the pyannote repo in case you'd like to try before I do :)