Open hbredin opened 2 years ago
Hi Hervé BREDIN,
Thank you for your question. The speaker labels are only local to each file, which means 001-M of session 1 is not the same person as 001-M in session 2. So it is not reasonable to train a speaker verification model using AISHELL-4; diarization is OK. We do not have diarization results on AISHELL-4, we apologize for that. BTW, I'd like to know which company/lab you belong to. I'd like to learn more about speech labs in Europe and the USA.
Best,
Yihui Fu
At 2021-12-02 15:50:58, "Hervé BREDIN" @.***> wrote:
Thanks for sharing this dataset!
I plan to train and evaluate pyannote speaker diarization pipelines on AISHELL-4.
I'd like to understand the speaker diarization labels better. In particular, I'd like to know if speaker labels are global to the whole dataset or only local to each file. For instance, can we assume that speaker 001-M in file 20200705_M_R002S01C01 is the same as speaker 001-M in file 20200616_M_R001S01C01? Or, are speaker labels recycled and inconsistent across files?
Are you aware of any published speaker diarization results on AISHELL-4?
Thank you for your question. The speaker labels are only local to each file, which means 001-M of session 1 is not the same person as 001-M in session 2. So it is not reasonable to train a speaker verification model using AISHELL-4; diarization is OK.
Thanks for your answer. Is it possible that one speaker speaks in two different sessions? If so, did you keep the list of speakers participating in each session? This information would be very useful to try and match speakers between sessions using multiple instance learning.
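As an aside, even before trying multiple instance learning, a simpler baseline for cross-session matching would be to compare per-speaker embedding centroids between sessions. This is a hypothetical sketch under the assumption that such centroids (e.g. averaged x-vectors) are available; all labels and vectors below are toy data, not from AISHELL-4:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_speakers(session_a, session_b, threshold=0.7):
    """Greedily pair local speaker labels from two sessions whose
    embedding centroids are similar enough. Returns {label_a: label_b}.
    (A real system would use an optimal assignment, e.g. Hungarian.)"""
    pairs = sorted(
        ((cosine(ea, eb), la, lb)
         for la, ea in session_a.items()
         for lb, eb in session_b.items()),
        reverse=True,
    )
    matched, used_a, used_b = {}, set(), set()
    for score, la, lb in pairs:
        if score < threshold:
            break
        if la in used_a or lb in used_b:
            continue
        matched[la] = lb
        used_a.add(la)
        used_b.add(lb)
    return matched

# Toy 3-d "embeddings"; in practice these would be x-vector centroids.
s1 = {"001-M": [0.9, 0.1, 0.0], "002-F": [0.0, 1.0, 0.1]}
s2 = {"001-M": [0.1, 0.95, 0.05], "003-M": [1.0, 0.2, 0.0]}
print(match_speakers(s1, s2))  # -> {'001-M': '003-M', '002-F': '001-M'}
```

Note how local labels get remapped: the "001-M" of session 2 is matched to "002-F" of session 1, illustrating why labels cannot be assumed global.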
We do not have diarization results on AISHELL-4, we apologize for that. BTW, I'd like to know which company/lab you belong to. I'd like to learn more about speech labs in Europe and the USA.
I am an academic researcher based in France.
@hbredin Did you get around to running some diarization baselines on this data?
I ran VBx and spectral clustering and got the following results:
Method | MS | FA | Conf. | DER |
---|---|---|---|---|
VBx | 17.06 | 1 | 3.97 | 22.03 |
Spectral | 17.06 | 1 | 3.11 | 21.17 |
I used the pretrained `sad-dihard` model from pyannote for VAD (works very well!), and the x-vector extractor from BUT's VBx repository. As you can see, the main remaining source of error is missed speech due to overlaps. I tried the `ovl-dihard` and `ovl-ami` models, but they didn't work very well (perhaps because of language mismatch). I was wondering if you have any overlap detectors trained on Mandarin?
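For readers less familiar with the table columns: DER is simply the sum of missed speech, false alarm, and speaker confusion, expressed relative to total reference speech (e.g. 17.06 + 1 + 3.97 = 22.03 for VBx). A minimal, simplified frame-level sketch of that decomposition follows; it skips the optimal speaker mapping and forgiveness collar that real scorers such as md-eval or pyannote.metrics apply, and all data is toy:

```python
def der_components(reference, hypothesis):
    """Return (miss, fa, conf, der) as fractions of reference speech.
    reference/hypothesis: per-frame speaker labels, None = non-speech.
    Simplified: assumes hypothesis labels are already mapped to reference."""
    assert len(reference) == len(hypothesis)
    ref_speech = sum(1 for r in reference if r is not None)
    miss = sum(1 for r, h in zip(reference, hypothesis)
               if r is not None and h is None)
    fa = sum(1 for r, h in zip(reference, hypothesis)
             if r is None and h is not None)
    conf = sum(1 for r, h in zip(reference, hypothesis)
               if r is not None and h is not None and r != h)
    return tuple(x / ref_speech for x in (miss, fa, conf, miss + fa + conf))

# Toy 8-frame example: one missed frame, one false alarm, one confusion.
ref = ["A", "A", "A", "B", "B", None, None, "A"]
hyp = ["A", "A", None, "B", "A", None, "B", "A"]
print(der_components(ref, hyp))  # each component is 1/6, DER = 3/6 = 0.5
```

This also makes the point from the table concrete: with overlap-unaware VAD, every overlapped frame can contribute at most one hypothesis speaker, so the rest lands in the missed-speech column.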
Short answer: those things you propose are on my TODO list. Long answer: I opened a discussion on the pyannote repo if you are interested in trying before I do :)