BUTSpeechFIT / VBx

Variational Bayes HMM over x-vectors diarization

using ground-truth VAD #51

Closed ZhaZhaFon closed 1 year ago

ZhaZhaFon commented 2 years ago

Hi,

Nice job! Thanks for sharing.

I am trying to use your code to run VoxConverse. I tried to use ground-truth VAD results for clustering, but I found that the VAD results in VBx/VAD/final_system are not consistent with the officially released ones. For example, an official VAD result (here) has more segments than yours (here).

How can I solve this? Thanks.

fnlandini commented 2 years ago

Hi, thank you for your interest. The VAD labels here should correspond to the last row in Table 1. They are from the system we submitted, so there is some error with respect to the oracle ones. You can use the oracle ones if you want: in the main script you only need to switch the VAD path and the rest of the recipe should work (it will, of course, produce different results, since there will be no VAD error).
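For reference, here is a hypothetical helper (not part of the VBx recipe) that converts an oracle RTTM, such as the official VoxConverse ground truth, into per-file .lab files. It assumes the format is one `start end label` line per speech region, as in the files under VBx/VAD; the label string and output layout may need adjusting to your setup:

```python
#!/usr/bin/env python3
"""Convert an oracle RTTM into per-file VAD .lab files (hypothetical helper).

Assumes a .lab format of one "start end label" line per speech region.
Speaker turns from the RTTM are merged into contiguous speech regions.
"""
import os
import sys
from collections import defaultdict


def rttm_to_labs(rttm_path, out_dir, label="speech"):
    # Collect (onset, offset) intervals per recording from the RTTM.
    segs = defaultdict(list)
    with open(rttm_path) as f:
        for line in f:
            parts = line.split()
            if not parts or parts[0] != "SPEAKER":
                continue
            file_id = parts[1]
            onset, dur = float(parts[3]), float(parts[4])
            segs[file_id].append((onset, onset + dur))

    os.makedirs(out_dir, exist_ok=True)
    for file_id, intervals in sorted(segs.items()):
        # Merge overlapping or touching speaker turns into speech regions.
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        with open(os.path.join(out_dir, file_id + ".lab"), "w") as f:
            for start, end in merged:
                f.write(f"{start:.3f} {end:.3f} {label}\n")


if __name__ == "__main__":
    rttm_to_labs(sys.argv[1], sys.argv[2])
```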

ZhaZhaFon commented 2 years ago

> Hi, thank you for your interest. The VAD labels here should correspond to the last row in Table 1. They are from the system we submitted, so there is some error with respect to the oracle ones. You can use the oracle ones if you want: in the main script you only need to switch the VAD path and the rest of the recipe should work (it will, of course, produce different results, since there will be no VAD error).

Now I see. Thanks.

By the way, what should I do if I want to switch to a speaker encoder I trained myself, e.g. the SOTA ECAPA-TDNN? I notice that in vbhmm.py, pre-computed files for PLDA are required. How can I produce these files for my own speaker encoder? Are there any off-the-shelf tools? (I am very unfamiliar with PLDA and scoring back-ends...)

Thanks

fnlandini commented 2 years ago

We have released the code for training the extractor and backend here. In particular, if you look at the script starting here, you will see the instructions to train the PLDA. You might need to make some adjustments to the parameters to use it with your embeddings, but it should work. I hope this helps.
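For intuition only, below is a minimal NumPy sketch of a moment-based two-covariance PLDA estimate. The returned (mean, transform, psi) triple mirrors the kind of quantities the VBx back-end loads from its pre-computed PLDA file, but the linked training script uses EM and its own file format, so treat this as an illustration of where those numbers come from, not a drop-in replacement. `train_plda`, `X` and `spk_ids` are hypothetical names:

```python
import numpy as np


def train_plda(X, spk_ids):
    """Moment-based two-covariance PLDA estimate (illustrative sketch).

    X:       (N, D) array of embeddings.
    spk_ids: (N,) array of speaker labels.
    Returns (mean, transform, psi) such that, in the transformed space,
    the within-class covariance is identity and the between-class
    covariance is diag(psi).
    """
    mu = X.mean(axis=0)
    Xc = X - mu

    # Within-class scatter and per-speaker means.
    D = X.shape[1]
    Sw = np.zeros((D, D))
    means = []
    for spk in np.unique(spk_ids):
        Xs = Xc[spk_ids == spk]
        m = Xs.mean(axis=0)
        means.append(m)
        Sw += (Xs - m).T @ (Xs - m)
    Sw /= len(X)

    # Between-class covariance from the speaker means.
    Sb = np.cov(np.array(means).T, bias=True)

    # Symmetric whitening of Sw, then diagonalization of the whitened Sb.
    w, V = np.linalg.eigh(Sw)
    w = np.maximum(w, 1e-10)              # guard against a rank-deficient Sw
    W = V @ np.diag(w ** -0.5) @ V.T      # W @ Sw @ W.T = I
    psi, U = np.linalg.eigh(W @ Sb @ W.T)
    transform = U.T @ W                   # diagonalizes both covariances

    # Order directions by descending between-class variance.
    order = np.argsort(-psi)
    return mu, transform[order], psi[order]
```

In the two-covariance model this is exactly the shape of information the scoring back-end needs: after subtracting `mean` and applying `transform`, same-speaker variability is spherical, and `psi` tells you how discriminative each dimension is between speakers.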

fnlandini commented 1 year ago

Closing due to inactivity. Feel free to reopen if you see fit.