Closed AdolfVonKleist closed 1 year ago
Minor update: anecdotally it appears that tweaking the thresholds on the silerovad to generate a smaller number of longer segments resolves the problem in most cases.
Hi @AdolfVonKleist, sorry for the delay. First of all, I'd like to say that the energy-based VAD shared here is just for the sake of having some simple model that could be used (since VAD is necessary with this type of diarization framework). However, I would not expect great results from it, and it makes sense to me that you obtain better results with other VADs.

Having said that, it looks to me that SileroVAD's output in your example has many short segments. The way VBx works, it takes each speech segment and extracts x-vectors from them. If the segments are too short (i.e. less than 1 second), the quality of the embeddings will not be very good; therefore, the final output can be quite bad. A worse VAD (if we only evaluate quality in terms of VAD) that produces longer segments will have, perhaps, more false alarms but will allow for better speaker embeddings, resulting in better diarization. I have not analyzed this effect in particular, but it looks like having longer speech segments allowed you to improve the performance, so it could be because of this.

If you are interested in evaluating a model in terms of VAD, you could use this script: https://github.com/BUTSpeechFIT/diarization_utils/blob/main/score_vad.py You need to pass reference and system RTTMs and it will calculate a few metrics to evaluate VAD. This might be useful to see the relationship between the VAD errors and the diarization ones, in case you were interested in analyzing that.
Federico
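For anyone who wants a feel for what VAD metrics of this kind measure, here is a minimal Python sketch. It is not the linked score_vad.py and does not read RTTMs: it takes plain lists of (start, end) times in seconds. The function names, the 10 ms frame step, and the choice to normalise both rates by the reference speech time are all assumptions made for illustration.

```python
def to_frames(segments, n_frames, step=0.01):
    """Return a list of booleans, True for every frame covered by a segment."""
    frames = [False] * n_frames
    for start, end in segments:
        first = max(0, int(round(start / step)))
        last = min(n_frames, int(round(end / step)))
        for i in range(first, last):
            frames[i] = True
    return frames


def vad_errors(reference, system, duration, step=0.01):
    """Frame-level miss and false-alarm rates of a system VAD.

    Both rates are divided by the amount of reference speech
    (an assumption of this sketch, other normalisations exist).
    """
    n_frames = int(round(duration / step))
    ref = to_frames(reference, n_frames, step)
    hyp = to_frames(system, n_frames, step)
    ref_speech = sum(ref)
    # Miss: reference says speech, system says silence.
    miss = sum(r and not h for r, h in zip(ref, hyp))
    # False alarm: system says speech, reference says silence.
    false_alarm = sum(h and not r for r, h in zip(ref, hyp))
    if ref_speech == 0:
        return {"miss": 0.0, "false_alarm": 0.0}
    return {"miss": miss / ref_speech, "false_alarm": false_alarm / ref_speech}
```

Comparing these two numbers with the diarization error for each VAD would show whether a VAD with more false alarm but longer segments really does diarize better on a given file.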
@fnlandini thanks for this detailed feedback!
I'd like to say that the energy-based VAD shared here is just for the sake of having some simple model that could be used
Definitely and it was a great starting point! That's exactly why I started looking into other potential options.
Having said that, it looks to me that SileroVAD's output in your example has many short segments. The way VBx works, it takes each speech segment and extracts x-vectors from them. If the segments are too short (i.e. less than 1 second), the quality of the embeddings will not be very good; therefore, the final output can be quite bad.
Great, this is exactly what I had hypothesized; I'm glad my observations match your expectations from the theory and implementation.
I have set up an implementation of the silerovad that allows specifying this minimal gap as a kind of 'epsilon'. It looks like setting this between 0.1s and 0.2s produces a good improvement. I'll try to share the silerovad wrapper, as I have already configured it to export in the .lab format used by VBx.
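Until the wrapper is shared, the idea can be sketched roughly as follows. This is not the actual wrapper: the function names and the default `min_gap` are hypothetical, and the .lab layout (one `start end label` line per segment, times in seconds) is an assumption about the format rather than something taken from the VBx documentation.

```python
def merge_close_segments(segments, min_gap=0.15):
    """Merge speech segments separated by a silence shorter than min_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < min_gap:
            # Gap to the previous segment is below the 'epsilon': extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def to_lab(segments, label="sp"):
    """Render segments as text, one 'start end label' line per segment (assumed layout)."""
    return "".join(f"{s:.3f} {e:.3f} {label}\n" for s, e in segments)
```

For example, `to_lab(merge_close_segments(vad_segments, min_gap=0.15))` would give the text to write to the .lab file, where `vad_segments` is whatever list of (start, end) pairs the VAD produced.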
in case you were interested in analyzing that.
It would definitely be interesting to analyze it further. I was quite surprised at first by the initial results; I had naively expected that simply improving the quality of the VAD decisions would lead to an improvement in overall diarization quality. Thanks again for the feedback.
I have been getting some great results with this library; it's especially fantastic in terms of the trade-off between accuracy and compute speed on CPU-only setups. I have a question about sensitivity to the VAD and the resulting .lab files. I thought I might improve the results a bit by switching to a more robust VAD, and tried slotting in:
in general I would say this tends to be more accurate and more fine-grained in terms of its decisions, and is considerably more robust than the vanilla energy VAD example here (although this also works quite well):
However, on some files the differences are quite extreme, and I wonder if you could provide some insight into this. For example, the following is a .lab result from a 2 min file based on the energy VAD:
Energy VAD results:
and the next is the result from silerovad using the exact same file as input: Silero VAD results:
The diarization is then carried out using the exact same command, input and configuration, with the only difference being these 2 .lab files. I would expect the diarization results to differ in some respect; however, the energy VAD produces a reasonable estimate, while the silerovad-based .lab output results in just a single segment and speaker. I would also add that the silerovad output is, in this case and IMO, more accurate, not less:
Is this sort of 'pathological' difference something that just cannot be avoided? In addition, I noticed that if I take the same file and concatenate it a few times with sox,
sox test/shorter_mono_8k.wav test/shorter_mono_8k.wav test/longer_mono_8k.wav
and then perform this same experiment on the longer file it works fine with both VADs. Or would I maybe get better results by fiddling with the silerovad thresholds to create more, longer speech segments, or maybe some other idea? Thanks for your thoughts!
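A quick way to compare two .lab files in terms of segment length is to summarise their durations and count how many segments fall under some minimum length, for example the 1 second mentioned elsewhere in this thread. The sketch below is only an illustration: the function names are hypothetical, and it assumes each .lab line starts with a start time and an end time in seconds.

```python
import statistics


def parse_lab(text):
    """Parse .lab text into (start, end) pairs, assuming 'start end [label]' lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        segments.append((float(fields[0]), float(fields[1])))
    return segments


def duration_summary(segments, min_len=1.0):
    """Summarise segment durations and count those shorter than min_len seconds."""
    durations = [end - start for start, end in segments]
    return {
        "segments": len(durations),
        "total_speech": sum(durations),
        "median": statistics.median(durations) if durations else 0.0,
        "short": sum(d < min_len for d in durations),
    }
```

Running `duration_summary(parse_lab(open(path).read()))` on the energy VAD and silerovad outputs for the same file would show whether the second one is dominated by short segments.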