Question about impact of VAD .lab file and granularity

AdolfVonKleist commented 1 year ago

I have been getting some great results with this library; it's especially fantastic in terms of the trade-off between accuracy and compute speed on CPU-only setups. I have a question about sensitivity to the VAD and the resulting .lab files. I thought I might improve the results a bit by switching to a more robust VAD, and tried slotting in:

https://github.com/snakers4/silero-vad

In general I would say this tends to be more accurate and more fine-grained in its decisions, and it is considerably more robust than the vanilla energy VAD example here (although that also works quite well):

https://github.com/BUTSpeechFIT/VBx/blob/7103b76a32dec65239d63fd8047496a71dd0602e/VAD/energy_VAD.py

However, on some files the differences are quite extreme, and I wonder if you could provide some insight into this. For example, the following is a .lab result from a 2 min file based on the energy VAD:

Energy VAD results:

1.880   2.380   speech
3.600   5.340   speech
7.170   8.720   speech
11.150  13.050  speech
13.460  20.310  speech
20.620  23.830  speech
24.240  30.330  speech
30.400  33.290  speech
33.790  34.360  speech
35.360  40.400  speech
40.870  41.340  speech
41.870  45.330  speech
45.580  50.290  speech
50.550  52.380  speech
52.690  53.280  speech
53.500  58.980  speech
59.140  66.700  speech
67.110  69.470  speech
69.610  70.450  speech
70.990  72.210  speech
72.690  75.240  speech
75.590  79.870  speech
81.120  81.570  speech
81.740  82.480  speech
82.970  84.270  speech
85.250  86.480  speech
89.880  90.420  speech
90.610  94.750  speech
95.390  96.160  speech
96.920  99.010  speech
99.480  102.230 speech
102.990 105.770 speech
105.960 106.750 speech
107.570 108.910 speech
109.320 111.450 speech
111.720 113.300 speech
113.440 114.430 speech
114.790 115.490 speech
115.590 116.090 speech
116.620 119.980 speech

And the next is the result from silerovad using the exact same file as input:

Silero VAD results:

1.536   2.368   speech
3.840   5.376   speech
5.568   6.528   speech
6.528   6.848   speech
7.360   8.704   speech
11.264  13.056  speech
13.568  16.576  speech
16.768  17.472  speech
17.536  19.456  speech
19.520  20.352  speech
20.800  22.656  speech
22.656  23.872  speech
24.448  26.432  speech
26.624  27.712  speech
27.968  29.952  speech
29.952  30.272  speech
30.592  33.280  speech
33.792  33.920  speech
33.984  34.368  speech
35.520  37.568  speech
37.760  39.040  speech
39.104  40.448  speech
40.576  41.536  speech
42.048  42.496  speech
42.560  45.312  speech
45.760  47.936  speech
48.128  49.280  speech
49.344  50.432  speech
50.752  52.352  speech
52.544  52.736  speech
52.928  53.248  speech
53.696  54.080  speech
54.208  56.640  speech
56.640  58.048  speech
58.112  58.944  speech
59.328  60.928  speech
60.928  61.824  speech
61.824  62.464  speech
62.528  63.744  speech
63.808  64.512  speech
64.512  66.880  speech
67.328  68.032  speech
68.032  68.672  speech
68.800  69.376  speech
69.824  70.400  speech
71.168  71.808  speech
71.872  72.256  speech
72.832  73.408  speech
73.600  74.240  speech
74.304  75.200  speech
75.776  76.928  speech
76.928  79.808  speech
81.280  82.432  speech
83.136  84.288  speech
85.440  86.464  speech
89.856  90.624  speech
90.624  92.288  speech
92.544  93.120  speech
93.120  94.848  speech
95.616  96.128  speech
97.152  99.008  speech
99.648  100.096 speech
100.224 100.608 speech
100.800 102.208 speech
103.232 104.512 speech
104.512 105.984 speech
106.048 106.816 speech
107.712 108.288 speech
108.416 108.864 speech
109.312 110.336 speech
110.336 110.784 speech
110.848 111.424 speech
111.936 113.216 speech
113.664 114.432 speech
114.560 116.288 speech
116.800 119.980 speech

The diarization is then carried out using the exact same command, input, and configuration, with the only difference being these two .lab files. I would expect the diarization results to differ in some respect; however, the energy VAD produces a reasonable estimate, while the silerovad-based .lab output results in just a single segment and speaker. I would also add that the silerovad output is, in this case and IMO, more accurate, not less:

{
  "filename": "test/shorter_mono_8k.wav",
  "diarization": [
    [
      {
        "start": "1.54",
        "end": "120.00",
        "spkr": "2"
      }
    ]
  ]
}

Is this sort of 'pathological' difference something that just cannot be avoided? In addition, I noticed that if I take the same file and concatenate it a few times with sox,

sox test/shorter_mono_8k.wav test/shorter_mono_8k.wav test/longer_mono_8k.wav

and then perform this same experiment on the longer file, it works fine with both VADs. Or would I maybe get better results by fiddling with the silerovad thresholds to create fewer, longer speech segments, or maybe some other idea? Thanks for your thoughts!

AdolfVonKleist commented 1 year ago

Minor update: anecdotally it appears that tweaking the thresholds on the silerovad to generate a smaller number of longer segments resolves the problem in most cases.

fnlandini commented 1 year ago

Hi @AdolfVonKleist, sorry for the delay.

First of all, I'd like to say that the energy-based VAD shared here is just for the sake of having some simple model that could be used (since VAD is necessary with this type of diarization framework). However, I would not expect great results and it makes sense to me that you obtain better results with other VADs.

Having said that, it looks to me that SileroVAD's output in your example has many short segments. The way VBx works, it takes each speech segment and extracts x-vectors from them. If the segments are too short (i.e. less than 1 second), the quality of the embeddings will not be very good; therefore, the final output can be quite bad. A worse VAD (if we only evaluate the quality in terms of VAD) that produces longer segments will have, perhaps, more false alarms but allow for better speaker embeddings, resulting in better diarization. I have not analyzed this effect in particular, but it looks like having longer speech segments allowed you to improve the performance, so it could be because of this.

If you are interested in evaluating a model in terms of VAD, you could use this script:

https://github.com/BUTSpeechFIT/diarization_utils/blob/main/score_vad.py

You need to pass reference and system RTTMs and it will calculate a few metrics to evaluate VAD. This might be useful to see the relationship between the VAD errors and the diarization ones, in case you were interested in analyzing that.

Federico
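The short-segment explanation above can be checked directly against a .lab file. The following is a minimal sketch, not part of VBx: the function names `parse_lab` and `short_segment_stats` are hypothetical, and it assumes the whitespace-separated `start end label` layout shown in the listings in this thread. The 1-second cutoff is the rough figure mentioned in the comment above, not a hard limit of the toolkit.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def parse_lab(text: str) -> List[Segment]:
    """Parse 'start end label' lines into (start, end) pairs."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        segments.append((float(parts[0]), float(parts[1])))
    return segments


def short_segment_stats(segments: List[Segment], min_dur: float = 1.0) -> Dict[str, float]:
    """Count segments shorter than min_dur and the speech time they cover."""
    durations = [end - start for start, end in segments]
    short = [d for d in durations if d < min_dur]
    return {
        "n_segments": len(durations),
        "n_short": len(short),
        "total_speech": sum(durations),
        "short_speech": sum(short),
    }


# First four lines of the Silero VAD listing from this thread.
sample = """1.536   2.368   speech
3.840   5.376   speech
5.568   6.528   speech
6.528   6.848   speech"""

print(short_segment_stats(parse_lab(sample)))
```

Running this on both .lab files gives a quick view of how much of the detected speech falls into segments that are likely too short for reliable embeddings.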

AdolfVonKleist commented 1 year ago

@fnlandini thanks for this detailed feedback!

> I'd like to say that the energy-based VAD shared here is just for the sake of having some simple model that could be used

Definitely and it was a great starting point! That's exactly why I started looking into other potential options.

> Having said that, it looks to me that SileroVAD's output in your example has many short segments. The way VBx works, it takes each speech segment and extracts x-vectors from them. If the segments are too short (i.e. less than 1 second), the quality of the embeddings will not be very good; therefore, the final output can be quite bad.

Great, this is exactly what my hypothesis was. I'm glad my observations match your expectation from the theory and implementation.

I have set up an implementation of the silerovad that allows specifying this minimal gap as a kind of 'epsilon'. It looks like setting this between 0.1s and 0.2s produces a good improvement. I'll try to share the silerovad wrapper, as I have already configured it to export the .lab format used by VBx.
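As an illustration only (this is not the wrapper mentioned above, which is not shown in this thread), one way to read the 'epsilon' idea is as a post-processing step: merge any two neighbouring speech segments whose gap is at most `max_gap` seconds, then write the result back out in the `start end speech` layout. The names `merge_close_segments` and `to_lab` are hypothetical, and the tab separator is an assumption about the .lab layout.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def merge_close_segments(segments: List[Segment], max_gap: float = 0.15) -> List[Segment]:
    """Merge segments separated by a silence of at most max_gap seconds."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is small enough: extend the previous segment.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def to_lab(segments: List[Segment]) -> str:
    """Format segments as 'start<TAB>end<TAB>speech' lines (assumed layout)."""
    return "\n".join(f"{s:.3f}\t{e:.3f}\tspeech" for s, e in segments)


# A few consecutive segments taken from the Silero VAD listing in this thread.
silero = [(5.568, 6.528), (6.528, 6.848), (7.360, 8.704)]
print(to_lab(merge_close_segments(silero, max_gap=0.15)))
```

A larger `max_gap` trades some false alarms (short silences labelled as speech) for longer segments, which is the trade-off described earlier in the thread.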

> in case you were interested in analyzing that.

It would definitely be interesting to analyze it further. I was quite surprised at first by the initial results; I had naively expected that simply improving the quality of the VAD decisions would lead to an improvement in overall diarization quality. Thanks again for the feedback.

BUTSpeechFIT / VBx

Question about impact of VAD .lab file and granularity #60