Open zx3777 opened 4 months ago
This is the audio file for the video above. https://mega.nz/file/QacS2LCJ#x_Gq9GgV8aPk2qRVskfzNBuyM9XAI-Pv2SBIwxfomnk
I agree. I'm also seeing worse performance, just not as severe, and the overall WER for non-English speech is going down. Go back to Silero, or at least let us choose the VAD model.
The 1.0.3 release still uses Silero, but with an upgraded version. The WER may be going down because the VAD only keeps sufficiently clear speech.
@zx3777 that would cause a higher WER; a missing word still counts as an error. You should try playing with the VAD settings and see whether it makes a difference: the model was changed, but the parameters are still tuned for the previous one.
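For anyone wanting to experiment, a minimal sketch of tuning the VAD settings is below. The parameter names match the fields of faster_whisper.vad.VadOptions in the 1.0.x releases; newer versions may rename some of them (e.g. onset/offset), so check your installed version. The specific values are illustrative, not recommendations:

```python
# Sketch: looser VAD settings to keep quieter speech (values are examples).
vad_parameters = dict(
    threshold=0.3,               # lower onset threshold -> keep quieter speech
    min_silence_duration_ms=500, # require longer silence before cutting
    speech_pad_ms=400,           # pad each detected speech chunk
)

# Typical usage (requires faster-whisper installed):
# from faster_whisper import WhisperModel
# model = WhisperModel("base")
# segments, info = model.transcribe(
#     "audio.wav", vad_filter=True, vad_parameters=vad_parameters)
print(vad_parameters)
```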
Useless
I tried --vad_threshold 0.4, 0.3, and 0.2 in 1.0.3, and there was a slight improvement, but far fewer subtitles are recognized than in 1.0.2.
Hi, could you try again with the master branch and let me know the results?
I will run the tests on our audio corpora with different parameters, but it won't be quick.
I tested the master branch version from before the upgrade in [New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements], and the results were the same.
In my opinion, after the new PR only the batched version uses a different VAD implementation. The normal version still uses the VAD from 1.0.3, so the results should be the same.
Thanks for the test @zx3777. I suspect this is an issue with the model itself. There hasn't been enough quantitative evaluation of silero-vad v5, but at the least we can make it possible for users to choose silero-vad v4 instead of v5 based on their needs.
I'll open a PR after the issues related to this discussion are well finalized.
I already wrote the code, but I'm waiting for #936 to be merged so we can discuss having both or just reverting to v4.
Just chiming in and adding a case where old (not sure if it's v3 or v4) version outperforms v5: https://drive.google.com/file/d/1NPvEybP0VU1dFmd6neH6JJRW_Qm2MXdk/view?usp=sharing
code:
from pprint import pprint

from faster_whisper.audio import decode_audio
from faster_whisper.vad import get_speech_timestamps

# decode_audio returns 16 kHz mono samples; get_speech_timestamps is run
# here with the default VadOptions.
speech_chunks = get_speech_timestamps(decode_audio('ja_example.wav'))
pprint(speech_chunks)
old:
[{'end': 40192, 'start': 12032},
{'end': 179456, 'start': 76544},
{'end': 379136, 'start': 273152},
{'end': 457984, 'start': 422656},
{'end': 630016, 'start': 576256},
{'end': 669952, 'start': 653056},
{'end': 863488, 'start': 695040},
{'end': 950528, 'start': 896768}]
v5:
[{'end': 30464, 'start': 12032}]
Apparently cartoony voices are ignored in v5.
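The start/end values above are sample indices at faster-whisper's 16 kHz decoding rate, so the gap between the two models can be quantified in seconds. A minimal sketch, with the chunk values copied from the outputs above:

```python
# Convert VAD chunks (sample indices at 16 kHz) to total speech seconds.
SAMPLE_RATE = 16000

old_chunks = [
    {'start': 12032, 'end': 40192}, {'start': 76544, 'end': 179456},
    {'start': 273152, 'end': 379136}, {'start': 422656, 'end': 457984},
    {'start': 576256, 'end': 630016}, {'start': 653056, 'end': 669952},
    {'start': 695040, 'end': 863488}, {'start': 896768, 'end': 950528},
]
v5_chunks = [{'start': 12032, 'end': 30464}]

def total_speech_seconds(chunks, sample_rate=SAMPLE_RATE):
    """Sum the chunk lengths and convert samples to seconds."""
    return sum(c['end'] - c['start'] for c in chunks) / sample_rate

print(total_speech_seconds(old_chunks))  # 35.328
print(total_speech_seconds(v5_chunks))   # 1.152
```

So the old model keeps about 35 seconds of this clip as speech while v5 keeps barely over one second.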
Hi @MahmoudAshraf97 , since the PR is merged, is it time to have this discussion?
Since I'm the maintainer now, I guess we should stick to v5, even though it might introduce some edge cases. Unless there are solid benchmarks on how the different Silero versions affect WER, I'd vote for including v5 only, with users having the option to revert to v4 by modifying the code manually.
I conducted some benchmarks and wanted to share my findings regarding the VAD performance in faster-whisper v1.0.3. From my observations, the VAD in v1.0.3 (silero-vad v5) is significantly more aggressive compared to v1.0.2 (silero-vad v4). Below are the results for the exact same set of audio files, using the same model:
faster-whisper 1.0.2 (using base.en model on wav files):
Total duration: 20:00:44.07
Total duration after VAD: 09:10:05.27
Average recording duration: 00:05:23.07
Average recording duration after VAD: 00:02:28.01
Average VAD reduction percentage: 52.92 %

faster-whisper 1.0.3 (same model, same files):
Average VAD reduction percentage: 76.47 %
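As a sanity check, the aggregate reduction can be recomputed from the reported totals; note it will differ from a per-file average, since longer files weigh more in the aggregate. A minimal sketch:

```python
def hms_to_seconds(ts: str) -> float:
    """Parse an 'HH:MM:SS.ss' duration string into seconds."""
    h, m, s = ts.split(':')
    return int(h) * 3600 + int(m) * 60 + float(s)

total = hms_to_seconds('20:00:44.07')  # total duration before VAD
after = hms_to_seconds('09:10:05.27')  # total duration after VAD

reduction_pct = 100 * (1 - after / total)
print(f'{reduction_pct:.2f} %')  # aggregate reduction, ~54.19 %
```

The aggregate reduction (~54 %) is close to, but not the same as, the reported per-file average of 52.92 %, which is expected.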
In addition to this analysis, we’ve used faster-whisper extensively for a project where we transcribed over 1000 hours of audio. I would be happy to share more detailed results with you in a call and even demonstrate our UI to showcase the stark difference in the number of interactions generated between v1.0.2 and v1.0.3.
Overall, v1.0.2 is performing much better than v1.0.3 in terms of balancing effective VAD and retaining useful audio. In some cases, v1.0.3 reduced recordings to 0 seconds after VAD, whereas v1.0.2 preserved over a minute of audio for the same files.
I hope this feedback helps improve future releases. Let me know if you’d like to discuss this further or review additional data.
Thanks for the data. Have you tried tuning the parameters to see if it makes any difference? Mainly onset and offset. Also, duration after VAD isn't a useful metric on its own; WER would be much more helpful, as we need to see the effect of the VAD on the final result.
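For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch (the toy sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# A word dropped by over-aggressive VAD counts as a deletion, raising WER:
print(wer('the cat sat on the mat', 'the cat on the mat'))  # 1/6 ~ 0.167
```

This is why shorter duration after VAD cannot be read as better or worse by itself: every clipped word shows up as a deletion error in the WER.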
Large portions of the speech are missing.
Some files produce subtitle files of 10 kB with version 1.0.2, but under 1 kB with version 1.0.3.
This video file https://www.youtube.com/watch?v=tVLOBfzbJV8 resulted in 320 lines of subtitles with version 1.0.2, but only 218 lines with version 1.0.3. Many conversations were not recognized in version 1.0.3.
I have only compared Korean; other languages have not been tested yet.