Open zx3777 opened 4 months ago
This is the audio file for the video above. https://mega.nz/file/QacS2LCJ#x_Gq9GgV8aPk2qRVskfzNBuyM9XAI-Pv2SBIwxfomnk
I agree. I'm also seeing worse performance, just not as severe, and the overall WER for non-English speech is going down. Go back to Silero, or at least let us choose the VAD model.
The 1.0.3 release still uses Silero, but with an upgraded version. The WER may be going down because the VAD only keeps sufficiently clear speech.
@zx3777 that would cause a higher WER; a missing word still counts as an error. You should try playing with the VAD settings and see whether it makes a difference: the model was changed, but the parameters are still tuned for the previous one.
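For anyone wanting to experiment, a minimal sketch of tuning the VAD settings is below. The parameter names match the fields of faster_whisper.vad.VadOptions in the 1.0.x releases; newer versions may rename some of them (e.g. onset/offset), so check your installed version. The specific values are illustrative, not recommendations:

```python
# Sketch: looser VAD settings to keep quieter speech (values are examples).
vad_parameters = dict(
    threshold=0.3,               # lower onset threshold -> keep quieter speech
    min_silence_duration_ms=500, # require longer silence before cutting
    speech_pad_ms=400,           # pad each detected speech chunk
)

# Typical usage (requires faster-whisper installed):
# from faster_whisper import WhisperModel
# model = WhisperModel("base")
# segments, info = model.transcribe(
#     "audio.wav", vad_filter=True, vad_parameters=vad_parameters)
print(vad_parameters)
```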
Useless
I tried --vad_threshold 0.4, 0.3, and 0.2 in 1.0.3, and there was a slight improvement, but far fewer subtitles are recognized than in 1.0.2.
Hi, could you try again with the master branch and let me know the results?
I will run the tests on our audio corpora with different parameters, but it won't be quick.
I tested the master branch version from before the upgrade in [New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements], and the results were the same.
In my opinion, after the new PR only the batched version uses a different VAD implementation. The normal version still uses the VAD from 1.0.3, so the results should be the same.
Thanks for the test @zx3777. I suspect this is an issue with the model itself. There hasn't been enough quantitative evaluation of silero-vad v5, but at the least we can make it possible for users to choose silero-vad v4 instead of v5 based on their needs.
I'll open a PR after the issues related to this discussion are well finalized.
I already wrote the code, but I'm waiting for #936 to be merged so we can discuss having both or just reverting to v4.
Just chiming in and adding a case where old (not sure if it's v3 or v4) version outperforms v5: https://drive.google.com/file/d/1NPvEybP0VU1dFmd6neH6JJRW_Qm2MXdk/view?usp=sharing
code:
from pprint import pprint

from faster_whisper.audio import decode_audio
from faster_whisper.vad import get_speech_timestamps

# decode_audio returns 16 kHz mono samples; get_speech_timestamps is run
# here with the default VadOptions.
speech_chunks = get_speech_timestamps(decode_audio('ja_example.wav'))
pprint(speech_chunks)
old:
[{'end': 40192, 'start': 12032},
{'end': 179456, 'start': 76544},
{'end': 379136, 'start': 273152},
{'end': 457984, 'start': 422656},
{'end': 630016, 'start': 576256},
{'end': 669952, 'start': 653056},
{'end': 863488, 'start': 695040},
{'end': 950528, 'start': 896768}]
v5:
[{'end': 30464, 'start': 12032}]
Apparently cartoony voices are ignored in v5.
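The start/end values above are sample indices at faster-whisper's 16 kHz decoding rate, so the gap between the two models can be quantified in seconds. A minimal sketch, with the chunk values copied from the outputs above:

```python
# Convert VAD chunks (sample indices at 16 kHz) to total speech seconds.
SAMPLE_RATE = 16000

old_chunks = [
    {'start': 12032, 'end': 40192}, {'start': 76544, 'end': 179456},
    {'start': 273152, 'end': 379136}, {'start': 422656, 'end': 457984},
    {'start': 576256, 'end': 630016}, {'start': 653056, 'end': 669952},
    {'start': 695040, 'end': 863488}, {'start': 896768, 'end': 950528},
]
v5_chunks = [{'start': 12032, 'end': 30464}]

def total_speech_seconds(chunks, sample_rate=SAMPLE_RATE):
    """Sum the chunk lengths and convert samples to seconds."""
    return sum(c['end'] - c['start'] for c in chunks) / sample_rate

print(total_speech_seconds(old_chunks))  # 35.328
print(total_speech_seconds(v5_chunks))   # 1.152
```

So the old model keeps about 35 seconds of this clip as speech while v5 keeps barely over one second.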
Hi @MahmoudAshraf97 , since the PR is merged, is it time to have this discussion?
Since I'm the maintainer now, I guess we should stick to v5, even though it might introduce some edge cases. Unless there are solid benchmarks on how the different Silero versions affect WER, I'd vote for including v5 only, with users having the option to revert to v4 by modifying the code manually.
I conducted some benchmarks and wanted to share my findings regarding the VAD performance in faster-whisper v1.0.3. From my observations, the VAD in v1.0.3 (silero-vad v5) is significantly more aggressive compared to v1.0.2 (silero-vad v4). Below are the results for the exact same set of audio files, using the same model:
faster-whisper 1.0.2 (using base.en model on wav files):
Total duration: 20:00:44.07
Total duration after VAD: 09:10:05.27
Average recording duration: 00:05:23.07
Average recording duration after VAD: 00:02:28.01
Average VAD reduction percentage: 52.92 %

faster-whisper 1.0.3 (same model, same files):
Average VAD reduction percentage: 76.47 %
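As a sanity check, the aggregate reduction can be recomputed from the reported totals; note it will differ from a per-file average, since longer files weigh more in the aggregate. A minimal sketch:

```python
def hms_to_seconds(ts: str) -> float:
    """Parse an 'HH:MM:SS.ss' duration string into seconds."""
    h, m, s = ts.split(':')
    return int(h) * 3600 + int(m) * 60 + float(s)

total = hms_to_seconds('20:00:44.07')  # total duration before VAD
after = hms_to_seconds('09:10:05.27')  # total duration after VAD

reduction_pct = 100 * (1 - after / total)
print(f'{reduction_pct:.2f} %')  # aggregate reduction, ~54.19 %
```

The aggregate reduction (~54 %) is close to, but not the same as, the reported per-file average of 52.92 %, which is expected.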
In addition to this analysis, we’ve used faster-whisper extensively for a project where we transcribed over 1000 hours of audio. I would be happy to share more detailed results with you in a call and even demonstrate our UI to showcase the stark difference in the number of interactions generated between v1.0.2 and v1.0.3.
Overall, v1.0.2 is performing much better than v1.0.3 in terms of balancing effective VAD and retaining useful audio. In some cases, v1.0.3 reduced recordings to 0 seconds after VAD, whereas v1.0.2 preserved over a minute of audio for the same files.
I hope this feedback helps improve future releases. Let me know if you’d like to discuss this further or review additional data.
Thanks for the data. Have you tried tuning the parameters to see if it makes any difference? Mainly onset and offset. Also, duration after VAD isn't a useful metric on its own; WER would be much more helpful, as we need to see the effect of the VAD on the final result.
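For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch (the toy sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# A word dropped by over-aggressive VAD counts as a deletion, raising WER:
print(wer('the cat sat on the mat', 'the cat on the mat'))  # 1/6 ~ 0.167
```

This is why shorter duration after VAD cannot be read as better or worse by itself: every clipped word shows up as a deletion error in the WER.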
Large portions of the speech are missing.
Some files produce subtitle files of 10 kB with version 1.0.2, but under 1 kB with version 1.0.3.
This video file https://www.youtube.com/watch?v=tVLOBfzbJV8 resulted in 320 lines of subtitles with version 1.0.2, but only 218 lines with version 1.0.3. Many conversations were not recognized in version 1.0.3.
I have only compared Korean; other languages have not been tested yet.