After using VAD, the start and end times of the recognized segments are incorrect

zlyMaster commented 2 weeks ago

from faster_whisper import WhisperModel

path = r"D:\Project\Python_Project\FasterWhisper\large-v3"

model = WhisperModel(model_size_or_path=path, device="cuda", local_files_only=True)

segments, info = model.transcribe("audio.wav", beam_size=5, language="zh", vad_filter=True, vad_parameters=dict(min_silence_duration_ms=1000))

When I transcribe an audio file with VAD enabled, the segment 'start' and 'end' times are incorrect. Without VAD, the start and end of each segment match, or closely match, when the speech actually occurs in the audio. For example (times converted to hh:mm:ss.ms):

segment1.start=00:02:45.111
segment1.end=**00:02:46.333**
segment1.text=AAAAAAA

segment2.start=**00:02:51.222**
segment2.end=00:02:59.444
segment2.text=BBBBBB

With VAD enabled, however, a segment's end time is no longer the time when the speech ends; it is equal to the start time of the next segment. For example (times converted to hh:mm:ss.ms):

segment1.start=00:02:45.111
segment1.end=**00:02:51.222**
segment1.text=AAAAAAA

segment2.start=**00:02:51.222**
segment2.end=00:02:59.444
segment2.text=BBBBBB

The end time of segment 1 has changed to the start time of segment 2, which I think is a bug. To verify this, I ran Silero VAD on its own on the same audio file (with the same VAD parameters), and the start and end times it reported were close to when speech actually occurs in the audio.
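Since the standalone VAD intervals look right, one possible workaround is to shrink each transcribed segment to the speech intervals it overlaps. The following is only a sketch under assumptions: `Seg`, `clamp_to_speech`, and `min_overlap` are hypothetical names, not part of faster-whisper, and the speech intervals are assumed to already be `(start, end)` pairs in seconds on the original audio timeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Seg:
    """Hypothetical stand-in for a transcribed segment (times in seconds)."""
    start: float
    end: float
    text: str


def clamp_to_speech(
    segments: List[Seg],
    speech: List[Tuple[float, float]],
    min_overlap: float = 0.1,
) -> List[Seg]:
    """Shrink each segment to the span of the speech intervals it overlaps.

    An interval counts as overlapping only if it shares at least
    `min_overlap` seconds with the segment, so an interval that merely
    touches the segment boundary is ignored.
    """
    out = []
    for seg in segments:
        hits = [
            (s, e)
            for s, e in speech
            if min(e, seg.end) - max(s, seg.start) >= min_overlap
        ]
        if not hits:
            # No usable VAD evidence: keep the segment unchanged.
            out.append(seg)
            continue
        new_start = max(seg.start, min(s for s, _ in hits))
        new_end = min(seg.end, max(e for _, e in hits))
        out.append(Seg(new_start, new_end, seg.text))
    return out


# Times from the example above, expressed in seconds.
segments = [Seg(165.111, 171.222, "AAAAAAA"), Seg(171.222, 179.444, "BBBBBB")]
speech = [(165.0, 166.333), (171.2, 179.5)]  # assumed standalone VAD output
fixed = clamp_to_speech(segments, speech)
```

This only trims silence at the edges of a segment; it cannot repair a segment whose text is attributed to the wrong stretch of speech.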

zlyMaster commented 2 weeks ago

I use faster-whisper to generate movie subtitles, so timing accuracy is very important. Otherwise, it affects how the subtitles are displayed.
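For context, this is roughly how segment times in seconds end up as subtitle cues. It is a minimal sketch that assumes the SRT format (`hh:mm:ss,mmm` timestamps); `srt_timestamp` and `srt_cue` are hypothetical helper names, not part of faster-whisper.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (hh:mm:ss,mmm)."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index line, time range line, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


cue = srt_cue(1, 165.111, 166.333, "AAAAAAA")
```

If the end time is stretched to the next segment's start, the subtitle stays on screen through the silence, which is the display problem described here.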

MahmoudAshraf97 commented 2 weeks ago

Enable word timestamps for better timing accuracy, or use forced alignment for even better timings. This is not a VAD problem, though, because Whisper's segment timing is not accurate in the first place.
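A rough sketch of the word-timestamp suggestion: with `word_timestamps=True` passed to `transcribe`, each segment carries a `words` list whose items have `start` and `end`, so the segment boundaries can be taken from the first and last word. The helper name `tighten_to_words` is hypothetical, and the demo uses stand-in objects instead of a real model so it runs without a GPU.

```python
from types import SimpleNamespace


def tighten_to_words(segment):
    """Return (start, end) from the first and last word of a segment.

    Falls back to the segment's own times when no word timestamps
    are attached.
    """
    words = getattr(segment, "words", None)
    if not words:
        return segment.start, segment.end
    return words[0].start, words[-1].end


# Stand-in for a segment produced with word_timestamps=True (values made up).
demo = SimpleNamespace(
    start=165.111,
    end=171.222,
    words=[
        SimpleNamespace(start=165.2, end=165.8, word="AAA"),
        SimpleNamespace(start=165.8, end=166.3, word="AAAA"),
    ],
)
start, end = tighten_to_words(demo)

# With a real model the call would look like:
# segments, info = model.transcribe("audio.wav", language="zh",
#                                   vad_filter=True, word_timestamps=True)
# for seg in segments:
#     start, end = tighten_to_words(seg)
```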

Genesis1231 commented 1 week ago

I had a lot of trouble with VAD too. There are a few vad_parameters you can try; look at vad.py in the package to learn more about them. I remember there are settings for the beginning and ending, and you can also try the threshold parameter.

But because every audio clip is different, there is no one-size-fits-all setting. If you want absolute accuracy, you might have to write your own VAD module.
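As a starting point for experimenting, the options could be passed like this. The parameter names are the ones I believe `VadOptions` in vad.py accepts, but check your installed version; the values are purely illustrative, not recommendations.

```python
# Illustrative values only; tune per audio file.
vad_parameters = dict(
    threshold=0.5,                 # speech probability above this counts as speech
    min_speech_duration_ms=250,    # speech chunks shorter than this are dropped
    min_silence_duration_ms=1000,  # silence needed before a speech chunk is closed
    speech_pad_ms=200,             # padding added on each side of a speech chunk
)

# Passed to the model as in the original report:
# segments, info = model.transcribe("audio.wav", language="zh",
#                                   vad_filter=True, vad_parameters=vad_parameters)
```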

SYSTRAN / faster-whisper

After using VAD, the start and end times of the recognized segments are incorrect #1119