SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Enabling vad_filter causes timestamp mismatch #230

Open iorilu opened 1 year ago

iorilu commented 1 year ago

I tested some videos.

If the silence duration is long, enabling vad_filter is effective.

But for ordinary videos, enabling vad_filter may cause more timestamp mismatches.

Is there a good solution that applies to all videos (and audio files), so that timestamps are as accurate as possible?

guillaumekln commented 1 year ago

Is it possible for you to share the video/audio where you see the issue?

iorilu commented 1 year ago

It's not a public video, so I can't share it right now.

But I may try some YouTube videos later and see whether a similar issue happens.

albert-id commented 1 year ago

Same thing. With vad_filter=True, word timestamps in some segments are stretched out over the whole speech segment duration.
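One way to check for this symptom is to scan the word-level output for words whose duration is implausibly long. This is only a minimal sketch: it assumes the segments come from `model.transcribe(..., word_timestamps=True)`, so that each segment carries a `words` list whose items have `start`, `end`, and `word`. The helper name `find_stretched_words` and the 2-second threshold are illustrative assumptions, not part of faster-whisper.

```python
def find_stretched_words(segments, max_word_s=2.0):
    """Return (word, start, end) tuples for words lasting longer than max_word_s.

    `segments` is any iterable of objects with a `words` attribute, where each
    word has `word`, `start`, and `end` (times in seconds).
    """
    stretched = []
    for segment in segments:
        # `words` may be None when word timestamps were not requested.
        for word in (segment.words or []):
            if word.end - word.start > max_word_s:
                stretched.append((word.word, word.start, word.end))
    return stretched
```

Running this once with vad_filter=True and once with vad_filter=False on the same file should show whether the stretched words only appear when the filter is on.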

phineas-pta commented 1 year ago

Here is an example: https://www.youtube.com/watch?v=B0kAq2HxdmE

transcription command:

import faster_whisper
model = faster_whisper.WhisperModel("large-v2", device="cuda")

segments, info = model.transcribe(
    "audio.webm", language="vi", vad_filter=True,
    vad_parameters={"max_speech_duration_s": 15}
)

... # to .srt

result:

image

liwangd commented 10 months ago

I am seeing the same issue. If the audio starts with music (say, for 40s), the start time of the first segment often includes the music (e.g. it starts from 0s rather than 41s). I also noticed that enabling word_timestamps alleviates this problem.
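The word_timestamps workaround can be sketched as follows: request word timestamps and take each segment's boundaries from its first and last word. This is a hedged sketch, not a confirmed fix. `tighten_segment` and `transcribe_tight` are hypothetical helper names, and faster-whisper may already adjust segment boundaries internally when word_timestamps is enabled, in which case the helper only makes that step explicit.

```python
def tighten_segment(segment):
    """Return (start, end) taken from word timestamps when they are available.

    Falls back to the segment's own start/end if it has no words.
    """
    words = getattr(segment, "words", None)
    if words:
        return words[0].start, words[-1].end
    return segment.start, segment.end


def transcribe_tight(path, model_name="large-v2", device="cuda"):
    """Transcribe with VAD and word timestamps; return (start, end, text) tuples."""
    # Imported here so tighten_segment can be used without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device=device)
    segments, _info = model.transcribe(path, vad_filter=True, word_timestamps=True)
    return [(*tighten_segment(s), s.text) for s in segments]
```

For audio that opens with about 40s of music, the first tuple returned by `transcribe_tight("audio.webm")` would then be expected to start near the first spoken word rather than at 0s, assuming the word alignment itself is accurate.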

jet082 commented 1 month ago

Came here to confirm the same problem. word_timestamps does seem to help though.