Open iorilu opened 1 year ago
Is it possible for you to share the video/audio where you see the issue?
It's not a public video, so I can't share it right now.
I may try some YouTube videos later and see if a similar issue occurs.
Same thing. With vad_filter=True, word timestamps in some segments are stretched out to cover the entire speech segment duration.
Here is an example: https://www.youtube.com/watch?v=B0kAq2HxdmE

Transcription command:

```python
import faster_whisper

model = faster_whisper.WhisperModel("large-v2", device="cuda")
segments, info = model.transcribe(
    "audio.webm", language="vi", vad_filter=True,
    vad_parameters={"max_speech_duration_s": 15},
)
...  # to .srt
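For reference, the elided `# to .srt` step can be done with a small helper along these lines. This is my own sketch, not part of faster-whisper; it only assumes each segment exposes `.start`, `.end`, and `.text`, which matches the segments the library yields:

```python
def fmt_ts(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render an iterable of objects with .start/.end/.text as SRT text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_ts(seg.start)} --> {fmt_ts(seg.end)}\n{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

Writing the returned string to a `.srt` file gives subtitles with whatever timestamps the segments carry, which is what makes the stretched timestamps above visible in players.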
result:
I am seeing the same issue. If the audio starts with music (say, for 40s), the start time of the first segment often includes the music (e.g. it starts from 0s rather than ~41s). I also noticed that enabling word_timestamps alleviates this problem.
Came here to confirm the same problem. word_timestamps does seem to help, though.
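One plausible reason word_timestamps helps: each word carries its own start/end, so a segment's boundaries can be tightened to its first and last word instead of the (possibly padded) VAD segment. A rough sketch of that post-processing, using hypothetical `(word_start, word_end)` pairs rather than faster-whisper's actual word objects:

```python
def tighten_segment(start, end, words):
    """Clamp a segment's boundaries to its word-level timestamps.

    `words` is a list of (word_start, word_end) pairs in order; if it is
    empty, the original boundaries are returned unchanged.
    """
    if not words:
        return start, end
    return max(start, words[0][0]), min(end, words[-1][1])
```

With the 40s-of-music example above, a segment reported as 0s–45s whose first word actually starts at ~41s would be clamped to start at ~41s.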
I tested some videos.
If the silence duration is long, enabling vad_filter is effective,
but on an ordinary video, enabling vad_filter may cause more timestamp mismatches.
Is there a good approach that works for all videos (audios) and keeps the timestamps as accurate as possible?