min_silence_duration_ms is not working, Silence detection not working

SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

min_silence_duration_ms is not working, Silence detection not working #1108

Closed andriken closed 1 week ago

andriken commented 3 weeks ago

See my simple code below. Why does the segment contain more than about 1 second of silence? Even if I set "min_silence_duration_ms" to 400 or less, the result is the same; it has no effect.

from faster_whisper import WhisperModel

model_size = "kotoba-tech/kotoba-whisper-v2.0-faster"

model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, _ = model.transcribe(
    "ceremony.mp4",
    task="translate",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=200),
)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

[0.72s -> 6.28s] After that, I talked a lot with my mother about the past three years.

MahmoudAshraf97 commented 3 weeks ago

I don't see exactly what the problem is in what you're showing. How can you tell that it's not working?

andriken commented 2 weeks ago

Maybe I'm using the wrong parameter for my purpose. I'm sorry, but there is 1 second of silence within this segment, so how come the segment still covers the silent part too? I tried large-v2 as well and got the same result.

MahmoudAshraf97 commented 2 weeks ago

Silence is removed and the speech segments are concatenated together. The timestamps are then restored to their original positions from before silence removal. You should not notice anything except better transcription quality, but the segments are not split at silence.
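
To illustrate why a segment can still span a silent gap, here is a minimal sketch of the timestamp restoration idea. It is not faster-whisper's actual code: the function name `restore_time` and the chunk timings are assumptions made up for the example. Speech chunks are glued together, the model produces timestamps on the shortened audio, and each timestamp is then mapped back onto the original timeline.

```python
def restore_time(t, speech_chunks):
    """Map a time t (seconds) in the silence-removed audio back to the original audio.

    speech_chunks is a list of (start, end) pairs in original-audio seconds,
    sorted by start time. Hypothetical helper, for illustration only.
    """
    elapsed = 0.0  # how much concatenated speech has been consumed so far
    for start, end in speech_chunks:
        duration = end - start
        if t <= elapsed + duration:
            # t falls inside this chunk: offset from the chunk's original start
            return start + (t - elapsed)
        elapsed += duration
    # t is past the end of all speech: clamp to the end of the last chunk
    return speech_chunks[-1][1]


# Made-up example: speech at 1.0-3.0s and 5.0-6.0s, with 2s of silence between.
chunks = [(1.0, 3.0), (5.0, 6.0)]

# A segment covering 0.5s..2.5s of the concatenated audio crosses the chunk boundary,
# so after restoration it runs from 1.5s to 5.5s and includes the silent gap.
segment_start = restore_time(0.5, chunks)
segment_end = restore_time(2.5, chunks)
```

In this sketch the silence never reaches the model, but a segment whose words come from both chunks still gets one start and one end time, so the restored span covers the gap.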

andriken commented 2 weeks ago

But what if I don't want it to concatenate across the silence? I just want it to split at silence. Is there a way? I want the transcription timing to be accurate enough for dubbing.

MahmoudAshraf97 commented 2 weeks ago

The easiest solution is to use word_timestamps=True and then align the words as you like. Other than that, you'll have to customize the code to behave the way you want.
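
One possible way to do that alignment, as a rough sketch rather than a built-in feature: collect the words and start a new segment whenever the gap between two consecutive words exceeds a threshold. The helper `split_on_silence`, the `max_gap_s` parameter, and the `Word` stand-in class are hypothetical names introduced here. The stand-in only mirrors the `start`, `end`, and `word` fields that faster-whisper exposes on each word when `word_timestamps=True` is set.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Stand-in for illustration; mirrors the start/end/word fields of a
    # faster-whisper word so the sketch runs without the library.
    start: float
    end: float
    word: str


def split_on_silence(words, max_gap_s=0.4):
    """Group words into (start, end, text) tuples, splitting at gaps longer than max_gap_s."""
    groups = []
    current = []
    for w in words:
        # Start a new group when the pause since the previous word is too long
        if current and w.start - current[-1].end > max_gap_s:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    # Word texts usually carry their own leading space, so join without a separator
    return [(g[0].start, g[-1].end, "".join(x.word for x in g).strip()) for g in groups]


# With faster-whisper, the words could be gathered like this (not executed here):
# segments, _ = model.transcribe("ceremony.mp4", task="translate",
#                                vad_filter=True, word_timestamps=True)
# words = [w for segment in segments for w in segment.words]
# for start, end, text in split_on_silence(words, max_gap_s=0.4):
#     print("[%.2fs -> %.2fs] %s" % (start, end, text))
```

The threshold plays a similar role to what min_silence_duration_ms was expected to do in the original question, but it is applied to word timings after transcription instead of to the VAD step.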