Wordcab / wordcab-transcribe

💬 ASR FastAPI server using faster-whisper and Multi-Scale Auto-Tuning Spectral Clustering for diarization.
https://wordcab.github.io/wordcab-transcribe/
MIT License

End timestamps off by 100-300ms+ #174

Closed — aleksandr-smechov closed this 1 year ago

aleksandr-smechov commented 1 year ago

End timestamps seem to be off by 100-300ms+ at times. This could possibly be due to the current "hacky" segmentation algorithm here:

https://github.com/guillaumekln/faster-whisper/blob/5c17de17713f65929c7c33add3a9735ff75a945c/faster_whisper/transcribe.py#L734

One solution could be to monkey-patch faster-whisper to A) use VAD timestamps to get the median and maximum values, and B) add punctuation "durations" to the previous word. Something along the lines of:

```python
import numpy as np

# A) Derive a per-word duration ceiling from the VAD timestamps.
#    speech_chunks are VAD segments with "start"/"end" in samples at 16 kHz.
word_durations = np.array(
    [round((chunk["end"] - chunk["start"]) / 16000, 2) for chunk in speech_chunks]
)
word_durations = word_durations[word_durations.nonzero()]

median_duration = np.median(word_durations) if len(word_durations) > 0 else 0.0
max_duration = median_duration * 2

if len(word_durations) > 0:
    sentence_end_marks = ".。!!??"
    for i in range(len(alignment)):
        if alignment[i]["end"] - alignment[i]["start"] > max_duration:
            if alignment[i]["word"] in sentence_end_marks:
                # Over-long punctuation token: clamp its end.
                alignment[i]["end"] = alignment[i]["start"] + max_duration
            elif i > 0 and alignment[i - 1]["word"] in sentence_end_marks:
                # Over-long word right after punctuation: pull its start forward.
                alignment[i]["start"] = alignment[i]["end"] - max_duration
        elif i > 0 and alignment[i]["word"] in sentence_end_marks:
            # B) Fold part of the punctuation's duration into the previous word;
            #    the 1.5 divisor is arbitrary.
            alignment[i - 1]["end"] += (alignment[i]["end"] - alignment[i]["start"]) / 1.5
```
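As a rough, self-contained illustration of the clamping part of the logic above (the alignment entries and `max_duration` value here are made up; in practice `max_duration` comes from the VAD chunk durations):

```python
# Toy demo of the duration-clamping step, with fabricated alignment data.
max_duration = 0.6  # hypothetical ceiling; normally 2x the median VAD word duration
sentence_end_marks = ".。!!??"

alignment = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": ".", "start": 0.4, "end": 1.6},  # punctuation with an inflated duration
    {"word": "world", "start": 1.6, "end": 2.0},
]

for i in range(len(alignment)):
    if alignment[i]["end"] - alignment[i]["start"] > max_duration:
        if alignment[i]["word"] in sentence_end_marks:
            # Over-long punctuation: clamp its end to start + max_duration.
            alignment[i]["end"] = alignment[i]["start"] + max_duration
        elif i > 0 and alignment[i - 1]["word"] in sentence_end_marks:
            # Over-long word following punctuation: pull its start forward.
            alignment[i]["start"] = alignment[i]["end"] - max_duration

print(alignment[1])  # the "." token's end is clamped from 1.6 down to 1.0
```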
aleksandr-smechov commented 1 year ago

This seems to work as well: https://github.com/guillaumekln/faster-whisper/pull/123/files

EDIT: I've noticed issues with this solution: timestamps sometimes "go backwards". The original example above works well across three tests.
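For catching the "go backwards" failure mode, a simple monotonicity check over the word list could help (this is a hypothetical helper, not part of either patch):

```python
def timestamps_monotonic(words):
    """Return True if no word ends before it starts and no word
    starts earlier than the previous word's end."""
    prev_end = 0.0
    for w in words:
        if w["end"] < w["start"] or w["start"] < prev_end:
            return False
        prev_end = w["end"]
    return True

# Example: the second word starts before the first one ends.
words = [
    {"word": "ok", "start": 0.0, "end": 0.5},
    {"word": "then", "start": 0.3, "end": 0.8},
]
print(timestamps_monotonic(words))  # False
```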