Wordcab / wordcab-transcribe

💬 ASR FastAPI server using faster-whisper and Multi-Scale Auto-Tuning Spectral Clustering for diarization.
https://wordcab.github.io/wordcab-transcribe/
MIT License

End timestamps off by 100-300ms+ #174

Closed — aleksandr-smechov closed this 1 year ago

aleksandr-smechov commented 1 year ago

End timestamps seem to be off by 100-300ms+ at times. This could possibly be due to the current "hacky" segmentation algorithm here:

https://github.com/guillaumekln/faster-whisper/blob/5c17de17713f65929c7c33add3a9735ff75a945c/faster_whisper/transcribe.py#L734

One solution could be to monkey-patch faster-whisper to A) use VAD timestamps to get the median and maximum values, and B) add punctuation "durations" to the previous word. Something along the lines of:

```python
import numpy as np

# A) Derive a per-word duration ceiling from the VAD timestamps.
#    speech_chunks are VAD segments with "start"/"end" in samples at 16 kHz.
word_durations = np.array(
    [round((chunk["end"] - chunk["start"]) / 16000, 2) for chunk in speech_chunks]
)
word_durations = word_durations[word_durations.nonzero()]

median_duration = np.median(word_durations) if len(word_durations) > 0 else 0.0
max_duration = median_duration * 2

if len(word_durations) > 0:
    sentence_end_marks = ".。!!??"
    for i in range(len(alignment)):
        if alignment[i]["end"] - alignment[i]["start"] > max_duration:
            if alignment[i]["word"] in sentence_end_marks:
                # Over-long punctuation token: clamp its end.
                alignment[i]["end"] = alignment[i]["start"] + max_duration
            elif i > 0 and alignment[i - 1]["word"] in sentence_end_marks:
                # Over-long word right after punctuation: pull its start forward.
                alignment[i]["start"] = alignment[i]["end"] - max_duration
        elif i > 0 and alignment[i]["word"] in sentence_end_marks:
            # B) Fold part of the punctuation's duration into the previous word;
            #    the 1.5 divisor is arbitrary.
            alignment[i - 1]["end"] += (alignment[i]["end"] - alignment[i]["start"]) / 1.5
```
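As a rough, self-contained illustration of the clamping part of the logic above (the alignment entries and `max_duration` value here are made up; in practice `max_duration` comes from the VAD chunk durations):

```python
# Toy demo of the duration-clamping step, with fabricated alignment data.
max_duration = 0.6  # hypothetical ceiling; normally 2x the median VAD word duration
sentence_end_marks = ".。!!??"

alignment = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": ".", "start": 0.4, "end": 1.6},  # punctuation with an inflated duration
    {"word": "world", "start": 1.6, "end": 2.0},
]

for i in range(len(alignment)):
    if alignment[i]["end"] - alignment[i]["start"] > max_duration:
        if alignment[i]["word"] in sentence_end_marks:
            # Over-long punctuation: clamp its end to start + max_duration.
            alignment[i]["end"] = alignment[i]["start"] + max_duration
        elif i > 0 and alignment[i - 1]["word"] in sentence_end_marks:
            # Over-long word following punctuation: pull its start forward.
            alignment[i]["start"] = alignment[i]["end"] - max_duration

print(alignment[1])  # the "." token's end is clamped from 1.6 down to 1.0
```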
aleksandr-smechov commented 1 year ago

This seems to work as well: https://github.com/guillaumekln/faster-whisper/pull/123/files

EDIT: I've noticed issues with this solution: timestamps sometimes "go backwards". The original example above works well across three tests.
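For catching the "go backwards" failure mode, a simple monotonicity check over the word list could help (this is a hypothetical helper, not part of either patch):

```python
def timestamps_monotonic(words):
    """Return True if no word ends before it starts and no word
    starts earlier than the previous word's end."""
    prev_end = 0.0
    for w in words:
        if w["end"] < w["start"] or w["start"] < prev_end:
            return False
        prev_end = w["end"]
    return True

# Example: the second word starts before the first one ends.
words = [
    {"word": "ok", "start": 0.0, "end": 0.5},
    {"word": "then", "start": 0.3, "end": 0.8},
]
print(timestamps_monotonic(words))  # False
```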