SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Word-level timestamps are off by some multiplier #786

Open nikans opened 6 months ago

nikans commented 6 months ago

Hello. I've just updated the library from version 0.1.0 to 1.0.1 and noticed that the timings are incorrect, as if it were transcribing a longer audio file.

For example, 23 seconds of subtitles:

{"segments": [{"id": 1, "end": 12.16, "start": 4.500000000000001, "words": [{"end": 5.38, "start": 4.500000000000001}, {"end": 5.74, "start": 5.38}, {"end": 6.18, "start": 5.74}, {"end": 6.5, "start": 6.18}, {"end": 6.78, "start": 6.5}, {"end": 7.2, "start": 6.78}, {"end": 8.18, "start": 7.2}, {"end": 9.12, "start": 8.94}, {"end": 9.52, "start": 9.12}, {"end": 9.94, "start": 9.52}, {"end": 10.42, "start": 9.94}, {"end": 10.72, "start": 10.42}, {"end": 11.14, "start": 10.72}, {"end": 11.48, "start": 11.14}, {"end": 12.16, "start": 11.48}]}, {"id": 2, "end": 16.44, "start": 12.9, "words": [{"end": 13.08, "start": 12.9}, {"end": 13.36, "start": 13.08}, {"end": 13.74, "start": 13.36}, {"end": 14.62, "start": 13.74}, {"end": 15.28, "start": 14.62}, {"end": 15.94, "start": 15.28}, {"end": 16.44, "start": 15.94}]}, {"id": 3, "end": 22.58, "start": 19.38, "words": [{"end": 19.68, "start": 19.38}, {"end": 19.94, "start": 19.68}, {"end": 20.46, "start": 19.94}, {"end": 21.06, "start": 20.46}, {"end": 21.58, "start": 21.06}, {"end": 22.14, "start": 21.58}, {"end": 22.58, "start": 22.14}]}]}

for an 18-second audio file:

20240412040528-962.wav.zip
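One way to quantify the mismatch is to compare the real length of the WAV file with the last word end time in the output. The following is only a sketch: the helper names `wav_duration_seconds` and `timestamp_ratio` are hypothetical, it assumes a PCM WAV file readable by Python's standard `wave` module, and it assumes `segments` has the same shape as the JSON above.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def timestamp_ratio(audio_seconds, segments):
    """Return real audio length divided by the last reported word end time.

    `segments` is assumed to be a list of dicts, each with a "words" list
    whose items carry "start" and "end" times in seconds.
    """
    last_end = max(word["end"] for seg in segments for word in seg["words"])
    return audio_seconds / last_end
```

A ratio well below 1.0 means the timestamps run past the end of the audio. Because the file may end with silence after the last word, this ratio is only an upper bound on the actual scaling factor. For the output above, 18 / 22.58 is roughly 0.8.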

I've tried with the VAD filter both off and on, although I don't understand how exactly VAD should affect this. I also tried the distil and regular faster-whisper models (all medium), with the same result.

What could have gone wrong? Thanks.

nikans commented 6 months ago

The multiplier seems to be ~0.72. When the timestamps are multiplied by this value, they point to the correct audio time.
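As a stopgap, the correction described here could be applied to the output after transcription. This is a minimal sketch, not a fix for the underlying cause: `rescale_timestamps` is a hypothetical helper, the input is assumed to have the same shape as the JSON posted above, and the factor 0.72 is just the empirical value from this one file.

```python
import copy


def rescale_timestamps(result, factor):
    """Return a copy of `result` with every start/end time multiplied by `factor`.

    `result` is assumed to look like {"segments": [{"start", "end", "words": [...]}]}.
    The input is left unmodified.
    """
    scaled = copy.deepcopy(result)
    for segment in scaled["segments"]:
        segment["start"] = round(segment["start"] * factor, 3)
        segment["end"] = round(segment["end"] * factor, 3)
        for word in segment.get("words", []):
            word["start"] = round(word["start"] * factor, 3)
            word["end"] = round(word["end"] * factor, 3)
    return scaled
```

Usage would be something like `corrected = rescale_timestamps(output, 0.72)`, where `output` is the parsed JSON. A constant factor that works for one recording may not hold for others, so it is worth checking against the audio each time.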