SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Preserving pts time gap on the audio to maintain video/audio sync #128

Open dodysw2 opened 1 year ago

dodysw2 commented 1 year ago

When Whisper is used to transcribe a video stream such as an mp4, the original stream (e.g. RTMP) sometimes has network issues that result in pts (presentation timestamp) jumps. Currently, the ffmpeg pipeline in faster-whisper's audio loading eliminates these jumps when demuxing the input, so the transcription ends up out of sync with the video. That is, the resulting audio is shorter than the video.

One suggestion is to add async resampling to the ffmpeg pipeline. I have tried this and it worked: I converted the input to .wav with that ffmpeg CLI command before passing it to Whisper. A related issue is discussed here: https://stackoverflow.com/questions/52845150/use-ffmpeg-to-export-audios-with-gaps-filled . However, this is an extra hoop that slows down transcription, and it would be great if the same resampling were done directly within faster-whisper, maybe as an option to transcribe().
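
For reference, the workaround could look roughly like the sketch below. It is only an illustration: the exact command is not given above, so the `aresample=async=1` filter (taken to be what "resample async" refers to, following the linked Stack Overflow question), the mono 16 kHz output, and the helper names `build_ffmpeg_cmd` and `convert_preserving_gaps` are all assumptions.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that extracts audio and fills timestamp gaps.

    Hypothetical helper; the flags are an assumed reconstruction of the
    workaround, not the command actually used.
    """
    return [
        "ffmpeg",
        "-y",                        # overwrite dst if it already exists
        "-i", src,
        "-vn",                       # drop the video stream
        "-af", "aresample=async=1",  # fill/trim audio to follow input timestamps
        "-ac", "1",                  # downmix to mono
        "-ar", str(sample_rate),     # output sample rate in Hz
        dst,
    ]


def convert_preserving_gaps(src: str, dst: str) -> Path:
    """Run ffmpeg on src; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
    return Path(dst)
```

The resulting .wav path can then be passed to `WhisperModel.transcribe()` as usual, which is the extra step this issue asks to avoid.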

Thanks.

guillaumekln commented 1 year ago

Do you know where I can find a file with these time gaps so that I can try implementing a solution?

dodysw2 commented 1 year ago

Here's one sample (originally a .ts file, renamed to .mp4 so it could be uploaded): https://user-images.githubusercontent.com/46476260/230732618-b7b499d7-8c02-4811-82aa-bfc99528c1c1.mp4