Preserving pts time gap on the audio to maintain video/audio sync

dodysw2 commented 1 year ago

For use cases where Whisper is used to transcribe a video stream such as an mp4, the original stream (e.g. RTMP) sometimes has network issues that result in pts (presentation timestamp) jumps. Currently, when the ffmpeg pipeline in faster-whisper's audio loading demuxes the input, it eliminates these jumps, which leaves the transcription out of sync with the video. That is, the resulting audio duration is shorter than the video.
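To illustrate the effect, here is a minimal sketch of filling pts gaps with silence so the audio keeps the same timeline as the video. It is not faster-whisper's code: the function name `fill_pts_gaps` and the `(pts_seconds, samples)` frame representation are assumptions made for the example.

```python
def fill_pts_gaps(frames, sample_rate=16000):
    """Concatenate audio frames, inserting silence wherever the pts jumps.

    frames: list of (pts_seconds, samples) tuples in presentation order
            (a hypothetical representation of decoded audio frames).
    """
    out = []
    if not frames:
        return out

    # Treat the first frame's pts as time zero so a non-zero stream start
    # is not mistaken for a gap.
    base = frames[0][0]

    for pts, samples in frames:
        expected_start = round((pts - base) * sample_rate)
        missing = expected_start - len(out)
        if missing > 0:
            # The pts jumped ahead of the samples written so far:
            # pad with silence instead of dropping the gap.
            out.extend([0.0] * missing)
        # Overlapping frames (missing < 0) are appended as-is in this sketch.
        out.extend(samples)
    return out
```

Without the padding step, frames on either side of a jump are simply joined, which is why the decoded audio ends up shorter than the video.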

One suggestion is to add async resampling to the ffmpeg pipeline. I have tried this and it worked, by converting to .wav with that ffmpeg CLI command before passing the file to Whisper. A related issue is described here: https://stackoverflow.com/questions/52845150/use-ffmpeg-to-export-audios-with-gaps-filled . However, this is an extra hoop that slows down transcription, and it would be great if the same resampling were done directly within faster-whisper, maybe as an option to transcribe().
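A rough sketch of that workaround, assuming `ffmpeg` is on the PATH and using the `aresample=async=1` audio filter discussed in the linked Stack Overflow question. The helper names, the 16 kHz mono output, and the other flags are assumptions for illustration, not necessarily the exact command used above.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes a WAV with pts gaps filled."""
    return [
        "ffmpeg",
        "-y",                         # overwrite dst if it already exists
        "-i", src,
        "-vn",                        # drop the video stream
        "-af", "aresample=async=1",   # fill/trim audio to match timestamps
        "-ac", "1",                   # mono (assumed target format)
        "-ar", str(sample_rate),      # assumed target sample rate
        dst,
    ]


def convert_preserving_gaps(src, dst, sample_rate=16000):
    """Run ffmpeg and return the path of the gap-filled WAV file."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
    return dst
```

The resulting file can then be passed to `WhisperModel.transcribe()` as usual, for example `model.transcribe(convert_preserving_gaps("input.mp4", "input.wav"))`, at the cost of the extra conversion pass.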

Thanks.

guillaumekln commented 1 year ago

Do you know where I can find a file with these time gaps so that I can try implementing a solution?

dodysw2 commented 1 year ago

Here's one sample (originally a .ts file, renamed to .mp4 so it could be uploaded): https://user-images.githubusercontent.com/46476260/230732618-b7b499d7-8c02-4811-82aa-bfc99528c1c1.mp4

SYSTRAN / faster-whisper

Preserving pts time gap on the audio to maintain video/audio sync #128