Closed ss87021456 closed 7 years ago
We used 0.3-second synchronized segments, each consisting of 15 audio frames and 9 video frames that correspond to the same time window. Please refer to the paper for further details (Section IV).
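To illustrate the idea, here is a minimal sketch of how such aligned pairs could be built. The function and parameter names are hypothetical (not from the repository); the assumed defaults of 30 fps video and a 20 ms audio feature hop reproduce the counts stated above: 9 video frames and 15 audio frames per 0.3-second chunk.

```python
def make_synced_chunks(video_frames, audio_features,
                       video_fps=30.0, audio_hop_s=0.02, chunk_s=0.3):
    """Split video frames and audio feature frames into aligned
    fixed-length chunks (hypothetical helper, not the repo's code).

    With the assumed defaults each 0.3 s chunk holds
    9 video frames (30 fps * 0.3 s) and 15 audio frames
    (0.3 s / 0.02 s hop), matching the paper's description.
    """
    v_per_chunk = int(round(video_fps * chunk_s))    # 9 video frames
    a_per_chunk = int(round(chunk_s / audio_hop_s))  # 15 audio frames
    n_chunks = min(len(video_frames) // v_per_chunk,
                   len(audio_features) // a_per_chunk)
    pairs = []
    for i in range(n_chunks):
        v = video_frames[i * v_per_chunk:(i + 1) * v_per_chunk]
        a = audio_features[i * a_per_chunk:(i + 1) * a_per_chunk]
        pairs.append((v, a))
    return pairs
```

Because both streams are cut at the same wall-clock boundaries, each pair covers the same 0.3 seconds regardless of the two streams' different frame rates.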
Okay, thanks.
On Wed, Oct 4, 2017 at 9:21 AM, Amirsina Torfi notifications@github.com wrote:
Closed #3 https://github.com/astorfi/lip-reading-deeplearning/issues/3.
Hello, I would like to know how you handle synchronization between the lip-movement frames and the audio. Since the input FPS may vary from video to video (e.g., at 30 FPS each frame lasts about 33 ms, so each video frame corresponds to roughly 33 ms of audio), how do you construct the corresponding video-audio pairs?