Rudrabha / Lip2Wav

This is the repository containing the code for our CVPR 2020 paper, "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis".

Dataset preprocessing #12

Closed · naba89 closed this 4 years ago

naba89 commented 4 years ago

Hi,

I am wondering about the pre-processing techniques you use on your dataset. From what I could see, you save the frames where a face is detected, but you don't clip the audio based on whether a face was detected. In that case, how do you ensure time alignment of the audio and video streams for training?

If my assumption is wrong, could you please point me to the correct pre-processing code?

Thanks! Nabarun

prajwalkr commented 4 years ago

For how we choose a video window, see the `__getitem__` function in feeder.py, which has several checks, including the case where the sampled video window is missing a face. Once we get a contiguous video window (with a face present in every frame), we crop the corresponding audio window using the `crop_audio_window` function. There is also a sanity check on the returned audio window.
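For anyone landing here later, here is a minimal sketch of the alignment idea described above. It is not the repo's actual code: the frame rate, sample rate, window length, and the `frames`/`wav` representations are all illustrative assumptions; only the `crop_audio_window` name mirrors the repo. The point is that the audio crop is derived from the sampled video frame index, so the two streams stay time-aligned by construction.

```python
import random
import numpy as np

# Illustrative constants -- the real values live in the repo's hparams.
FPS = 25     # assumed video frame rate
SR = 16000   # assumed audio sample rate
T = 25       # assumed number of frames per training window

def crop_audio_window(wav, start_frame):
    """Crop the audio segment time-aligned with the video window
    starting at `start_frame` (mirrors the crop_audio_window idea)."""
    start_sample = int(start_frame * SR / FPS)
    end_sample = start_sample + int(T * SR / FPS)
    return wav[start_sample:end_sample]

def sample_aligned_window(frames, wav):
    """Sample a contiguous video window in which every frame has a
    detected face, then crop the matching audio; retry on failure.
    `frames` is a list of face crops, with None for missing faces."""
    while True:
        start = random.randint(0, len(frames) - T)
        window = frames[start:start + T]
        if any(f is None for f in window):
            continue  # reject windows with a missing face
        audio = crop_audio_window(wav, start)
        if len(audio) != int(T * SR / FPS):
            continue  # sanity check: audio window must be full length
        return np.stack(window), audio
```

Because the audio indices are computed from `start_frame` rather than from a separate detection pass, dropping undetected frames at preprocessing time does not break alignment, as long as the saved frames keep their original indices.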