Closed naba89 closed 4 years ago
For how we choose a video window, you can look at the `__getitem__` function in feeder.py, which has several checks, including the case where the sampled video window is missing a face. Once we get a contiguous video window (with all faces present), we choose the corresponding audio window using the crop_audio_window function. There is also a sanity check on the returned audio window.
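To make the idea concrete, here is a minimal sketch of that sampling loop. All names and constants here (FPS, SAMPLE_RATE, WINDOW_FRAMES, the helper signatures) are hypothetical stand-ins for whatever the repo's feeder.py and hparams actually define — this illustrates the resample-until-valid logic, not the project's exact code:

```python
import random

# Hypothetical constants; the real values live in the repo's hparams.
FPS = 25                 # assumed video frame rate
SAMPLE_RATE = 16000      # assumed audio sample rate
WINDOW_FRAMES = 5        # assumed number of video frames per training window


def crop_audio_window(audio, start_frame):
    """Crop the audio samples that are time-aligned with a video window
    starting at `start_frame` (a sketch of the idea, not the repo's code)."""
    start_sample = int(start_frame * SAMPLE_RATE / FPS)
    end_sample = start_sample + int(WINDOW_FRAMES * SAMPLE_RATE / FPS)
    return audio[start_sample:end_sample]


def sample_training_window(frames, face_present, audio, max_attempts=100):
    """Resample until we find a contiguous video window with a face detected
    in every frame, then crop the corresponding audio window."""
    expected_len = int(WINDOW_FRAMES * SAMPLE_RATE / FPS)
    for _ in range(max_attempts):
        start = random.randint(0, len(frames) - WINDOW_FRAMES)
        window = range(start, start + WINDOW_FRAMES)
        # Reject windows where any frame is missing a face.
        if not all(face_present[i] for i in window):
            continue
        audio_window = crop_audio_window(audio, start)
        # Sanity check on the returned audio window length.
        if len(audio_window) != expected_len:
            continue
        return [frames[i] for i in window], audio_window
    return None
```

Because the audio window is always derived from the start frame of the accepted video window, the two streams stay aligned even though frames without faces were discarded during preprocessing.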
Hi,
I am wondering about the pre-processing techniques you use on your dataset. From what I could see, you save the frames where a face is detected, but you don't clip the audio based on whether a face was detected or not. In that case, how do you ensure time alignment of the audio and video streams for training?
If my assumption is wrong, could you please point me to the correct pre-processing code?
Thanks! Nabarun