Hangz-nju-cuhk / Talking-Face-Generation-DAVS

Code for Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)
MIT License

A_select in training code #30

Open zfang399 opened 5 years ago

zfang399 commented 5 years ago

Hi, thank you so much for sharing your code! While going through it, there were several things that confused me:

  1. I found this line that doesn't really make sense to me: https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS/blob/209543944a05dc5b33f1ad019e04341b29388c5a/Gen_final_v1.py#L142 Here the code seems to pick the "input" frame, but why is the upper limit of the random function set to 28? It seems that each training sample should only have 25 frames... Is this a typo?

  2. How do you actually process the audio & video inputs? More precisely, does each video frame correspond to 1/25 s of audio MFCC features?

  3. Also, videos in the LRW dataset contain not only the labeled word but also some other words. Do you perform any preprocessing so that the network only focuses on that single word, or do you use the entire video clip?

Thanks a lot!!

Hangz-nju-cuhk commented 5 years ago

@zfang399 Thank you for your interest.

  1. Each video in the LRW dataset is about 1.2 s long, which corresponds to 29 frames per video.
  2. We use the save_mfccs.m file to preprocess the audio; please look into it for the details.
  3. We use the entire video clip.
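For anyone else following this thread, here is a minimal sketch of how those numbers fit together. It is an illustration only, not the repository's actual code; the variable names are made up, and the 40 ms-per-frame audio alignment is an assumption (the real preprocessing is in save_mfccs.m).

```python
import random

# Each LRW clip is roughly 1.2 s at 25 fps, i.e. 29 frames (indices 0..28),
# so an upper limit of 28 is the last valid 0-based frame index, not a typo.
n_frames = 29
input_idx = random.randint(0, n_frames - 1)  # stdlib randint is inclusive: 0..28

# If the audio features are aligned with the 25 fps video, each video frame
# would cover 1/25 s = 40 ms of MFCC features (an assumption here; see
# save_mfccs.m for the actual audio preprocessing).
audio_window_sec = 1.0 / 25  # 40 ms per video frame
```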