Hangz-nju-cuhk / Talking-Face-Generation-DAVS

Code for Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)
MIT License

A_select in training code #30

Open zfang399 opened 5 years ago

zfang399 commented 5 years ago

Hi, thank you so much for sharing your code! While going through it, there were several things that confused me:

  1. I found this line that doesn't really make sense to me: https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS/blob/209543944a05dc5b33f1ad019e04341b29388c5a/Gen_final_v1.py#L142 Here the code seems to pick the "input" frame, but why is the upper limit of the random function set to 28? It seems that each training sample should only have 25 frames... Is this a typo?

  2. How do you actually process the audio & video inputs? More precisely, does each video frame correspond to 1/25 s of audio MFCC features?

  3. Also, videos in the LRW dataset contain not only the labeled word but also some other words. Do you perform any preprocessing so that the network only focuses on that single word, or do you use the entire video clip?

Thanks a lot!!

Hangz-nju-cuhk commented 5 years ago

@zfang399 Thank you for your interest.

  1. Each video in the LRW dataset is about 1.2 s long, which corresponds to 29 frames per video.
  2. We use the save_mfccs.m file to preprocess the audio; please look into it for the details.
  3. We use the entire video clip.
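For anyone else following this thread, here is a minimal sketch of how those numbers fit together. It is an illustration only, not the repository's actual code; the variable names are made up, and the 40 ms-per-frame audio alignment is an assumption (the real preprocessing is in save_mfccs.m).

```python
import random

# Each LRW clip is roughly 1.2 s at 25 fps, i.e. 29 frames (indices 0..28),
# so an upper limit of 28 is the last valid 0-based frame index, not a typo.
n_frames = 29
input_idx = random.randint(0, n_frames - 1)  # stdlib randint is inclusive: 0..28

# If the audio features are aligned with the 25 fps video, each video frame
# would cover 1/25 s = 40 ms of MFCC features (an assumption here; see
# save_mfccs.m for the actual audio preprocessing).
audio_window_sec = 1.0 / 25  # 40 ms per video frame
```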