How to predict 32 to 64 frames of the speaker when 40 frames of speaker information are input during training

evonneng / learning2listen

Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Issue #17

Open lsy492 opened 3 months ago

lsy492 commented 3 months ago

Hello author, while reading your paper and code I ran into a question. The flow chart in the paper only shows the listener being predicted for frames 32 to 39. How are frames 40 to 64 predicted, and where in the code is this implemented?

lsy492 commented 3 months ago

During training, as far as I can tell, the code only takes the speaker's information for frames 0 to 40 as input, predicts the listener's information for frames 40 to 64, and computes a cross-entropy loss against the ground truth.
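
The training step described above can be sketched roughly as follows. This only illustrates the description in this comment and is not code from the repository. The names `split_window`, `cross_entropy`, and `sequence_loss` are hypothetical, and both the 8-frames-per-token grouping and the tiny codebook size are assumptions chosen to keep the example small.

```python
import math

SPEAKER_FRAMES = 40                 # speaker context described above
TARGET_START, TARGET_END = 40, 64   # listener frames to be predicted
FRAMES_PER_TOKEN = 8                # assumption: one discrete token covers 8 frames
CODEBOOK_SIZE = 4                   # assumption: toy codebook, far smaller than a real one


def split_window(speaker_frames, listener_tokens):
    """Return (speaker context, listener target tokens) for one sample."""
    context = speaker_frames[:SPEAKER_FRAMES]
    first = TARGET_START // FRAMES_PER_TOKEN
    last = TARGET_END // FRAMES_PER_TOKEN
    return context, listener_tokens[first:last]


def cross_entropy(logits, target):
    """Cross-entropy for one token: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target]


def sequence_loss(all_logits, targets):
    """Mean cross-entropy over all predicted tokens."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    return sum(losses) / len(losses)


# Toy sample: 64 speaker frames and 64 / 8 = 8 listener tokens.
speaker = list(range(64))
listener_tokens = [0, 1, 2, 3, 0, 1, 2, 3]
context, targets = split_window(speaker, listener_tokens)

# Stand-in for a model: uniform logits for each of the 3 target tokens.
logits = [[0.0] * CODEBOOK_SIZE for _ in targets]
loss = sequence_loss(logits, targets)
```

Under these assumptions, frames 40 to 64 correspond to three target tokens, and the loss is averaged over them. Whether the repository predicts those tokens in one pass or one at a time is exactly what the question above asks, so the sketch does not take a position on that.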