Open lsy492 opened 3 months ago
Hello author, I have a question after reading your paper and code. The flow chart in the paper only shows the listener being predicted for frames 32 to 39.
During training, however, the code appears to take only the speaker's frames 0 to 40 as input, predict the listener's frames 40 to 64, and compute a cross-entropy loss against the ground truth. How are frames 40 to 64 predicted, and where in the code is this implemented?
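To make the question concrete, here is a minimal sketch of the training step as I understand it. This is not taken from the repository: the function names, the feature size, the number of classes, and the random stand-in for the model output are all my own assumptions, and I am assuming half-open frame ranges (speaker frames 0..39 as input, listener frames 40..63 as target). Only the frame boundaries 40 and 64 and the use of cross-entropy come from what I saw.

```python
import numpy as np

# Frame boundaries as I observed them; everything else below is hypothetical.
SPEAKER_END = 40   # speaker frames 0..39 are used as input (assumed half-open)
TARGET_END = 64    # listener frames 40..63 are the prediction target
VOCAB_SIZE = 8     # made-up number of discrete listener classes
FEATURE_DIM = 16   # made-up speaker feature size


def split_window(speaker, listener):
    """Return the speaker input frames and the listener target frames."""
    return speaker[:SPEAKER_END], listener[SPEAKER_END:TARGET_END]


def cross_entropy(logits, targets):
    """Mean cross-entropy between per-frame logits and integer class targets."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


rng = np.random.default_rng(0)
speaker = rng.normal(size=(TARGET_END, FEATURE_DIM))    # fake speaker features
listener = rng.integers(0, VOCAB_SIZE, size=TARGET_END)  # fake listener labels

speaker_in, listener_target = split_window(speaker, listener)

# Stand-in for the model: random logits, one row per target frame.
logits = rng.normal(size=(len(listener_target), VOCAB_SIZE))
loss = cross_entropy(logits, listener_target)
```

If this matches the actual training loop, the part I cannot locate is how the 24 target frames are produced from the 40 input frames, given that the flow chart only covers frames 32 to 39.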