Closed by longkhanh-fam 2 weeks ago
Hi, thank you for your interest in our work! Sorry for the late reply.
The frame rates are: video 25 fps, speech units 50 fps, mel-spectrogram 100 fps.
To align the lengths, there are several options, such as repeating frames or using transposed convolution. We found that repeating frames before applying the Conformer works, so that’s the method we used.
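As a rough illustration of the length bookkeeping, here is a minimal pure-Python sketch using the frame rates quoted in this thread. It is not the repository's code: the helper names `repeat_frames`, `split_frames`, and `repeat_mask`, the feature sizes (4 and 160), and the one-second example are all hypothetical. In PyTorch, the first step corresponds to `x.repeat_interleave(2, dim=1)`, and the second to a row-major reshape from (B, T, D) to (B, T x 2, D // 2).

```python
def repeat_frames(frames, factor=2):
    """Repeat every frame `factor` times along time: (T, D) -> (T * factor, D)."""
    return [list(frame) for frame in frames for _ in range(factor)]


def split_frames(frames, factor=2):
    """Split every frame into `factor` chunks: (T, D) -> (T * factor, D // factor).

    This matches a row-major reshape of a (T, D) array.
    """
    out = []
    for frame in frames:
        if len(frame) % factor != 0:
            raise ValueError("feature size must be divisible by factor")
        step = len(frame) // factor
        out.extend(list(frame[i * step:(i + 1) * step]) for i in range(factor))
    return out


def repeat_mask(mask, factor):
    """Repeat every padding-mask entry `factor` times along time."""
    return [m for m in mask for _ in range(factor)]


# One second of video features at 25 fps (feature size 4 is made up).
video = [[0.0] * 4 for _ in range(25)]

# 25 fps -> 50 fps: repeat each video frame twice to match the speech-unit rate.
unit_rate = repeat_frames(video, factor=2)

# Assumed decoder output: one 160-dim vector per 50 fps step (size is made up).
decoder_out = [[0.0] * 160 for _ in unit_rate]

# 50 fps -> 100 fps: each step is split into two mel frames of half the size.
mel = split_frames(decoder_out, factor=2)

# A mask defined at the video rate needs 2 x 2 = 4 repeats to cover the mel frames.
video_mask = [False] * 20 + [True] * 5
mel_mask = repeat_mask(video_mask, factor=4)

print(len(video), len(unit_rate), len(mel), len(mel_mask))  # 25 50 100 100
```

Under these assumptions, the two doublings account for the factor of 4 between the video rate and the mel-spectrogram rate.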
Hi, I'd really like to understand your work. In your model and avhubert_model, I saw that you duplicate the video features along the time dimension, giving (B, T x 2, D), with x = x.repeat_interleave(2, dim=1). I also noticed that you reshape the output of the mel-spectrogram decoder into (B, T x 2, D // 2), and that this is why the target's padding mask is repeated 4 times along the time dimension in the criterion class.
Could you please explain why the time dimension is repeated twice like that?