choijeongsoo / lip2speech-unit

[Interspeech 2023] Intelligible Lip-to-Speech Synthesis with Speech Units

Duplicate Video Frame feature #11

Closed: longkhanh-fam closed this issue 2 weeks ago

longkhanh-fam commented 1 month ago

Hi, I really want to understand your work. In your model and avhubert_model, I saw that the video features are duplicated along the time dimension to (B, T x 2, D) via x = x.repeat_interleave(2, dim=1). I also noticed that the output of the mel-spectrogram decoder is reshaped to (B, T x 2, D // 2), which is why the target padding mask is repeated 4 times along the time dimension in the criterion class.

Could you please explain why the time dimension is repeated twice like that?
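
For reference, here is a minimal, self-contained sketch of the shape changes described above. B, T, and D denote the batch size, the number of video frames at 25 fps, and the feature dimension; all sizes and tensor names here are illustrative, not taken from the repository.

```python
import torch

B, T, D = 2, 10, 512  # illustrative sizes: batch, video frames at 25 fps, feature dim

# Video / AV-HuBERT features at 25 fps
x = torch.randn(B, T, D)

# Repeat each frame twice along time: (B, T, D) -> (B, 2T, D), i.e. 25 fps -> 50 fps
x = x.repeat_interleave(2, dim=1)
assert x.shape == (B, 2 * T, D)

# Mel decoder output at 50 fps with D channels; splitting each step into two
# half-width mel frames gives (B, 2T, D) -> (B, 4T, D // 2), i.e. 50 fps -> 100 fps
dec_out = torch.randn(B, 2 * T, D)
mel = dec_out.reshape(B, 4 * T, D // 2)

# The target padding mask is defined at the 25 fps video rate, so it is repeated
# 4x along time to match the 100 fps mel-spectrogram frames
pad_mask = torch.zeros(B, T, dtype=torch.bool)
pad_mask_mel = pad_mask.repeat_interleave(4, dim=1)
assert pad_mask_mel.shape == (B, 4 * T)
```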

choijeongsoo commented 2 weeks ago

Hi, thank you for your interest in our work! Sorry for the late reply.

video: 25 fps, speech units: 50 fps, mel-spectrogram: 100 fps

To align the lengths, there are several options, such as repeating frames or using transposed convolution. We found that repeating frames before applying the Conformer works, so that’s the method we used.
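
For concreteness, here is a minimal sketch of the two alignment options mentioned above. The transposed-convolution configuration is only illustrative (chosen to double the sequence length), not the one evaluated in the paper.

```python
import torch
import torch.nn as nn

B, T, D = 2, 10, 512  # illustrative: batch, video frames at 25 fps, feature dim
video_feats = torch.randn(B, T, D)

# Option used in the paper: repeat each 25 fps frame twice before the Conformer
# so the sequence length matches the 50 fps speech units.
upsampled_repeat = video_feats.repeat_interleave(2, dim=1)               # (B, 2T, D)

# Alternative mentioned above: learn the 2x upsampling with a transposed
# convolution (kernel_size and stride set here only to double the length).
upsample = nn.ConvTranspose1d(D, D, kernel_size=2, stride=2)
upsampled_conv = upsample(video_feats.transpose(1, 2)).transpose(1, 2)   # (B, 2T, D)

assert upsampled_repeat.shape == upsampled_conv.shape == (B, 2 * T, D)
```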