Duplicate Video Frame feature

choijeongsoo / lip2speech-unit

[Interspeech 2023] Intelligible Lip-to-Speech Synthesis with Speech Units

Hi, I am trying to understand your work. In your model and in avhubert_model, I saw that you duplicate the video features along the time dimension, giving shape (B, T x 2, D), with x = x.repeat_interleave(2, dim=1). I also noticed that you reshape the output of the mel-spectrogram decoder to (B, T x 2, D // 2). That is why the target padding mask is repeated 4 times along the time dimension in the criterion class.

Could you please explain why the time dimension is repeated twice like that?
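To make the shapes in my question concrete, here is a minimal pure-Python sketch of the bookkeeping for a single sequence (no batch dimension). It is not the repository's code: the helper names `repeat_interleave_time` and `split_feature_into_time` are hypothetical, the toy sizes T = 3 and D = 4 are made up, and I am assuming the decoder keeps the time length of its input and that the reshape is a plain row-major one.

```python
def repeat_interleave_time(frames, repeats=2):
    """List analogue of x.repeat_interleave(repeats, dim=1) for one sequence:
    every time step is emitted `repeats` times in a row."""
    return [frame for frame in frames for _ in range(repeats)]


def split_feature_into_time(frames, factor=2):
    """List analogue of a row-major reshape from (T, D) to (T * factor, D // factor):
    each feature vector is cut into `factor` consecutive chunks, and every chunk
    becomes its own time step."""
    out = []
    for frame in frames:
        step = len(frame) // factor
        for k in range(factor):
            out.append(frame[k * step:(k + 1) * step])
    return out


# Toy video features: T = 3 frames, D = 4 values per frame (made-up numbers).
video = [[t * 10 + d for d in range(4)] for t in range(3)]

# Step 1: duplicate every video frame along time -> (T x 2, D).
upsampled = repeat_interleave_time(video, 2)

# Step 2: assumption for this sketch only: the decoder output has the same
# shape as its input, so the upsampled features stand in for it here.
decoder_out = upsampled

# Step 3: reshape the decoder output -> (T x 2 x 2, D // 2).
mel_frames = split_feature_into_time(decoder_out, 2)

# The padding mask is defined per original video frame, so it has to be
# repeated 2 * 2 = 4 times to line up with the final time axis.
padding_mask = [False, False, True]
mask_4x = repeat_interleave_time(padding_mask, 4)

print(len(video), len(upsampled), len(mel_frames), len(mask_4x))
```

With these toy sizes the final time axis is 4 times longer than the original one, which matches the 4x repeat of the padding mask. What I still do not understand is the motivation for the two doubling steps.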

Duplicate Video Frame feature #11