facebookresearch / av_hubert

A self-supervised learning framework for audio-visual speech

How to align and fuse acoustic and visual features #65

Open mysxs opened 1 year ago

mysxs commented 1 year ago

I have read your paper, but I still don't quite understand how the two modalities are aligned and fused. Could you explain? Thank you!!

chevalierNoir commented 1 year ago

Hi,

The audio and visual features are aligned at the frame level. The original frame rates of the audio (i.e., filterbank features) and the video are 100 and 25 frames per second, respectively. We concatenate every 4 consecutive audio feature frames into one frame before feeding them into the model, so the audio and video inputs have the same length. The audio and visual features are then concatenated frame by frame for fusion.
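As a rough illustration of the alignment described above, here is a minimal NumPy sketch. It is not taken from the repository. The function names (`stack_audio_frames`, `fuse`) and the feature dimensions (26 and 512) are hypothetical, and it assumes trailing audio frames are zero-padded. In the actual model the concatenation may be applied to encoded features rather than raw inputs; the sketch only shows how stacking 4 audio frames brings both streams to the same length.

```python
import numpy as np

def stack_audio_frames(audio_feats, stack=4):
    """Turn (T_audio, F) into (ceil(T_audio / stack), F * stack).

    Consecutive groups of `stack` frames are flattened into one frame.
    The tail is zero-padded (an assumption for this sketch).
    """
    n_frames, feat_dim = audio_feats.shape
    remainder = n_frames % stack
    if remainder:
        pad = np.zeros((stack - remainder, feat_dim), dtype=audio_feats.dtype)
        audio_feats = np.concatenate([audio_feats, pad], axis=0)
    return audio_feats.reshape(-1, stack * feat_dim)

def fuse(audio_feats, video_feats, stack=4):
    """Stack the audio frames, then concatenate with video per frame."""
    stacked = stack_audio_frames(audio_feats, stack)
    # Trim both streams to the shorter one so they align frame by frame.
    n = min(len(stacked), len(video_feats))
    return np.concatenate([stacked[:n], video_feats[:n]], axis=1)

# One second of input: 100 audio frames and 25 video frames.
audio = np.random.randn(100, 26)   # 26-dim filterbank features (hypothetical size)
video = np.random.randn(25, 512)   # 512-dim visual features (hypothetical size)
fused = fuse(audio, video)
print(fused.shape)  # (25, 616): 4 * 26 audio dims + 512 video dims per frame
```

After stacking, each of the 25 fused frames covers the same 40 ms window in both streams.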

mysxs commented 1 year ago

Hi, your answer is very helpful to me, thank you very much!!!