Open mysxs opened 1 year ago
Hi,
The audio and visual features are aligned at the frame level. The original frame rates for audio (i.e., filterbank features) and video are 100 and 25 frames per second, respectively. We concatenate every 4 consecutive audio feature frames into one frame before feeding them into the model, which makes the audio and video inputs the same length. The audio and visual features are then concatenated frame by frame for fusion.
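For readers who want to see the shapes involved, here is a minimal NumPy sketch of the stacking and per-frame concatenation described above. It is an illustration under stated assumptions, not the repository's actual code: the function names `stack_audio_frames` and `fuse_audio_video`, the feature dimensions (26 and 512), and the choice to drop trailing frames that do not fill a group of 4 are all hypothetical.

```python
import numpy as np


def stack_audio_frames(audio_feats, stack=4):
    """Group every `stack` consecutive audio frames into one longer frame.

    (T_audio, F) -> (T_audio // stack, F * stack)
    """
    n_frames, feat_dim = audio_feats.shape
    # Assumption: trailing frames that do not fill a full group are dropped.
    usable = (n_frames // stack) * stack
    return audio_feats[:usable].reshape(usable // stack, feat_dim * stack)


def fuse_audio_video(audio_feats, video_feats, stack=4):
    """Concatenate stacked audio and video features frame by frame."""
    stacked = stack_audio_frames(audio_feats, stack)
    # Assumption: trim both streams to the shorter one if lengths differ.
    n = min(len(stacked), len(video_feats))
    return np.concatenate([stacked[:n], video_feats[:n]], axis=1)


# 4 seconds of input; the feature sizes are placeholders.
audio = np.random.randn(400, 26)   # 100 frames per second
video = np.random.randn(100, 512)  # 25 frames per second
fused = fuse_audio_video(audio, video)
print(fused.shape)  # (100, 616)
```

After stacking, the audio stream has 100 frames of dimension 26 * 4 = 104, matching the 100 video frames, so each fused frame has 104 + 512 = 616 dimensions.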
Hi, your answer is very helpful to me. Thank you very much!
I have read your paper, but I still don't quite understand how the two modalities (audio and video) are aligned and fused. Could you explain? Thank you!