facebookresearch / VisualVoice

Audio-Visual Speech Separation with Cross-Modal Consistency

Why is num_frames 64 and not 75 or some other number? #8

Closed by MessyPaste 3 years ago

MessyPaste commented 3 years ago

Thanks for your great work. I was curious about the num_frames parameter. Why do we only take 64 frames of the mouth ROI for 2.55 seconds? Are the last 10 frames discarded? I can't figure it out. Thanks again.

rhgao commented 3 years ago

We convert the video to 25 fps, so 64 frames are roughly 2.55 seconds.
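To make the arithmetic concrete, here is a minimal sketch of the conversion between clip length and frame count at 25 fps. The helper names `frames_for_duration` and `duration_for_frames` are hypothetical and not taken from the repository, and rounding to the nearest frame is an assumption about how the count is obtained.

```python
FPS = 25              # frame rate stated in the reply above
CLIP_SECONDS = 2.55   # clip length mentioned in the question


def frames_for_duration(seconds: float, fps: int) -> int:
    """Number of video frames covering `seconds`, rounded to the nearest frame (assumed)."""
    return round(seconds * fps)


def duration_for_frames(num_frames: int, fps: int) -> float:
    """Length in seconds of `num_frames` frames at `fps`."""
    return num_frames / fps


print(frames_for_duration(CLIP_SECONDS, FPS))  # 2.55 * 25 = 63.75, rounds to 64
print(duration_for_frames(64, FPS))            # 2.56
print(duration_for_frames(75, FPS))            # 3.0
```

Under these assumptions, 75 frames would correspond to a 3-second clip at 25 fps (or to about 2.5 seconds at 30 fps), which may be where the expectation of 75 came from.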

MessyPaste commented 3 years ago

> We convert the video to 25 fps, so 64 frames are roughly 2.55 seconds.

Thanks for your quick reply!

So the "2.55s" clip length is chosen to give 64 frames, which is consistent with the dimension of the face embedding.

Am I right?