facebookresearch / VisualVoice

Audio-Visual Speech Separation with Cross-Modal Consistency

Questions about the network structure #5

Closed by JusperLee 3 years ago

JusperLee commented 3 years ago

From the code, it looks like the visual network and the audio separation network each share parameters across the two speakers, i.e. the same weights process both speakers' visual and audio features. I'm not sure I've read this correctly, since the paper doesn't seem to state it.

rhgao commented 3 years ago

Yes, the same visual network is used to extract visual features for both speakers, and the same U-Net is used to separate both speakers' voices. In the with-context version, a single pass separates the voices of both speakers directly.
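
For readers following along, here is a minimal sketch of what "the same network for both speakers" means in practice. It is not the repository's code: `TinyLinear`, `visual_net`, `separator`, and `separate` are hypothetical toy stand-ins, and the feature sizes are made up. It only illustrates that one set of weights is instantiated once and called twice, once per speaker, and it does not cover the with-context single-pass variant.

```python
import random


class TinyLinear:
    """Toy stand-in for a network: one linear map with its own weights."""

    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]

    def __call__(self, x):
        # Plain matrix-vector product over Python lists.
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


# Each network is created ONCE and reused for both speakers,
# so the speakers share parameters (assumed toy dimensions).
visual_net = TinyLinear(in_dim=4, out_dim=2, seed=0)  # hypothetical visual encoder
separator = TinyLinear(in_dim=5, out_dim=3, seed=1)   # hypothetical stand-in for the U-Net


def separate(mixture, face_features):
    """Run the shared separator on the mixture, conditioned on one speaker's face."""
    visual_embedding = visual_net(face_features)
    # Concatenate the audio mixture (3 values) with the visual embedding (2 values).
    return separator(mixture + visual_embedding)


mixture = [0.2, -0.1, 0.4]       # same mixed audio for both calls
face_a = [1.0, 0.0, 0.5, -0.5]   # speaker A's visual features
face_b = [0.0, 1.0, -0.5, 0.5]   # speaker B's visual features

voice_a = separate(mixture, face_a)
voice_b = separate(mixture, face_b)
```

Both calls go through the same `visual_net` and `separator` objects; only the visual input differs, which is what steers the shared separator toward one speaker or the other.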

JusperLee commented 3 years ago

Thank you very much for your answer.