Closed — JusperLee closed this issue 3 years ago
Yes, the same visual network is used to extract visual features for both speakers, and the same U-Net is used to separate both speakers' voices as well. In the with-context version, a single pass separates the voices of both speakers directly.
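To make the shared-parameter arrangement concrete, here is a toy Python sketch (the names `VisualEncoder` and `extract` are hypothetical, not from this repo): a single encoder instance is applied to each speaker's frames in turn, so both feature streams are computed with one set of weights.

```python
class VisualEncoder:
    """Toy stand-in for a shared visual feature extractor (hypothetical)."""

    def __init__(self):
        self.weights = [0.5, 0.25]  # single parameter set, reused for every speaker

    def extract(self, frames):
        # Trivial stand-in "feature": weighted sum of each frame's values.
        return [sum(w * x for w, x in zip(self.weights, f)) for f in frames]


visual_net = VisualEncoder()                    # instantiated once
feat_a = visual_net.extract([[1, 2], [3, 4]])   # speaker A -> [1.0, 2.5]
feat_b = visual_net.extract([[5, 6], [7, 8]])   # speaker B -> [4.0, 5.5]
# Both calls run through the same object, i.e. shared parameters:
# updating visual_net.weights would change both speakers' features.
```

The same pattern applies to the separation network: one module, called once per speaker (or once for both in the with-context version), rather than separate per-speaker copies.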
Thank you very much for your answer
From the code, I found that you use visual and auditory networks with shared parameters to extract the visual and auditory features of the two speakers. I'm not sure whether this reading is correct, though, as it doesn't seem to be stated in the paper.