Closed justin4ai closed 4 months ago
Also, your notation for the regionally masked attention output for poses seems to change from b_t^s to a_t^s without any explanation.
Thanks for your attention! First, we resize the three masks to different scales to match the latent shape. Then we apply the Hadamard product. As for the notation mistake, I apologize for that. Thanks!
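To make the two steps above concrete, here is a minimal sketch, not the project's actual code. It assumes a 2D binary mask at image resolution, an attention output of shape (C, h, w) at latent resolution, and nearest-neighbour resizing; the function names `resize_mask_nearest` and `masked_attention_output` are hypothetical, and the real implementation may use a different interpolation mode or tensor layout.

```python
import numpy as np


def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2D mask to (out_h, out_w).

    Assumption: nearest-neighbour is used here only for illustration.
    """
    in_h, in_w = mask.shape
    # For each output row/column, pick the corresponding source index.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]


def masked_attention_output(attn_out, mask):
    """Resize `mask` to the latent size, then apply the Hadamard product.

    attn_out: array of shape (C, h, w), the cross-attention output.
    mask:     array of shape (H, W), a mask at image resolution.
    """
    _, h, w = attn_out.shape
    small_mask = resize_mask_nearest(mask, h, w)
    # Broadcast the (h, w) mask over the channel axis.
    return attn_out * small_mask[None, :, :]


# Hypothetical sizes: a 512x512 mask applied at two latent scales.
image_mask = np.zeros((512, 512))
image_mask[:256, :] = 1.0  # keep only the upper half of the face
for size in (64, 32):
    o = np.ones((4, size, size))
    print(masked_attention_output(o, image_mask).shape)
```

The same image-resolution mask is resized separately for each latent scale, so the element-wise product is always taken between arrays of matching spatial size.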
Thanks for your kind and quick response :)
Hello, first of all, I really appreciate you providing the world with such a great project.
I am reading your paper line by line and have a question about the following part:
[Image: excerpt from the paper]
To my knowledge, the masks M are obtained from the face image I and therefore have the same spatial size as it, while o, the output of cross-attention between the latent representations and the audio features, has the same spatial size as the latent.
By my understanding, these spatial sizes won't match. Is this what you mean by "different scaled latent representations" (the phrase in red)? If so, how is the Hadamard product between two different spatial sizes implemented?
I'm looking forward to hearing from you about this question. Thanks for the great project!
Cheers, Justin