fudan-generative-vision / hallo

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
https://fudan-generative-vision.github.io/hallo/
MIT License
9.49k stars, 1.3k forks

Hadamard product between two matrices of different spatial sizes in Hierarchical Audio-Visual Cross Attention #147

Closed by justin4ai 4 months ago

justin4ai commented 4 months ago

Hello, first of all, I really appreciate that you have provided the world with such a great project.

I am reading your paper line by line, and one question came up when I read the following part:

[image: screenshot of the relevant passage from the paper]

To my knowledge, the masks M are obtained from the face image I and therefore have the same spatial size as the image, while o, the output of the cross attention between the latent representations and the audio features, has the same spatial size as the latent.

By my understanding, these spatial sizes do not match. Is this what you mean by "different scaled latent representations" (the red-colored phrase)? If so, how is the Hadamard product between two tensors of different spatial sizes implemented?

I'm looking forward to hearing from you about this question. Thanks for the great project!

Cheers, Justin

justin4ai commented 4 months ago

[image: screenshot of the notation in the paper]

Also, your notation for the regionally masked attention output for poses seems to change from b_t^s to a_t^s without any explanation.

crystallee-ai commented 4 months ago

Thanks for your interest! First, we resize the three masks to different scales so that each one matches the shape of the corresponding latent. Then we apply the Hadamard product. As for the notation mistake, I apologize for that. Thanks!
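For readers following along, the resize-then-multiply step described in this reply could look roughly like the sketch below. It is not the repository's code: the function names, the nearest-neighbour interpolation, and the 512x512 image and 64/32/16 latent sizes are all assumptions made for illustration, since the reply does not say how the resizing is done.

```python
import numpy as np


def resize_mask_nearest(mask: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of a 2-D mask to (out_h, out_w).

    Hypothetical helper: the interpolation mode is an assumption.
    """
    in_h, in_w = mask.shape
    # For every output pixel, pick the source row/column it maps back to.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]


def masked_attention_output(attn_out: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hadamard product of an attention output (C, h, w) with an image-sized mask (H, W)."""
    _, h, w = attn_out.shape
    small_mask = resize_mask_nearest(mask, h, w)  # (H, W) -> (h, w)
    # Broadcast the resized mask over the channel axis, then multiply elementwise.
    return attn_out * small_mask[None, :, :]


# Illustrative sizes only: a 512x512 mask covering the top half of the image,
# applied to attention outputs at three assumed latent resolutions.
mask = np.zeros((512, 512), dtype=np.float32)
mask[:256, :] = 1.0

outputs = {}
for size in (64, 32, 16):
    attn_out = np.ones((4, size, size), dtype=np.float32)
    outputs[size] = masked_attention_output(attn_out, mask)
```

Each entry of `outputs` has the same shape as its attention output, with values kept inside the resized mask region and zeroed outside it.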

justin4ai commented 4 months ago

Thanks for your kind and quick response :)