fudan-generative-vision / hallo

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
https://fudan-generative-vision.github.io/hallo/
MIT License
9.14k stars 1.25k forks

Inquiry Regarding Reference Net #183

Open Nyquist0 opened 4 weeks ago

Nyquist0 commented 4 weeks ago

Dear Sir or Madam,

I am writing to ask why you use a reference net to preserve the identity features. I think that directly integrating the face embedding into the denoising net via cross-attention should also work.

May I ask why you use a reference net? I am guessing it follows EMO, but is there a principled reason for it?

Looking forward to your reply. Best~

xumingw commented 3 weeks ago

The reference net helps preserve both the face appearance and the background. Many Stable Diffusion-based models follow this approach.
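
A toy sketch of one way to see the difference, not taken from the Hallo code: if the face embedding is a single global token, cross-attention returns the same vector for every query position, so it cannot carry spatially varying detail such as the background. Attending over several spatial reference tokens gives position-dependent outputs. The token values, the `attend` helper, and the assumption of a single-token face embedding are all hypothetical; real face encoders may emit several tokens, and reference-net designs typically combine the reference features with the denoising net's own tokens rather than attending to them alone.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Plain scaled dot-product attention on Python lists (toy helper).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

# Two query tokens standing in for two spatial positions in the denoising net.
queries = [[1.0, 0.0], [0.0, 1.0]]

# (a) Assumed single global face-embedding token used as key and value.
face_embedding = [[0.5, 0.5]]
out_global = attend(queries, face_embedding, face_embedding)

# (b) Assumed spatial reference tokens, one per reference-image location.
ref_tokens = [[4.0, 0.0], [0.0, 4.0]]
out_ref = attend(queries, ref_tokens, ref_tokens)

print("global embedding:", out_global)  # identical rows
print("reference tokens:", out_ref)     # rows differ per query position
```

With one key, the softmax weight is always 1, so both rows of `out_global` equal the embedding itself. In `out_ref`, each query pulls mostly from the reference location it matches, which is the kind of per-location detail a background needs.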