train stage1, not use audio feature, only learn the image generation?

fudan-generative-vision / hallo

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

https://fudan-generative-vision.github.io/hallo/

MIT License

9.49k stars 1.3k forks

train stage1, not use audio feature, only learn the image generation? #158

Open monkeyCv opened 4 months ago

monkeyCv commented 4 months ago

Hi, authors. Thanks for your great work. I have a question about stage 1 training. It does not take audio features as input, so what is the purpose of stage 1? Consider this: given the same reference image and the same reference image embedding, are we expected to generate two different images? Thanks.

xumingw commented 4 months ago

Stage 1 trains the ReferenceNet and the spatial part of the denoising UNet. Given a reference image, it should generate a random image while keeping the main features of the reference image.
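To illustrate the idea of training only part of the model in stage 1, here is a minimal sketch of how parameters might be split into trainable and frozen groups by name. The parameter names and prefixes below are hypothetical and chosen only for illustration; they are not taken from the Hallo codebase, and the actual training scripts may select parameters differently.

```python
# Hypothetical parameter names for illustration only;
# the real module names in Hallo may differ.
PARAM_NAMES = [
    "reference_net.down.0.weight",
    "denoising_unet.spatial.attn1.weight",
    "denoising_unet.motion.temporal_attn.weight",
    "denoising_unet.audio.cross_attn.weight",
]

# Assumption: stage 1 updates only the reference network and the
# spatial layers of the denoising UNet, as described in the reply above.
STAGE1_TRAINABLE_PREFIXES = ("reference_net.", "denoising_unet.spatial.")


def stage1_trainable(name: str) -> bool:
    """Return True if a parameter with this name is updated in stage 1."""
    # str.startswith accepts a tuple and matches any of its entries.
    return name.startswith(STAGE1_TRAINABLE_PREFIXES)


def split_params(names):
    """Split parameter names into (trainable, frozen) lists for stage 1."""
    trainable = [n for n in names if stage1_trainable(n)]
    frozen = [n for n in names if not stage1_trainable(n)]
    return trainable, frozen


trainable, frozen = split_params(PARAM_NAMES)
```

In a PyTorch training script, the same rule could be applied by iterating over `model.named_parameters()` and setting `requires_grad` accordingly, so that the audio-related and temporal layers in this sketch receive no gradient updates during stage 1.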