Hi, authors. Thanks for your great work. I have a question about stage 1 training. It does not take audio features as input, so what is the purpose of stage 1? Consider this: we have the same reference image and the same reference image embedding, yet we have to generate two different images? Thanks.
Stage 1 trains the ReferenceNet and the spatial part of the denoising UNet. Given a reference image, it should generate a random image while keeping the main features of the reference image.
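To make that concrete, here is a rough sketch of how I picture one stage 1 step being set up. It is not taken from the repository. It assumes the "random image" is another frame drawn at random from the same clip as the reference, and the parameter prefixes (`reference_net.`, `denoising_unet.`) and the `temporal` / `audio` substrings are hypothetical names used only for illustration.

```python
import random


def sample_stage1_pair(num_frames, rng=random):
    """Pick a (reference, target) frame index pair from one clip.

    Assumption: the target is a different, randomly chosen frame of the
    same clip, so the model must keep the identity and appearance of the
    reference frame while the pose or expression differs.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a pair")
    # sample() draws without replacement, so the two indices always differ.
    ref_idx, tgt_idx = rng.sample(range(num_frames), 2)
    return ref_idx, tgt_idx


def stage1_trainable(param_names):
    """Return the parameter names that would be trained in stage 1.

    Hypothetical naming scheme: everything under "reference_net." is
    trained, and everything under "denoising_unet." is trained except
    temporal and audio modules, which are left for a later stage.
    """
    trainable = []
    for name in param_names:
        in_reference_net = name.startswith("reference_net.")
        in_unet_spatial = (
            name.startswith("denoising_unet.")
            and "temporal" not in name
            and "audio" not in name
        )
        if in_reference_net or in_unet_spatial:
            trainable.append(name)
    return trainable
```

Under this reading, the same reference image can be paired with many different targets, which is why audio is not needed yet: stage 1 only has to learn to carry appearance over from the reference.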