HumanAIGC / AnimateAnyone

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
Apache License 2.0

Some problems with my unofficial implementation #43

Open MingtaoGuo opened 8 months ago

MingtaoGuo commented 8 months ago

Hi,

I have unofficially reproduced 'Animate Anyone' based on the description in your paper, but I encountered two issues during training:

Training on a single GPU with a batch size of 2, I have run 8k iterations so far. The backgrounds of the generated images differ noticeably from those of the target images, which are pure white (see the third row in the figure below).

The faces reconstructed by the VAE decoder are distorted. I'm wondering whether the latent diffusion model could be used to recover the information lost by the VAE and correct the distorted faces (a minimal roundtrip check is sketched below the figure). In your video demo the faces look sharp, and I'm not sure how to address this issue.

[Figure: training samples; the third row shows the generated images]
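For reference, here is a minimal sketch of a VAE roundtrip check that isolates how much face detail is lost by encode/decode alone, independent of the denoising UNet. It assumes the diffusers `AutoencoderKL`; the `sd-vae-ft-mse` checkpoint name and the `face_crop.png` path are only illustrative choices, not necessarily what is used in this repo:

```python
# Minimal VAE roundtrip check: how much detail is lost by encode/decode alone,
# before the denoising UNet is involved. Assumes the diffusers AutoencoderKL;
# the checkpoint name below is only an example.
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device).eval()

def vae_roundtrip(path: str) -> Image.Image:
    # Load the image and normalize to [-1, 1], the range the VAE expects.
    img = Image.open(path).convert("RGB").resize((512, 512))
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
    x = x.permute(2, 0, 1).unsqueeze(0).to(device)

    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample()
        recon = vae.decode(latents).sample

    # Map back to [0, 255] uint8 for visual comparison with the input.
    recon = ((recon.clamp(-1, 1) + 1) / 2 * 255).round()
    recon = recon.squeeze(0).permute(1, 2, 0).byte().cpu().numpy()
    return Image.fromarray(recon)

# Example: compare a face crop before and after the roundtrip.
vae_roundtrip("face_crop.png").save("face_crop_roundtrip.png")
```

If the distortion already shows up after this encode/decode alone, the face detail is being lost before the diffusion model ever sees it, so no amount of UNet training can bring it back.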