Closed · CHNxindong closed this 5 months ago
I'd like to know this too.
I think your second question is already answered by your first. The only difference between Eq. (3) and Eq. (5) is the target pose: the two target poses correspond to the source image and the target image of the training pair, respectively. In a diffusion model, the latents fed to the UNet during training are the noisy target images, i.e., the noisy form of the ground-truth image. Hope this helps.
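To make the point concrete, here is a minimal sketch of one diffusion training step under DDPM-style assumptions. The names (`unet`, `pose_encoder`, the argument shapes) are hypothetical placeholders, not the repo's actual API; only the noising-the-target logic mirrors the explanation above.

```python
import torch
import torch.nn.functional as F

def training_step(unet, pose_encoder, tgt_latent, pose_src, pose_tgt, alphas_cumprod):
    """One hypothetical diffusion training step (names are illustrative)."""
    b = tgt_latent.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,))  # random timestep per sample
    noise = torch.randn_like(tgt_latent)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    # The latents fed to the UNet are the *noisy target* latents,
    # i.e., the ground-truth target image plus scheduled noise.
    noisy = a.sqrt() * tgt_latent + (1.0 - a).sqrt() * noise
    # Source and target poses concatenated into one conditioning signal.
    pose_cond = pose_encoder(torch.cat([pose_src, pose_tgt], dim=1))
    pred = unet(noisy, t, pose_cond)       # model predicts the added noise
    return F.mse_loss(pred, noise)         # the single MSE loss line
```

With dummy callables in place of the real modules, this runs end to end and returns a scalar loss, which is why a single `mse_loss` line can be the entire visible loss code.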
I see. Thanks for your quick and patient reply.
Thanks for your great work and released code!
I have two questions about the code in pose_transfer_train.py:
The losses described in the paper are a reconstruction loss and an MSE loss, but there is only one line of loss code in the implementation:
Why are pose_img_src and pose_img_tgt concatenated for the pose encoder? And why are img_src and img_tgt concatenated for the input?