Open XiaoqiangZhou opened 1 year ago
Well, I think the actual main model is the class "BeatGANsAutoencModel" rather than the class "BeatGANsPoseGuideModel", and the multi-scale condition features are stored in the variables "enc_cond_emb", "mid_cond_emb", and "dec_cond_emb". Is that right?
Thanks for sharing this great work.
In the paper, you mention that the model can "transfer rich multi-scale texture patterns from the source image distribution to the noise prediction".
However, in the code, I find that only the last-layer feature of the encoder is used for cross attention. The [-1] index in this line selects just the last feature map:
pose_out = self.cros_attn2(x = xt_feats[-1], cond = pose_feats[-1]).mean([2,3])
Could you please briefly tell me where the "multi-scale" features for cross attention are implemented?
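To make the question concrete, here is a minimal NumPy sketch of the difference I have in mind. It is not taken from the repository: the function names `cross_attention`, `last_scale_only`, and `multi_scale` are hypothetical, the attention has no learned projections, and the feature maps have no batch dimension, so the spatial mean is over axes (1, 2) rather than [2, 3]. The feature shapes are also made up for illustration.

```python
import numpy as np

def cross_attention(x, cond):
    """Single-head cross attention without learned projections (illustrative only).

    x:    (C, H, W) query feature map, e.g. a noisy-image feature
    cond: (C, H', W') key/value feature map, e.g. a pose feature
    """
    c, h, w = x.shape
    q = x.reshape(c, -1).T                        # (H*W, C)
    kv = cond.reshape(c, -1).T                    # (H'*W', C)
    scores = q @ kv.T / np.sqrt(c)                # (H*W, H'*W')
    scores -= scores.max(axis=1, keepdims=True)   # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    out = weights @ kv                            # (H*W, C)
    return out.T.reshape(c, h, w)

def last_scale_only(xt_feats, pose_feats):
    # Mirrors the quoted line: attend on the last feature map only,
    # then average over the spatial dimensions.
    return cross_attention(xt_feats[-1], pose_feats[-1]).mean(axis=(1, 2))

def multi_scale(xt_feats, pose_feats):
    # Hypothetical alternative: one pooled attention output per scale.
    return [cross_attention(x, p).mean(axis=(1, 2))
            for x, p in zip(xt_feats, pose_feats)]

rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4)]    # assumed encoder pyramid
xt_feats = [rng.standard_normal(s) for s in shapes]
pose_feats = [rng.standard_normal(s) for s in shapes]

single = last_scale_only(xt_feats, pose_feats)
multi = multi_scale(xt_feats, pose_feats)
print(single.shape)               # one vector, from the coarsest scale only
print([m.shape for m in multi])   # one vector per scale
```

In this sketch the last-scale-only output is identical to the final entry of the multi-scale list, which is why the quoted line looks single-scale to me.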