YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
https://proceedings.mlr.press/v235/yang24ai.html
MIT License

Is It Resizing or Just Fusion at Corresponding Positions? #38

Open Lne27 opened 7 months ago

Lne27 commented 7 months ago

I'd like to ask about the regional latent fusion stage: does this method really resize each regional latent to its corresponding position? Looking at the code, it seems that only the latent at the corresponding position of each regional image is cropped and fused, which is confusing.
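To make the question concrete, here is a minimal sketch (not the repo's actual code) of what "fusion at corresponding positions" would look like if the sub-region latent is simply cropped and blended in place, without resizing. The NCHW shapes, the `(top, left, height, width)` region format, and the blend weight are illustrative assumptions.

```python
import torch

def fuse_cropped_region(base_latent, region_latent, region, weight=0.5):
    """Crop the regional latent at the region's own coordinates and blend it
    into the same coordinates of the base latent (no resizing involved).
    Hypothetical helper, for illustration only."""
    top, left, h, w = region
    crop = region_latent[:, :, top:top + h, left:left + w]
    fused = base_latent.clone()
    fused[:, :, top:top + h, left:left + w] = (
        weight * crop
        + (1.0 - weight) * base_latent[:, :, top:top + h, left:left + w]
    )
    return fused

base = torch.randn(1, 4, 64, 64)      # full-image latent (e.g. 512x512 image / 8)
regional = torch.randn(1, 4, 64, 64)  # latent denoised with a regional prompt
out = fuse_cropped_region(base, regional, region=(0, 0, 64, 32))
print(out.shape)  # torch.Size([1, 4, 64, 64])
```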

CyrilSterling commented 6 months ago

Yes, I have the same question. The latent feature of the sub-region is directly cropped, not resized:
https://github.com/YangLing0818/RPG-DiffusionMaster/blob/d2a26e9d199253ee49e75d348d4047d416a5b4e8/cross_attention.py#L127-L128
The cropped features are then fused with the corresponding positions of the base latent features:
https://github.com/YangLing0818/RPG-DiffusionMaster/blob/d2a26e9d199253ee49e75d348d4047d416a5b4e8/cross_attention.py#L129-L133
This does not appear to be the resizing described in the paper. I'd also like to know why it is done this way: is it because resizing wouldn't make sense here?
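For contrast, a hedged sketch of the "resize" reading of the paper: the whole regional latent is resized to the target region's size and then placed at that position, rather than cropped. Again, the function name, shapes, and `(top, left, height, width)` region format are assumptions, not the repo's implementation.

```python
import torch
import torch.nn.functional as F

def fuse_resized_region(base_latent, region_latent, region, weight=0.5):
    """Resize the full regional latent to the region's size, then blend it
    into the corresponding position of the base latent.
    Hypothetical helper, for illustration only."""
    top, left, h, w = region
    resized = F.interpolate(region_latent, size=(h, w),
                            mode="bilinear", align_corners=False)
    fused = base_latent.clone()
    fused[:, :, top:top + h, left:left + w] = (
        weight * resized
        + (1.0 - weight) * base_latent[:, :, top:top + h, left:left + w]
    )
    return fused

base = torch.randn(1, 4, 64, 64)
regional = torch.randn(1, 4, 64, 64)
out = fuse_resized_region(base, regional, region=(0, 0, 64, 32))
print(out.shape)  # torch.Size([1, 4, 64, 64])
```

The practical difference is what content ends up in the region: cropping keeps only the part of the regional latent that already overlaps the target area, while resizing squeezes the entire regional latent into it.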