FoundationVision / LlamaGen

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
https://arxiv.org/abs/2406.06525
MIT License

About RoPE in the sampling process #54

Open Leedonus opened 3 months ago

Leedonus commented 3 months ago

Hi, thanks for the interesting work. I want to know why the positional embedding (PE) of the text tokens is set to zero during the generation process?

daiyixiang666 commented 3 months ago

The PE is meant to tell the model about relative positions within the image, e.g. a pixel should relate more strongly to nearby pixels. But here we do not know the relative position between an image latent patch and the tokens of the sentence.
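Roughly, the 2D RoPE cache is built from the image grid coordinates, and the text/condition prefix is simply zero-padded. A simplified sketch of `precompute_freqs_cis_2d` (not the exact repo code, shapes illustrative only):

```python
import torch

def precompute_freqs_cis_2d(grid_size, n_elem, base=10000, cls_token_num=120):
    # Half of the rotation pairs encode the row coordinate, half the column coordinate.
    half_dim = n_elem // 2
    freqs = 1.0 / (base ** (torch.arange(0, half_dim, 2)[: half_dim // 2].float() / half_dim))
    t = torch.arange(grid_size)
    freqs = torch.outer(t, freqs)                        # (grid_size, half_dim // 2)
    freqs_grid = torch.cat([
        freqs[:, None, :].expand(-1, grid_size, -1),     # row component
        freqs[None, :, :].expand(grid_size, -1, -1),     # column component
    ], dim=-1)                                           # (grid_size, grid_size, n_elem // 2)
    cache_grid = torch.stack([freqs_grid.cos(), freqs_grid.sin()], dim=-1)
    cache = cache_grid.flatten(0, 1)                     # one (cos, sin) pair per image position
    # The text/condition prefix has no grid coordinate, so its entries are all zeros.
    return torch.cat([torch.zeros(cls_token_num, n_elem // 2, 2), cache])

freqs_cis = precompute_freqs_cis_2d(grid_size=16, n_elem=64)
print(freqs_cis.shape)  # (120 + 16*16, 32, 2); first 120 rows are zero
```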

zxduan90 commented 3 months ago

@daiyixiang666 But those zeros make the q and k of the text tokens zero, according to https://github.com/FoundationVision/LlamaGen/blob/ce98ec41803a74a90ce68c40ababa9eaeffeb4ec/autoregressive/models/gpt.py#L220, so the attention between any output token and every text token will be the same.
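A minimal sketch of that effect, assuming `apply_rotary_emb` rotates channel pairs by the precomputed (cos, sin) values as in the linked gpt.py (the shapes below are made up for the demo):

```python
import torch

def apply_rotary_emb(x, freqs_cis):
    # x: (bs, seq_len, n_head, head_dim); freqs_cis: (seq_len, head_dim // 2, 2)
    # Rotate each (even, odd) channel pair of x by its precomputed (cos, sin) pair.
    xshaped = x.float().reshape(*x.shape[:-1], -1, 2)
    freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2)
    x_out = torch.stack([
        xshaped[..., 0] * freqs_cis[..., 0] - xshaped[..., 1] * freqs_cis[..., 1],
        xshaped[..., 1] * freqs_cis[..., 0] + xshaped[..., 0] * freqs_cis[..., 1],
    ], dim=-1)
    return x_out.flatten(3).type_as(x)

bs, seq_len, n_head, head_dim = 1, 4, 1, 8
q = torch.randn(bs, seq_len, n_head, head_dim)
k = torch.randn(bs, seq_len, n_head, head_dim)

# Text positions get an all-zero freqs_cis entry instead of a (cos, sin) = (1, 0) identity.
freqs_cis = torch.zeros(seq_len, head_dim // 2, 2)
q_rot, k_rot = apply_rotary_emb(q, freqs_cis), apply_rotary_emb(k, freqs_cis)
print(q_rot.abs().max(), k_rot.abs().max())  # both 0: q and k of the text tokens vanish
# q_img @ k_text^T is therefore 0 for every text key, i.e. all text tokens get the same logit.
```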

daiyixiang666 commented 3 months ago

Oh, I see, so the text only has an effect via xv?

zxduan90 commented 3 months ago

@daiyixiang666 Yes, I think there is a problem with it.
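Since every rotated text key is zero, each query gets the same logit against every text token, so the softmax spreads a fixed, query-independent weight over the text prefix and the text enters the output only through an average of its value vectors. A small numeric sketch (hypothetical shapes, illustrative only):

```python
import torch
import torch.nn.functional as F

n_text, n_img, head_dim = 3, 5, 8
q_img  = torch.randn(1, head_dim)        # one image-token query
k_text = torch.zeros(n_text, head_dim)   # rotated text keys are all zero
k_img  = torch.randn(n_img, head_dim)    # rotated image keys keep their content
v      = torch.randn(n_text + n_img, head_dim)

logits = q_img @ torch.cat([k_text, k_img]).t() / head_dim ** 0.5
attn = F.softmax(logits, dim=-1)
print(attn[0, :n_text])  # identical weights for every text token, whatever the query is
# The text contributes only a uniform average of its value rows:
out_from_text = attn[0, :n_text] @ v[:n_text]
```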

zxduan90 commented 3 months ago

The text tokens have less relevance to the image compared to the Stable Diffusion approach.

daiyixiang666 commented 3 months ago

And would that also mean the text attention mask is useless?