Leedonus opened this issue 3 months ago
The PE is trying to tell the model about relative positions within the image; for example, a pixel should have a stronger relation with nearby pixels. But here we do not know the relative relation between the image latent patches and the sentence.
@daiyixiang666 But these zeros will make the q and k of the text tokens zero, according to https://github.com/FoundationVision/LlamaGen/blob/ce98ec41803a74a90ce68c40ababa9eaeffeb4ec/autoregressive/models/gpt.py#L220, so the attention scores between the output tokens and every text token will be the same.
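To make this concrete, here is a minimal, self-contained sketch of the effect (this is a simplified complex-valued rotary application, not the exact code in `gpt.py`, and `cls_token_num`, the shapes, and the random frequencies are purely illustrative assumptions): zeroing the text positions of `freqs_cis` zeros the rotated q/k at those positions, so every query's logits against the text keys are identical (zero), and the text can only contribute through the values.

```python
import torch

def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # Rotate q or k by the (complex) rotary frequencies.
    # x: (bsz, seq_len, n_head, head_dim); freqs_cis: (seq_len, head_dim // 2), complex.
    x_ = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs = freqs_cis.view(1, x_.shape[1], 1, x_.shape[-1])
    return torch.view_as_real(x_ * freqs).flatten(3).type_as(x)

bsz, seq_len, n_head, head_dim = 1, 8, 2, 16
cls_token_num = 4  # assumption: the first 4 positions are the text (condition) tokens

q = torch.randn(bsz, seq_len, n_head, head_dim)
k = torch.randn(bsz, seq_len, n_head, head_dim)

# Rotary frequencies, with the text positions zeroed out.
freqs_cis = torch.polar(torch.ones(seq_len, head_dim // 2),
                        torch.randn(seq_len, head_dim // 2))
freqs_cis[:cls_token_num] = 0

q_rot = apply_rotary_emb(q, freqs_cis)
k_rot = apply_rotary_emb(k, freqs_cis)

# The keys (and queries) of the text positions are now exactly zero ...
print(k_rot[:, :cls_token_num].abs().max())           # tensor(0.)

# ... so any query's logits against the text keys are all zero, i.e. identical
# across text tokens; after softmax they receive equal weight, and the text can
# only influence the output through the values (xv).
attn_logits = torch.einsum('bqhd,bkhd->bhqk', q_rot, k_rot)
print(attn_logits[..., :cls_token_num].abs().max())   # tensor(0.)
```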
Oh, I see, so the text only takes effect via xv?
@daiyixiang666 Yes, I think there is a problem with it.
The text tokens have less relevance to the image compared to the Stable Diffusion approach.
And does that also mean the text attention mask is useless?
Hi, thanks for the interesting work. I want to know why the PE of the text tokens is set to zero in the generation process.