FoundationVision / VAR

[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
MIT License

Question on autoregressive_infer_cfg #56

Closed · sparse-mvs-2 closed this issue 6 months ago

sparse-mvs-2 commented 6 months ago

Hi, nice work! I have a detailed question about autoregressive inference and training. The input tokens to VAR are not the usual ones: I see in the code that you use the upsampled features from the previous level of token embeddings,

```python
f_hat, next_token_map = self.vae_quant_proxy[0].get_next_autoregressive_input(si, len(self.patch_nums), f_hat, h_BChw)
```

which is great. But during inference, the next scale's input is built as

```python
next_token_map = self.word_embed(next_token_map) + lvl_pos[:, cur_L:cur_L + self.patch_nums[si+1] ** 2]
```

Why does this use only the feature pyramid from level n-1, instead of all the previous feature pyramids from levels 1~n-1? This seems strange, since the attention mask in training only masks outputs from levels n+1 to N, which means that in training each level uses the entire feature pyramid from levels 1 to n-1.
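For reference, this is my understanding of the training-time mask, written as a toy sketch to check my reading (not code from this repo; `patch_nums` is shortened here just for illustration):

```python
import torch

# Toy scale sizes; the actual VAR configs use more and larger scales.
patch_nums = (1, 2, 3)

# Scale index of every token position in the flattened multi-scale sequence.
scale_id = torch.cat([torch.full((pn * pn,), si) for si, pn in enumerate(patch_nums)])

# Block-wise "causal" mask across scales: a token at scale n may attend to
# every token at scales <= n (its own scale and all coarser ones), but never
# to tokens at scales > n.
attn_mask = scale_id.unsqueeze(0) <= scale_id.unsqueeze(1)  # (L, L), True = allowed
print(attn_mask.int())
```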

Looking forward to your answer ^_^

sen-ye commented 6 months ago

Because the KV cache is enabled during inference, the attention actually uses all features from levels 1~n-1.
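Roughly (a simplified sketch, not the actual attention code in this repo): the keys/values of every already-decoded scale stay in the cache, so at scale n the queries come only from the new token map, but attention still runs over all tokens from scales 1~n-1:

```python
import torch
import torch.nn.functional as F

def attend_with_kv_cache(q, k_new, v_new, cache):
    """Single-head attention with a growing KV cache (simplified sketch).

    q, k_new, v_new: (B, tokens_in_current_scale, C)
    cache: dict with keys "k" and "v" holding tensors from earlier scales (or None).
    """
    cache["k"] = k_new if cache["k"] is None else torch.cat([cache["k"], k_new], dim=1)
    cache["v"] = v_new if cache["v"] is None else torch.cat([cache["v"], v_new], dim=1)
    # Queries cover only the current scale, but keys/values span every scale
    # decoded so far, so the new tokens still attend to levels 1..n-1.
    return F.scaled_dot_product_attention(q, cache["k"], cache["v"])

# Usage sketch: feed one scale at a time, as the inference loop does per step.
B, C = 2, 64
cache = {"k": None, "v": None}
for pn in (1, 2, 3):                       # toy patch_nums
    x = torch.randn(B, pn * pn, C)         # stand-in for the embedded next_token_map
    out = attend_with_kv_cache(x, x, x, cache)
    print(out.shape, cache["k"].shape)     # cache length grows scale by scale
```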

sparse-mvs-2 commented 6 months ago

Got it! Thanks for the reply.