Hi, nice work there! I have a detailed question about the autoregressive inference and training. The input tokens of VAR are not the usual discrete tokens: I see in the code that you build them from the upsampled features of the previous level's token embeddings:
f_hat, next_token_map = self.vae_quant_proxy[0].get_next_autoregressive_input(si, len(self.patch_nums), f_hat, h_BChw)
This is great.
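For context, this is roughly how I understand that call (a sketch under my own assumptions: bicubic/area interpolation, no extra residual conv, and illustrative names rather than the repo's exact code):

import torch
import torch.nn.functional as F

def next_ar_input_sketch(si, patch_nums, f_hat, h_BChw):
    # Accumulate the level-si quantized features into f_hat (kept at the final
    # resolution) and build the map fed to the transformer for level si+1.
    hw_final = patch_nums[-1]
    f_hat = f_hat + F.interpolate(h_BChw, size=(hw_final, hw_final), mode='bicubic')
    if si == len(patch_nums) - 1:
        return f_hat, f_hat  # last level: nothing left to predict
    # Downsample the accumulated map to the NEXT level's resolution;
    # this single (si+1)-sized map is what becomes next_token_map.
    hw_next = patch_nums[si + 1]
    next_token_map = F.interpolate(f_hat, size=(hw_next, hw_next), mode='area')
    return f_hat, next_token_map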
But during inference,
next_token_map = self.word_embed(next_token_map) + lvl_pos[:, cur_L:cur_L + self.patch_nums[si+1] ** 2]
why do you use only the feature pyramid at level n-1, rather than all the previous feature pyramids at token levels 1~n-1?
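To spell out what I mean in terms of shapes, here is my toy reading of that inference line (made-up dimensions; word_embed and lvl_pos below are stand-ins for the repo's modules, not the actual ones):

import torch
import torch.nn as nn

B, Cvae, D = 2, 32, 1024
patch_nums = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
total_L = sum(pn * pn for pn in patch_nums)
word_embed = nn.Linear(Cvae, D)        # stand-in for self.word_embed
lvl_pos = torch.zeros(1, total_L, D)   # stand-in for the level + position embedding table

si = 3                                              # just finished predicting level si
cur_L = sum(pn * pn for pn in patch_nums[:si + 1])  # tokens emitted so far
hw_next = patch_nums[si + 1]
# only the single map produced at level si (resized to the si+1 grid) enters here:
next_token_map = torch.randn(B, Cvae, hw_next, hw_next)
x = next_token_map.flatten(2).transpose(1, 2)       # (B, hw_next**2, Cvae)
x = word_embed(x) + lvl_pos[:, cur_L:cur_L + hw_next ** 2]
# x holds only patch_nums[si+1]**2 new tokens; nothing from levels 1..si-1 is re-fed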
This seems strange, because the attention mask used in training only masks out the outputs from levels n+1 to N; that means that in training, the prediction at level n can use all the feature pyramids from levels 1 to n-1.
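To make the contrast concrete, this is the training-time mask as I understand it (a sketch of a block-wise causal mask over the concatenated scales; my reconstruction, not the repo's exact code):

import torch

patch_nums = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
# level index of every token in the concatenated multi-scale sequence
d = torch.cat([torch.full((pn * pn,), i) for i, pn in enumerate(patch_nums)])
L = d.numel()
# query q may attend to key k iff level(k) <= level(q), so when predicting
# level n, ALL tokens from levels 1..n-1 are visible; only levels n+1..N are masked
attn_mask = d.view(1, L) <= d.view(L, 1)   # (L, L) boolean, True = may attend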
Hoping for your answer ^_^