Hello, I have a question about image reconstruction via VAR.
I want the transformer to predict the ground-truth tokens, just as in the training setup: obtain the image tokens through the VQ encoder, interpolate the tokens across scales, and then feed them into the transformer (similar to inversion in diffusion models).
However, when I set this up in code, the result differs from the original image.
Could I have missed something, or is this approach not feasible?
Here are the original image and the reconstructed image.
original image
recon image
And here's my code.
`tr_input_embed` has shape [B, 679, 32], and I implemented this inside the VAR class.
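For clarity, here is a minimal sketch of the pipeline I mean, mirroring the training loop; the method names (`img_to_idxBl`, `idxBl_to_var_input`, `idxBl_to_img`) and the greedy argmax decode are my assumptions from reading the repo, so the real API may differ slightly.

```python
# Minimal sketch of teacher-forced "reconstruction" through the VAR transformer.
# NOTE: method names below are assumptions based on my reading of
# models/vqvae.py and models/var.py in this repo; please correct me if wrong.
import torch

@torch.no_grad()
def teacher_forced_recon(vae, var, img_B3HW, label_B,
                         patch_nums=(1, 2, 3, 4, 5, 6, 8, 10, 13, 16)):
    # 1) Encode the image into ground-truth multi-scale token indices:
    #    a list with one [B, pn*pn] tensor per scale.
    gt_idx_Bl = vae.img_to_idxBl(img_B3HW)

    # 2) Build the training-style input: interpolated token embeddings for
    #    every scale except the first, shape [B, 679, Cvae=32].
    x_BLCv_wo_first_l = vae.quantize.idxBl_to_var_input(gt_idx_Bl)

    # 3) One teacher-forced forward pass, as in training; the model prepends
    #    the class/SOS token, so logits cover all 680 positions: [B, 680, V].
    logits_BLV = var(label_B, x_BLCv_wo_first_l)

    # 4) Greedy-decode each position and split the flat sequence back into
    #    per-scale index lists.
    pred_idx_BL = logits_BLV.argmax(dim=-1)
    pred_idx_Bl = list(pred_idx_BL.split([pn * pn for pn in patch_nums], dim=1))

    # 5) Decode the predicted indices back to pixels.
    return vae.idxBl_to_img(pred_idx_Bl, same_shape=True, last_one=True)
```

Step 4 uses argmax rather than sampling on purpose, to stay as close as possible to the ground-truth tokens.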
Again, thanks for your great work!