FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!

Why do multi-scale features partially share a convolution network via PhiPartiallyShared? #73

Open sunset-clouds opened 3 months ago

sunset-clouds commented 3 months ago

VAR is indeed impressive, but there is one point that has been bothering me, so I am reaching out to the authors for help. Thank you in advance for your assistance.

In `quant.py`, line 33:

```python
self.quant_resi = PhiPartiallyShared(nn.ModuleList([(Phi(Cvae, quant_resi) if abs(quant_resi) > 1e-6 else nn.Identity()) for _ in range(share_quant_resi)]))
```
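To make `share_quant_resi` concrete, here is a minimal sketch (my own illustration, not the repo's exact code) of how 4 shared phi modules could be assigned to 10 scales: each scale's relative position in $[0, 1]$ is snapped to the nearest of 4 evenly spaced ticks, and the exact tick spacing in `PhiPartiallyShared` may differ.

```python
import numpy as np

# Illustrative sketch: K shared phi modules assigned to num_scales scales.
# Each scale's relative position in [0, 1] is snapped to the nearest tick,
# and all scales that land on the same tick share the same phi.
K, num_scales = 4, 10
ticks = np.linspace(1 / (2 * K), 1 - 1 / (2 * K), K)  # one tick per shared phi

for k in range(num_scales):
    pos = k / (num_scales - 1)                    # scale index mapped to [0, 1]
    phi_idx = int(np.argmin(np.abs(ticks - pos)))
    print(f"scale {k + 1} -> phi_{phi_idx}")
```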

According to my understanding, `self.quant_resi` is the $\phi_k(\cdot)$ function. There are 4 distinct $\phi_k(\cdot)$, and some scales share the same one, for example: $\phi_1(\cdot) = \phi_2(\cdot)$, $\phi_3(\cdot) = \phi_4(\cdot) = \phi_5(\cdot)$, $\phi_6(\cdot) = \phi_7(\cdot)$, and $\phi_8(\cdot) = \phi_9(\cdot) = \phi_{10}(\cdot)$. I have two questions:

1. Why do we need to introduce $\phi_k(\cdot)$ at all? It feels somewhat counterintuitive: RQ-VAE adopts $f = f - z_k$ rather than $f = f - \phi_k(z_k)$. What is the true role of $\phi_k(\cdot)$?
2. Why do different scales share the same $\phi_k(\cdot)$, e.g., $\phi_1(\cdot) = \phi_2(\cdot)$ and $\phi_3(\cdot) = \phi_4(\cdot) = \phi_5(\cdot)$?
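For context, here is my own hedged sketch of the multi-scale residual quantization loop as I understand it, just to show where $\phi_k(\cdot)$ sits; `quantize` and `phis` are hypothetical stand-ins, not the repo's actual API.

```python
import torch
import torch.nn.functional as F

# Sketch of multi-scale residual quantization with a per-scale phi_k.
# quantize: maps a feature map to its nearest codebook entries (stand-in).
# phis: small conv modules, one (possibly shared) per scale (stand-in).
def multi_scale_quantize(f, scales, phis, quantize):
    B, C, H, W = f.shape
    f_hat = torch.zeros_like(f)
    for k, (h, w) in enumerate(scales):                    # coarse -> fine
        r_k = F.interpolate(f, size=(h, w), mode='area')   # residual at scale k
        z_k = quantize(r_k)                                # quantized tokens
        z_up = F.interpolate(z_k, size=(H, W), mode='bicubic')
        # phi_k repairs the blur/aliasing that the down- and upsampling
        # introduce; RQ-VAE has no resampling, so it subtracts z_k directly.
        contrib = phis[k](z_up)
        f = f - contrib          # f = f - phi_k(z_k), cf. RQ-VAE's f = f - z_k
        f_hat = f_hat + contrib  # running reconstruction
    return f_hat
```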

eyedealism commented 2 months ago

The paper says it is there to address the information loss in upscaling. My guess is that it acts like the decoder part of a U-Net, producing a smoother map.
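In that spirit, a minimal sketch of what such a $\phi_k$ module could look like: a single 3x3 conv whose output is blended with its input by a residual ratio, so it only has to learn a small smoothing correction on top of the upscaled token map. The kernel size and blend ratio here are my assumptions, not necessarily the repo's values.

```python
import torch.nn as nn

class PhiSketch(nn.Module):
    """Illustrative phi: learned smoothing of an upscaled quantized map."""
    def __init__(self, channels: int, resi_ratio: float = 0.5):
        super().__init__()
        # 3x3 conv and 0.5 blend ratio are assumptions for illustration
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.resi_ratio = resi_ratio

    def forward(self, h):
        # mostly pass the upscaled map through, plus a learned correction
        return h * (1 - self.resi_ratio) + self.conv(h) * self.resi_ratio
```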