FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!

Actual Training Loss Curve #23

Closed: chenllliang closed this issue 2 months ago

chenllliang commented 2 months ago

In the VAR model, a standard cross-entropy loss is employed, yet Figure 5 illustrates the scaling law on a modified loss. Could you provide the precise equation detailing how the actual training and test losses are converted to this reduced form? Additionally, is there a training log available that records the actual cross-entropy loss values over the course of training? Access to this information would greatly help readers understand the training dynamics of the VAR model.

Thanks a lot!

keyu-tian commented 2 months ago

hi @chenllliang, following previous literature on scaling laws in NLP, we split the cross-entropy loss (on the test set) into two parts: CE(ground_truth, prediction) = E(ground_truth) + D_KL(ground_truth || prediction). E(ground_truth) is the so-called irreducible loss, and D_KL(ground_truth || prediction) is the reducible loss. Figure 5 plots the reducible loss, which is just CE minus E(ground_truth). We obtain E(ground_truth) from the ground-truth token distribution of a trained tokenizer: it is 4.5 for the last scale and 5.1 averaged over all scales.
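For concreteness, a minimal sketch of that conversion (the two constants are the values quoted above; everything else is purely illustrative, not the official evaluation script):

```python
# Reducible loss as plotted in Figure 5:
#   CE(gt, pred) = E(gt) + D_KL(gt || pred)  =>  reducible = CE - E(gt)
IRREDUCIBLE_LAST_SCALE = 4.5   # E(ground_truth) for the last (finest) scale
IRREDUCIBLE_ALL_SCALES = 5.1   # E(ground_truth) averaged over all scales

def reducible_loss(test_cross_entropy: float, last_scale_only: bool = True) -> float:
    """Convert a measured test cross-entropy into the reducible loss."""
    irreducible = IRREDUCIBLE_LAST_SCALE if last_scale_only else IRREDUCIBLE_ALL_SCALES
    return test_cross_entropy - irreducible
```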

I'll post some training logs soon, please stay tuned.

chenllliang commented 2 months ago

Thanks for the information! Could you share some math or code for calculating E(ground_truth) from the ground-truth token distribution? Is it the next-token prediction entropy?

keyu-tian commented 2 months ago

It's obtained by forwarding the VQVAE on every ImageNet image, getting the probability of each word in the VQ vocabulary, and calculating the entropy of that distribution.
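If it helps, here is a rough sketch of one way such an entropy could be estimated from the marginal codebook usage over the dataset. The tokenizer call (`img_to_idx`), the loader, and the vocabulary size argument are placeholders for illustration, not the repo's actual API:

```python
import torch

@torch.no_grad()
def estimate_ground_truth_entropy(vqvae, dataloader, vocab_size: int, device="cuda") -> float:
    """Estimate E(ground_truth) as the entropy (in nats) of the empirical
    distribution of VQ codebook indices produced by the trained tokenizer."""
    counts = torch.zeros(vocab_size, dtype=torch.float64, device=device)
    for images, _ in dataloader:
        idx = vqvae.img_to_idx(images.to(device))  # placeholder: image -> token indices
        counts += torch.bincount(idx.flatten(), minlength=vocab_size).double()
    probs = counts / counts.sum()
    probs = probs[probs > 0]                       # drop unused codes to avoid log(0)
    return float(-(probs * probs.log()).sum())     # entropy in nats
```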