hi @chenllliang, following previous literature on scaling laws in NLP, we split the cross-entropy loss (on the test set) into two parts: CE(ground_truth, prediction) = E(ground_truth) + D_KL(ground_truth || prediction). E(ground_truth) is the so-called irreducible loss, and D_KL(ground_truth || prediction) is the reducible loss. Figure 5 plots the reducible loss, which is just CE minus E(ground_truth). We get E(ground_truth) from the ground-truth token distribution of a trained tokenizer. It's 4.5 for the last scale, and 5.1 averaged over all scales.
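A minimal sketch of that decomposition, assuming the batch-averaged cross-entropy over ImageNet tokens and a precomputed scalar E(ground_truth); the names here are my own illustration, not the actual VAR training code:

```python
import torch
import torch.nn.functional as F

# logits    : model predictions over the VQ vocabulary, shape (N, V)
# targets   : ground-truth token indices, shape (N,)
# gt_entropy: E(ground_truth), entropy of the empirical token distribution (scalar, in nats)
def reducible_loss(logits: torch.Tensor, targets: torch.Tensor, gt_entropy: float) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets)  # CE(ground_truth, prediction), averaged over tokens
    return ce - gt_entropy                 # reducible part, i.e. D_KL(ground_truth || prediction)
```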
I'll post some training logs soon; please stay tuned.
Thanks for the information! Could you share some math or code for computing E(ground_truth) from the ground-truth token distribution? Is it a next-token prediction entropy?
It's obtained by forwarding the VQVAE on every ImageNet image, collecting the frequencies of each word in the VQ vocabulary, and computing the entropy of that distribution.
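Something along these lines, I assume; `vqvae.encode` and the dataloader interface below are placeholders for illustration, not the actual repo API:

```python
import torch

@torch.no_grad()
def ground_truth_entropy(vqvae, dataloader, vocab_size: int, device: str = "cuda") -> torch.Tensor:
    # Accumulate how often each VQ codeword appears across the dataset.
    counts = torch.zeros(vocab_size, device=device)
    for images, _ in dataloader:
        tokens = vqvae.encode(images.to(device))  # token indices into the VQ vocabulary
        counts += torch.bincount(tokens.flatten(), minlength=vocab_size).float()
    probs = counts / counts.sum()
    probs = probs[probs > 0]                # drop unused codewords to avoid log(0)
    return -(probs * probs.log()).sum()     # E(ground_truth), entropy in nats
```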
In the VAR model, a standard cross-entropy loss is employed, yet Figure 5 illustrates the scaling law on a modified loss. Could you provide the precise equation showing how the actual training and test losses are converted to this reduced form? Additionally, is there a training log available that records the actual cross-entropy loss values over the course of training? Access to this information would greatly help readers understand the training dynamics of the VAR model.
Thanks a lot!