Open varadgunjal opened 2 years ago
I understand from #258 that there is randomness in the generated VQGAN code sequences because of the Gumbel-Softmax sampling, but the different sequences nevertheless reconstruct to similar-looking images. However, since training is done by predicting the sequence tokens rather than by comparing the reconstructed images themselves, I am wondering whether, and how, having different token sequences affects pretraining and downstream performance. Was this investigated to check that performance is consistent across different variations of the generated code sequences?
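For concreteness, here is a minimal, self-contained sketch (not this repo's actual code; the shapes and logits are made up) of why Gumbel-Softmax quantization can map the same encoder output to different token sequences:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_tokens, codebook_size = 256, 1024            # e.g. a 16x16 grid of codes
logits = torch.randn(num_tokens, codebook_size)  # stand-in for the encoder's codebook logits

def sample_codes(logits, tau=1.0):
    # hard=True returns one-hot samples (straight-through), so each call
    # draws a fresh assignment instead of taking a deterministic argmax
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot.argmax(dim=-1)                # token ids, shape [num_tokens]

codes_a = sample_codes(logits)
codes_b = sample_codes(logits)                   # "encode" the same image again
agreement = (codes_a == codes_b).float().mean().item()
print(f"token agreement between two encodings: {agreement:.2%}")
```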
A good question. In our preliminary experiments, we found that using different sequences can slightly improve model performance; it seems that the randomness in the VQGAN encoding process acts as a form of data augmentation or label smoothing. However, we didn't conduct a more in-depth quantitative study.
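If one wanted to lean into that interpretation, a dataset wrapper along these lines would re-tokenize images at load time so the target sequence varies slightly between epochs. This is only a sketch; `vqgan_encode_to_ids` is a hypothetical stand-in for the stochastic tokenizer, not a function from this repo:

```python
from torch.utils.data import Dataset

class OnTheFlyTokenDataset(Dataset):
    """Re-tokenizes each image on every access, so the target token
    sequence the model is asked to predict varies slightly across epochs."""

    def __init__(self, images, vqgan_encode_to_ids):
        self.images = images
        self.encode = vqgan_encode_to_ids   # stochastic image -> token ids

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        target_ids = self.encode(image)     # resampled on every epoch/access
        return image, target_ids
```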
I see. So what you're saying is that there is some value in using multiple (slightly different) sequences to represent the same image, which could be interpreted as data augmentation on the sequences used for the Image Infilling task. Interesting take. I would like to try to explore this further.
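One simple way to start (just a sketch under the assumption of some stochastic `encode_to_ids` tokenizer returning token-id tensors, not an established protocol) would be to encode each image several times and measure how much the token sequences actually agree, before running any downstream ablation:

```python
import itertools

def mean_token_agreement(image, encode_to_ids, k=5):
    # Encode the same image k times with the stochastic tokenizer and
    # report the average pairwise fraction of positions that match.
    seqs = [encode_to_ids(image) for _ in range(k)]
    rates = [(a == b).float().mean().item()
             for a, b in itertools.combinations(seqs, 2)]
    return sum(rates) / len(rates)
```

A low agreement rate paired with unchanged downstream metrics would support the augmentation/label-smoothing reading, while a drop in performance would suggest pretraining is sensitive to the specific sampled sequence.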