TinyTigerPan opened this issue 3 days ago
Hi,
For face/text regions, I would expect the reconstruction quality to be bad, mainly because the currently released model was trained on the ImageNet dataset, which does not include human faces or text in its classes (in other words, they are outside the ImageNet data distribution). If you play with MaskGIT-VQGAN or Taming-VQGAN trained on ImageNet, you will see that face reconstructions are very bad as well (for example, in Fig. 5 of LFQ the human faces are not of good quality).
As a comparison, SD-VAE was trained on OpenImages, LAION-Aesthetics, and LAION-Humans, which definitely gives it better results on more general images.
Thanks for your reply.
Hi, thanks for your great work and model. I want to discuss the reconstruction quality of the released TiTok. I found the reconstruction visualization is not good; see the following picture. In areas with text, human faces, and so on, there seem to be a lot of artifacts.
I know that because the compression rate is high [(256, 256, 3) to (16, 12)], we can't expect such a high compression rate to give results similar to the SDXL VAE. But I still want to ask: if I increase the number and channel dimension of the latent tokens, will it get better? With the same compression rate as the SDXL VAE, for example a latent size of (32*32, 16), do you think TiTok would be better than the SDXL VAE? And have you run any experiments on this?
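For concreteness, the two compression rates quoted above can be compared with a quick calculation (a sketch; the latent shapes are just the ones mentioned in this thread, not values taken from any released config):

```python
from math import prod

def compression_ratio(input_shape, latent_shape):
    """Ratio of the number of input elements to latent elements."""
    return prod(input_shape) / prod(latent_shape)

img = (256, 256, 3)  # 196,608 values per image

# TiTok latent as described above: 16 tokens x 12 channels
titok_ratio = compression_ratio(img, (16, 12))        # -> 1024.0

# SDXL-VAE-like latent as described above: 32*32 tokens x 16 channels
sdxl_ratio = compression_ratio(img, (32 * 32, 16))    # -> 12.0

print(f"TiTok-style: {titok_ratio:.0f}x, SDXL-style: {sdxl_ratio:.0f}x")
```

So the TiTok latent discussed here compresses roughly 85x more aggressively than the SDXL-style latent, which puts the artifact comparison in perspective.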
Thanks again, and looking forward to communicating with you. 🤗