bytedance / 1d-tokenizer

This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation
Apache License 2.0

Reconstruction quality of released TiTok #10

Open TinyTigerPan opened 3 days ago

TinyTigerPan commented 3 days ago

Hi, thanks for your great work and model. I'd like to discuss the reconstruction quality of the released TiTok. I found the reconstruction visualization is not very good; see the attached image (test_2). In areas such as text and human faces, there seem to be a lot of artifacts. I understand this is because the compression rate is high [(256, 256, 3) down to (16, 12)], so we can't expect results comparable to the SDXL VAE at such a high compression rate. Still, I'd like to ask: if I increase the number and channel dimension of the latent tokens, will the quality improve? At the same compression rate as SDXL, for example a latent size of (32*32, 16), do you think TiTok would be better than the SDXL VAE? Have you run any experiments on this? Thanks again, and I look forward to communicating with you. 🤗
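To make the trade-off in the question concrete, here is a small sketch that computes the compression ratios implied by the latent shapes quoted above. The shapes are the ones mentioned in this thread, not official specs, and the helper function is purely illustrative:

```python
# Rough compression-ratio comparison using the latent shapes quoted
# in this thread (commenter's numbers, not official model specs).

def compression_ratio(image_shape, latent_elems):
    """Ratio of raw image elements to latent elements."""
    h, w, c = image_shape
    return (h * w * c) / latent_elems

image = (256, 256, 3)  # 196,608 values

# Released TiTok latent as quoted above: 16 tokens x 12 channels
titok = compression_ratio(image, 16 * 12)        # 1024x

# Hypothetical larger latent at an SD-VAE-like rate: 32*32 tokens x 16 channels
larger = compression_ratio(image, 32 * 32 * 16)  # 12x

print(f"TiTok (16 x 12):     {titok:.0f}x")
print(f"Larger (32*32, 16):  {larger:.0f}x")
```

The gap (1024x versus 12x) is why a direct quality comparison with the SDXL VAE at the released latent size is not apples-to-apples.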

cornettoyu commented 2 days ago

Hi,

For face/text regions I would expect the reconstruction quality to be poor, mainly because the currently released model was trained on the ImageNet dataset, whose classes do not include humans or text (in other words, they are out of the ImageNet data distribution). If you play with MaskGIT-VQGAN or Taming-VQGAN trained on ImageNet, you will see that face reconstructions are very bad as well (for example, in Fig. 5 of LFQ the human faces are not reconstructed in good quality).

As a comparison, SD-VAE was trained on OpenImages, LAION-Aesthetics and LAION-Humans, which definitely gives it better results on more general images.

TinyTigerPan commented 2 days ago

Thanks for your reply.