bytedance / 1d-tokenizer

This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation

512 image decoder results are extremely bad #3

Open lucasjinreal opened 1 week ago

lucasjinreal commented 1 week ago

Got an image like this

image

Is there a way to reconstruct it?

Currently only got:

image
juncongmoo commented 1 week ago

This model is trained on ImageNet, so it is not intended for text-rich, document-like images.

cornettoyu commented 1 week ago

> Got an image like this image
>
> Is there a way to reconstruct it?
>
> Currently only got:
>
> image

Hi, as we have currently only released TiTok at resolution 256, may I ask how you applied it to your image? A simple resize/crop should be fine. If you want to apply the currently released model to a higher-resolution image, it could be challenging, as ViT itself is known not to generalize well to different resolutions. Some solutions for handling arbitrary resolutions/aspect ratios with ViT do exist (though they mostly require re-training), but they are beyond the scope of this project. I attached them for reference in case you are interested.

https://arxiv.org/abs/2307.06304 https://huggingface.co/adept/fuyu-8b
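For concreteness, here is a minimal sketch of the resize/crop preprocessing in PyTorch. The file path is hypothetical, and the actual encode/decode call is omitted since it depends on the API of the released checkpoint:

```python
from PIL import Image
from torchvision import transforms

# The released TiTok checkpoint operates at 256x256, so a larger image
# (e.g. a 512px document scan) has to be downsampled first: resize the
# shorter side to 256 and center-crop.
preprocess = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(256),
    transforms.ToTensor(),  # float tensor in [0, 1], shape (3, 256, 256)
])

image = Image.open("input_512.png").convert("RGB")  # hypothetical path
x = preprocess(image).unsqueeze(0)                  # (1, 3, 256, 256) batch

# x can now be passed to the released 256-resolution tokenizer; the exact
# encode/decode function names depend on the repo's API, so they are not
# shown here.
```

Note that any fine detail (e.g. small text) that does not survive this downsampling cannot be recovered by the decoder.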

lucasjinreal commented 1 week ago

@cornettoyu Input size is really my concern. A ViT-based tokenizer requires the input to be resized, and the output becomes very blurred when resized back. I tried Open-MAGVIT2, which was trained at 128: it encoded this image perfectly and decoded it back at the original resolution flawlessly.

I simply resize the input image and feed it into the model.

Yes, you are right. I am afraid that using a ViT as an image tokenizer currently isn't a good choice, since it is always constrained by the input size. Open-MAGVIT2, on the other hand, has no problem with this. However, its issue is that it produces too many tokens...
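To see how much the resize alone costs, independent of any tokenizer, here is a small round-trip check (the file name is hypothetical and a 512x512 input is assumed):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Downscale a 512px image to 256 and upsample it back, then measure PSNR.
# Fine structures such as small text are already lost in this round trip,
# so a 256-resolution tokenizer cannot reconstruct them regardless of quality.
img = to_tensor(Image.open("document_512.png").convert("RGB")).unsqueeze(0)  # (1, 3, 512, 512)

down = F.interpolate(img, size=256, mode="bilinear", align_corners=False, antialias=True)
up = F.interpolate(down, size=512, mode="bilinear", align_corners=False)

mse = F.mse_loss(up, img)
psnr = -10.0 * torch.log10(mse)
print(f"Round-trip PSNR after 512 -> 256 -> 512 resize: {psnr.item():.2f} dB")
```

Text-heavy images lose a lot of high-frequency detail in this step, which is where most of the blur comes from.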

XPLearner commented 5 days ago

I also got bad results.