Open lucasjinreal opened 1 week ago
This model was trained on ImageNet, so it is not suited to text-rich, document-like images.
Got an image like this
Is there a way to reconstruct it?
Currently only got:
Hi, so far we have only released TiTok at resolution 256, so may I ask how you applied it to your image? A simple resize/crop should be fine. Applying the currently released model to a larger-resolution image could be challenging, since ViT itself is known not to generalize well across resolutions. Some solutions for running ViT at arbitrary resolution/aspect ratio do exist (most of them require re-training), but they are beyond this project's scope. I have attached them for reference if you are interested.
https://arxiv.org/abs/2307.06304 https://huggingface.co/adept/fuyu-8b
@cornettoyu Input size is really my concern. A ViT-based tokenizer requires the input to be resized, and the output is very blurry when it is resized back. I tried Open-MAGVIT2, which was trained at 128: it encoded this image perfectly and decoded it at the original resolution flawlessly.
I simply resize the input image and feed it into the model.
Yes, you are right. For now, I am afraid a ViT is not a good choice as an image tokenizer, since it is always constrained by input size. Open-MAGVIT2, on the other hand, has no problem with that. Its issue, however, is that it produces too many tokens...
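For reference, here is a minimal sketch of the geometry behind the "simple resize/crop" mentioned above: shorter-side resize to 256, then a center crop. It is only an illustration, not the preprocessing used by this repo. The function name `resize_then_center_crop` and the 1024x768 page with 12 px glyphs are hypothetical.

```python
def resize_then_center_crop(width, height, target=256):
    """Plan a shorter-side resize to `target`, then a center crop to target x target.

    Returns (resized_size, crop_box); crop_box is (left, upper, right, lower).
    Hypothetical helper for illustration only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    short = min(width, height)
    # Integer round-half-up so the shorter side lands exactly on `target`.
    new_w = max(target, (width * target + short // 2) // short)
    new_h = max(target, (height * target + short // 2) // short)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# Assumed example: a 1024x768 document scan with 12 px tall glyphs.
size, box = resize_then_center_crop(1024, 768)
glyph_px_after = 12 * 256 / 768  # glyph height once the page is scaled down
print(size, box, glyph_px_after)
```

With Pillow, `size` could be passed to `Image.resize` and `box` to `Image.crop`. In this assumed example the glyphs shrink from 12 px to about 4 px before the tokenizer sees them, so resizing the reconstruction back up cannot recover the text, regardless of the tokenizer's quality.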
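To put rough numbers on the token-count trade-off, here is a small sketch. It assumes a convolutional tokenizer that emits one token per 16x16 patch (the factor 16 is an assumption, check the Open-MAGVIT2 config you actually use) and compares it with a fixed-length 1D tokenizer such as the 32-token TiTok variant at 256x256. The function name `grid_token_count` is hypothetical.

```python
def grid_token_count(width, height, downsample=16):
    """Tokens emitted by a 2D grid tokenizer with the given spatial downsampling.

    `downsample=16` is an assumed value, not read from any released config.
    """
    if width % downsample or height % downsample:
        raise ValueError("width and height must be multiples of the downsample factor")
    return (width // downsample) * (height // downsample)


FIXED_1D_TOKENS = 32  # assumed: the 32-token TiTok variant, fixed at 256x256 input

for w, h in [(128, 128), (256, 256), (1024, 768)]:
    print(f"{w}x{h}: grid tokens = {grid_token_count(w, h)}, fixed 1D tokens = {FIXED_1D_TOKENS}")
```

Under these assumptions the grid tokenizer's token count grows with image area (3072 tokens for a 1024x768 page), while the 1D tokenizer stays at a constant length but only accepts its training resolution.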
bad result also