OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
Apache License 2.0

More details about VQModel used in OFA? #396

Closed · YAOYI626 closed this 1 year ago

YAOYI626 commented 1 year ago

Hi team,

Thanks for the really amazing work on OFA! I'd like to know more about the VQ model used in OFA.

Is the same VQ model shared across different tasks, such as captioning and generation? And how is the VQ model trained? @logicwong @JustinLin610

Thanks, Xiaoyi

logicwong commented 1 year ago

@YAOYI626 Thanks for your interest.

  1. The VQ model is used exclusively for image infilling and image generation. We discretize the raw image into a sequence of codes with the VQ model, and OFA learns to generate those codes conditioned on the text description or the masked image (see the sketch after this list).
  2. For other tasks, like image captioning, we directly embed raw images into vectors via ResNet.
  3. We utilize the pre-trained VQ model from here.
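
For readers unfamiliar with the discretization in point 1, below is a minimal sketch of the nearest-codebook lookup a VQ model performs. The codebook size, embedding dimension, and random tensors are stand-ins for illustration, not the actual pre-trained VQ model OFA uses:

```python
import torch

# A 256x256 image at f8 compression yields a 32x32 latent grid, i.e. 1024 codes.
codebook_size = 8192   # hypothetical vocabulary size, not OFA's actual value
embed_dim = 256        # hypothetical latent channel dimension

# In a real VQ model the codebook is learned; here it is random for illustration.
codebook = torch.randn(codebook_size, embed_dim)

def quantize(latents: torch.Tensor) -> torch.Tensor:
    """Map an encoder feature map (B, C, H, W) to discrete code indices (B, H*W)."""
    b, c, h, w = latents.shape
    flat = latents.permute(0, 2, 3, 1).reshape(-1, c)  # (B*H*W, C)
    dists = torch.cdist(flat, codebook)                # distance to every codebook entry
    indices = dists.argmin(dim=1)                      # nearest entry per position
    return indices.view(b, h * w)                      # flatten the grid into a code sequence

# Stand-in for the encoder output of one 256x256 image at f8 compression.
latents = torch.randn(1, embed_dim, 32, 32)
codes = quantize(latents)
print(codes.shape)  # torch.Size([1, 1024]): the target sequence OFA learns to generate
```
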
YAOYI626 commented 1 year ago

Hey @logicwong, thanks for your reply!

Just curious, is there a specific reason for doing captioning without VQ? For example, is there a big gap between captioning with VQ codes and captioning with ResNet embeddings?

Thanks, Xiaoyi

logicwong commented 1 year ago

@YAOYI626 There are two main reasons:

  1. Discretizing images with VQ loses information from the original image. In our preliminary experiments, using VQ led to a significant performance drop on the captioning and VQA tasks.
  2. We use a compression ratio of f8 to discretize images, so a 256x256 image is downsampled to a 32x32 latent grid and discretized into a sequence of 1024 codes (see the arithmetic below). Such long sequences increase the training cost.
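
For reference, a quick arithmetic check of the sequence length implied by the compression ratio; the f16 case is a hypothetical comparison point, not a setting mentioned above:

```python
# Sequence length implied by the compression ratio (simple arithmetic check).
image_size = 256
for ratio in (8, 16):  # f8 as described above; f16 shown only for comparison
    grid = image_size // ratio
    print(f"f{ratio}: {grid}x{grid} grid -> {grid * grid} codes")
# f8: 32x32 grid -> 1024 codes
# f16: 16x16 grid -> 256 codes
```
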
YAOYI626 commented 1 year ago

Thanks, @logicwong, for the helpful information. I'll close this issue.