OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

RE Instructions used in OFA-CN #239

Open GeorgeLuImmortal opened 2 years ago

GeorgeLuImmortal commented 2 years ago

Hi, thanks again for contributing such good work. Just wondering, have you released the prompts (i.e., instructions) for the multi-modality tasks used in OFA-CN, especially for the visual grounding task? Thanks.

JustinLin610 commented 2 years ago

We are going to release those prompts, but I can share them here right now. For visual grounding, we use 这段文字" {} "描述的是哪个区域? ("Which region does the text '{}' describe?"), and for captioning, we use 图片描述了什么? ("What does the image describe?"). You can keep using these, or change them to something else with similar semantic meaning if you like.
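For anyone scripting these prompts, here is a minimal sketch of filling the grounding template. The templates are the ones quoted above; the function name `build_grounding_prompt` is hypothetical and not part of the OFA codebase:

```python
# Prompt templates quoted in this thread for OFA-CN.
GROUNDING_PROMPT = '这段文字" {} "描述的是哪个区域?'  # "Which region does the text '{}' describe?"
CAPTION_PROMPT = '图片描述了什么?'  # "What does the image describe?"


def build_grounding_prompt(phrase: str) -> str:
    """Fill the referring phrase into the visual-grounding template.

    Hypothetical helper: the actual OFA data pipeline may format and
    tokenize the prompt differently.
    """
    return GROUNDING_PROMPT.format(phrase)


print(build_grounding_prompt("女孩"))  # the phrase used later in this thread
```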

GeorgeLuImmortal commented 2 years ago

Great, cheers.

GeorgeLuImmortal commented 2 years ago

Hi, the prompt "这段文字" {} "描述的是哪个区域?" fails to produce location tokens. Just wondering which datasets were used for pretraining OFA-CN, so that I can use data from those datasets to replicate the result. Thanks.

JustinLin610 commented 2 years ago

Which checkpoint did you use? A pretrained checkpoint? How about using the finetuned checkpoints to check the visual grounding results? (Or I can build a demo for OFA-CN later.)

This is also possible. The data mainly consist of the Chinese data from LAION-5B and Wukong, plus some datasets translated from English, including the RefCOCO series and VG. The former dominate the dataset, so the prompt may not consistently encourage the model to generate location tokens. Still, I'll double-check to see whether there are other problems I missed.

GeorgeLuImmortal commented 2 years ago

Hi, I used the pretrained large checkpoint as well as the finetuned refcocog checkpoint; in both cases, the model fails to infer location tokens.

Using the pretrained large model as a demo, I randomly picked an image from the Wukong dataset:

[image: wukong_03]

The image captioning result is OK, as shown below:

[image: captioning]

However, for visual grounding with the prompt "这段文字" 女孩 "描述的是哪个区域?", the results are weird; the predicted token IDs and decoding results are shown below:

[image: vg]
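For context when reading the predicted token IDs: OFA represents region coordinates as quantized `<bin_k>` tokens. Below is a hedged sketch of decoding four such tokens back into a pixel bounding box, assuming the `num_bins = 1000` quantization used in the English OFA code (not confirmed for OFA-CN); the function name `bins_to_box` is hypothetical:

```python
import re

# Assumption: coordinates are quantized into 1000 bins, as in English OFA.
NUM_BINS = 1000


def bins_to_box(tokens, img_w, img_h):
    """Decode four <bin_k> tokens into an (x0, y0, x1, y1) pixel box.

    Hypothetical helper sketching the idea; the repo's own decoding
    code may scale coordinates slightly differently.
    """
    ids = [int(m) for m in re.findall(r"<bin_(\d+)>", " ".join(tokens))]
    assert len(ids) == 4, "expected exactly four location tokens"
    # Map each bin index to a fraction of the image size.
    x0, y0, x1, y1 = (i / (NUM_BINS - 1) for i in ids)
    return (x0 * img_w, y0 * img_h, x1 * img_w, y1 * img_h)


# Example: the full image as a box on a 640x480 input.
box = bins_to_box(["<bin_0>", "<bin_0>", "<bin_999>", "<bin_999>"], 640, 480)
```

If the model never emits `<bin_k>` tokens at all (as in the outputs above), there is nothing for such a decoder to consume, which matches the failure described in this thread.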

The results using the refcocog-finetuned model are also weird, as shown below:

[image: vg1]