Open · GeorgeLuImmortal opened 2 years ago

Hi, thanks again for contributing such good work. Just wondering, have you released the prompts (i.e., instructions) for the multi-modality tasks used in OFA-CN, especially for the visual grounding task? Thanks.
We are going to release those prompts, but I can share them here right now. For visual grounding, we use 这段文字" {} "描述的是哪个区域? ("Which region does the text '{}' describe?"), and for captioning, we use 图片描述了什么? ("What does the image describe?"). You can keep using these, or switch to something else with similar semantics if you like.
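For concreteness, here is a minimal sketch of filling the grounding template with a referring expression; the helper name is illustrative, not from the OFA repo:

```python
# Prompt templates quoted from the reply above.
GROUNDING_PROMPT = '这段文字" {} "描述的是哪个区域?'  # visual grounding
CAPTION_PROMPT = "图片描述了什么?"                    # image captioning

def build_grounding_prompt(expression: str) -> str:
    """Insert a referring expression into the grounding template."""
    return GROUNDING_PROMPT.format(expression)

# e.g. build_grounding_prompt("女孩") -> 这段文字" 女孩 "描述的是哪个区域?
print(build_grounding_prompt("女孩"))
```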
Great, cheers.
Hi, the prompt 这段文字" {} "描述的是哪个区域? fails to produce location tokens. Just wondering which datasets were used for pretraining OFA-CN, so I can use data from those datasets to replicate the result. Thanks.
Which checkpoint did you use? The pretrained checkpoints? How about using the finetuned checkpoints to check the visual grounding results? (Or I can build a demo for OFA-CN later.)
That is also possible. The data mainly consist of the Chinese data from LAION-5B and Wukong, plus some datasets translated from English, including the RefCOCO series and VG. The former dominate, so the prompt may not consistently encourage the model to generate location tokens. Still, I'll double-check to see whether there are other problems I missed.
Hi, I used the pretrained large checkpoint as well as the finetuned refcocog checkpoint; in both cases, the model fails to produce location tokens.
Using the pretrained large model as a demo, I randomly picked an image from the Wukong dataset. The image captioning results are OK, as shown below.

However, for visual grounding with the prompt 这段文字" 女孩 "描述的是哪个区域? ("Which region does the text '女孩' (girl) describe?"), the results are weird; the predicted token ids and decoding results are shown below.

The results using the refcocog finetuned model are also weird, as shown below.
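For reference, here is the sanity check I ran on the outputs, a sketch assuming OFA's usual `<bin_i>` location tokens with 1000 bins (the function and parameter names are my own, not from the repo):

```python
import re

def bins_to_box(output: str, img_w: float, img_h: float, num_bins: int = 1000):
    """Map four <bin_i> location tokens back to an (x0, y0, x1, y1) pixel box.

    If grounding works, the decoded string should contain exactly four
    <bin_i> tokens; anything else signals the failure described above.
    """
    bins = [int(b) for b in re.findall(r"<bin_(\d+)>", output)]
    if len(bins) != 4:
        raise ValueError(f"expected 4 location tokens, got {len(bins)}")
    x0, y0, x1, y1 = bins
    # Each bin index is a fraction of the image size on that axis.
    sx, sy = img_w / (num_bins - 1), img_h / (num_bins - 1)
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```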