Open · insundaycathy opened this issue 2 years ago
Thank you for releasing this great work. I read in your paper that the image, text, and objects are represented with tokens from a unified vocabulary. I was wondering how I can extract feature representations of the text captions and the objects in the VG task, because I want to do some further processing on those features. Thanks.

I guess you would like to extract separate features for each object and for the text? That is not easy to do. As you can see, our framework is essentially a generation model, and information from the different modalities interacts at the transformer layers. Thus it is not straightforward to extract independent features for each modality.
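For what it's worth, here is a minimal sketch of how token-level hidden states can be pulled out of a transformer with a PyTorch forward hook. The toy encoder, the layer path, and the assumption that caption tokens and object tokens occupy known positions in the unified sequence are all placeholders standing in for the released model; only the hook mechanism itself is standard PyTorch.

```python
import torch
import torch.nn as nn

# Stand-in for the unified model: replace with the real checkpoint/module.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

features = {}

def save_hidden_states(name):
    def hook(module, inputs, output):
        # Hidden states are assumed to be the module output: [batch, seq_len, dim].
        hidden = output[0] if isinstance(output, tuple) else output
        features[name] = hidden.detach()
    return hook

# Hook the last transformer layer (the module path will differ in the real repo).
handle = encoder.layers[-1].register_forward_hook(save_hidden_states("last_layer"))

# Toy unified sequence: pretend positions 0-9 hold caption tokens, 10-14 object tokens.
tokens = torch.randn(1, 15, 256)
with torch.no_grad():
    encoder(tokens)
handle.remove()

hidden = features["last_layer"]          # [1, 15, 256]
text_feats = hidden[:, :10, :]           # features at caption-token positions
object_feats = hidden[:, 10:, :]         # features at object-token positions
```

Note that, as the reply above says, these sliced features are not truly independent per modality: by the last layer, attention has already mixed text and object information together.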