Open · insundaycathy opened this issue 2 years ago
Thank you for releasing this great work. I read in your paper that the image, text, and objects are represented with tokens from a unified vocabulary. I was wondering how I can extract feature representations of the text captions and the objects in the VG task, because I want to do some further processing on those features. Thanks.

I guess you would like to extract separate features for each object and for the text? That is not easy to do. As you can see, our framework is essentially a generation model, and information from the different modalities interacts at the transformer layers. Thus it is not straightforward to extract independent features for each modality.
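For what it's worth, here is a minimal sketch of how token-level hidden states can be pulled out of a transformer with a PyTorch forward hook. The toy encoder, the layer path, and the assumption that caption tokens and object tokens occupy known positions in the unified sequence are all placeholders standing in for the released model; only the hook mechanism itself is standard PyTorch.

```python
import torch
import torch.nn as nn

# Stand-in for the unified model: replace with the real checkpoint/module.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

features = {}

def save_hidden_states(name):
    def hook(module, inputs, output):
        # Hidden states are assumed to be the module output: [batch, seq_len, dim].
        hidden = output[0] if isinstance(output, tuple) else output
        features[name] = hidden.detach()
    return hook

# Hook the last transformer layer (the module path will differ in the real repo).
handle = encoder.layers[-1].register_forward_hook(save_hidden_states("last_layer"))

# Toy unified sequence: pretend positions 0-9 hold caption tokens, 10-14 object tokens.
tokens = torch.randn(1, 15, 256)
with torch.no_grad():
    encoder(tokens)
handle.remove()

hidden = features["last_layer"]          # [1, 15, 256]
text_feats = hidden[:, :10, :]           # features at caption-token positions
object_feats = hidden[:, 10:, :]         # features at object-token positions
```

Note that, as the reply above says, these sliced features are not truly independent per modality: by the last layer, attention has already mixed text and object information together.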