Yuqifan1117 / CaCao

This is the official repository for the paper "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World" (Accepted by ICCV 2023)

some questions about EPIC #18

Closed · ENDGU closed this 7 months ago

ENDGU commented 7 months ago

Hello, sorry to bother you again. I am very interested in the EPIC module of your paper and have recently been trying to reproduce it, but I ran into some difficulties. For the image input, Figure 3 uses three images, representing the union region, the subject, and the object. May I ask:

1. Should the corresponding images be cropped in advance, and if so, how are they obtained?
2. Assuming the images have been cropped, does the CLIP model process the three images and then concatenate the resulting embeddings, or are they simply added together?

Yuqifan1117 commented 7 months ago
  1. We used the coordinates from the corresponding proposals and cropped the subject region as well as the object region for the next steps. (We show union regions in the figure for illustration purposes; in practice, considering the time complexity, we instead use a union feature of the two corresponding regions.)
  2. We resize the two cropped images and feed them into the CLIP model. After obtaining the image features, we add them together as the final image feature used to learn the predicate representation. (A rough sketch of these two steps is given below.)
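For reference, here is a minimal sketch of those two steps (an assumption for illustration, not the repository's actual code). It uses the OpenAI `clip` package and PIL; the image path and proposal boxes are hypothetical.

```python
# Sketch: crop the subject/object regions from their proposal boxes,
# encode both with CLIP, and sum the two features.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def region_feature(image: Image.Image, box):
    """Crop a proposal box (x1, y1, x2, y2), resize via CLIP's preprocess, and encode."""
    crop = image.crop(box)                            # crop the proposal region
    pixels = preprocess(crop).unsqueeze(0).to(device) # resize/normalize to CLIP's input size
    with torch.no_grad():
        return model.encode_image(pixels)             # [1, 512] for ViT-B/32

image = Image.open("example.jpg")                          # hypothetical image path
sub_box, obj_box = (30, 40, 200, 260), (180, 90, 420, 300) # hypothetical proposal boxes
# Add the subject and object features to get the final visual feature for the pair.
pair_feature = region_feature(image, sub_box) + region_feature(image, obj_box)
```
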
ENDGU commented 7 months ago

Thank you very much for your patient reply, which has helped me a lot!

ENDGU commented 7 months ago

Regarding text processing, Figure 3 shows that the text input has [CLS] and [MASK] special tokens, which do not seem to be part of the CLIP model's text processing. Are you using the BERT tokenizer to embed the text?

Yuqifan1117 commented 7 months ago

'[CLS] a photo of human [MASK] computer' is the template triplet. To learn the predicate representation, the [MASK] special token stands for an unseen predicate and is replaced with the corresponding text. [CLS] corresponds to the CLIP model's pooled output and does not need to be typed in. So an example input would be 'a photo of human manipulation computer'.
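A small sketch of how such a template could be filled and encoded (an assumption for illustration, not the repository's code); the candidate predicates below are hypothetical:

```python
# Sketch: substitute a candidate predicate for the [MASK] slot, then tokenize
# and encode the sentence with CLIP's text encoder; [CLS] is not typed in
# explicitly because the text encoder's pooled output plays that role.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def predicate_prompt(subject: str, predicate: str, obj: str) -> str:
    # The [MASK] slot is replaced with the predicate text before tokenization.
    return f"a photo of {subject} {predicate} {obj}"

candidates = ["manipulation", "holding", "using"]            # hypothetical predicates
prompts = [predicate_prompt("human", p, "computer") for p in candidates]
tokens = clip.tokenize(prompts).to(device)                   # [3, 77] token ids
with torch.no_grad():
    text_features = model.encode_text(tokens)                # [3, 512] pooled text features
```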

ENDGU commented 7 months ago

The text also mentions that the predicate uses a GloVe vector for its embedding. Does that mean only the predicate in the input sentence uses a GloVe vector, or the entire sentence?

Yuqifan1117 commented 7 months ago

It seems to use the CLIP tokenizer for the embedding. (Is that what you are asking?)

ENDGU commented 7 months ago

My confusion is about how the GloVe vector mentioned here is used. This paragraph comes from Appendix F of the paper.

Yuqifan1117 commented 7 months ago

We have already updated several confusing mentions in the newest version (v2, in August).

ENDGU commented 7 months ago

Thank you very much!!!!!

ENDGU commented 7 months ago
[screenshot showing the embedding shapes in question]

As far as I know, the CLIP text embedding is shaped [batch_size, 77, 512] and the image embedding is [batch_size, 50, 512]. What do the lengths 4 and 2 in the picture mean?

Yuqifan1117 commented 7 months ago

The embedding is shaped as [batch_size, seq_length, embedding_size], where the lengths 4 and 2 are incorporated into the seq_length dimension.
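A minimal sketch of that shape convention (an illustration with random tensors, not the paper's actual pipeline): items of length 4 and 2 are simply stacked along the seq_length axis.

```python
# Sketch: tensors follow [batch_size, seq_length, embedding_size], so
# "lengths 4 and 2" just mean 4 and 2 items along the seq_length dimension.
import torch

batch_size, embed_dim = 8, 512
text_tokens = torch.randn(batch_size, 4, embed_dim)   # e.g. 4 token embeddings
region_feats = torch.randn(batch_size, 2, embed_dim)  # e.g. subject + object features

sequence = torch.cat([text_tokens, region_feats], dim=1)  # concatenate along seq_length
print(sequence.shape)  # torch.Size([8, 6, 512])
```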

ENDGU commented 7 months ago

Thank you, I'll try again.