ecoxial2007 / LGVA_VideoQA

Language-Guided Visual Aggregation for Video Question Answering

About CLIP’s text-encoder. #8

Closed Yuzuriha-Inori-x closed 7 months ago

Yuzuriha-Inori-x commented 8 months ago

Hello, I still have some doubts about using CLIP to extract features of the question. By modifying the original CLIP code, we can obtain local question features of shape [bs, 77, 512], but it is not clear how to obtain the global question features you mention in the paper. Could you give me some advice?

ecoxial2007 commented 8 months ago

Generally speaking, BERT uses the [CLS] token as the global feature, i.e., the slice [:, 0, :]. This is a relatively abstract semantic feature. If you want the global feature to carry more fine-grained information, you can instead use [:, :, :].mean(dim=1), i.e., mean pooling over the tokens. However, the difference should not be significant.
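
For reference, a minimal sketch of the two pooling options described above. The tensor here is only a stand-in for the per-token output of the modified CLIP text encoder, not code from this repo:

```python
import torch

# Stand-in for the local question features [bs, 77, 512]
# produced by the modified CLIP text encoder.
bs, seq_len, dim = 2, 77, 512
token_feats = torch.randn(bs, seq_len, dim)

# Option 1: take the first token as the global feature,
# analogous to BERT's [CLS] token.
global_feat_cls = token_feats[:, 0, :]       # [bs, 512]

# Option 2: mean-pool over all tokens to keep more
# fine-grained information in the global feature.
global_feat_mean = token_feats.mean(dim=1)   # [bs, 512]

print(global_feat_cls.shape, global_feat_mean.shape)
```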

Yuzuriha-Inori-x commented 7 months ago

Thank you very much for your reply. In the meantime, I have another question I would like your help with. In your paper, Figure 4(a) and (b) show the impact of different numbers of GCA and LCA layers on performance, respectively. I'm more curious about how the GCA and LCA layers are stacked, and where in the model they are inserted.

ecoxial2007 commented 7 months ago

Generally speaking, LCA should be performed first. In fact, GCA is sometimes not even necessary, since the fine-grained token embeddings already provide sufficient information.
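
For illustration only, here is one way such a stacking could look under the ordering described above (LCA first, GCA optional). The module names `CrossAttentionBlock` and `LCAThenGCA` and the use of plain multi-head cross-attention are my own assumptions for the sketch, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Generic cross-attention block used as a stand-in for an LCA or GCA layer."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)

class LCAThenGCA(nn.Module):
    """Hypothetical stacking: LCA layers over the per-token question features
    first, then optional GCA layers over the global question feature."""
    def __init__(self, dim=512, n_lca=2, n_gca=1):
        super().__init__()
        self.lca_layers = nn.ModuleList(CrossAttentionBlock(dim) for _ in range(n_lca))
        self.gca_layers = nn.ModuleList(CrossAttentionBlock(dim) for _ in range(n_gca))

    def forward(self, visual, q_tokens, q_global=None):
        # LCA: visual features attend to the local question features [bs, 77, 512]
        for layer in self.lca_layers:
            visual = layer(visual, q_tokens)
        # GCA: optionally attend to the global question feature [bs, 1, 512]
        if q_global is not None:
            for layer in self.gca_layers:
                visual = layer(visual, q_global)
        return visual

# Toy usage
vis = torch.randn(2, 16, 512)      # e.g. 16 frame/region features
q_tok = torch.randn(2, 77, 512)    # local question features
q_glb = torch.randn(2, 1, 512)     # global question feature
print(LCAThenGCA()(vis, q_tok, q_glb).shape)  # torch.Size([2, 16, 512])
```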