Closed by Yuzuriha-Inori-x 7 months ago
Generally speaking, BERT uses the [CLS] token as a global feature, i.e. `hidden_states[:, 0, :]`. This is a relatively abstract semantic feature. If you want the feature to carry more fine-grained information, you can instead use `hidden_states.mean(dim=1)`, averaging over all token positions. In practice, however, the difference between the two is usually not significant.
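For illustration, here is a minimal sketch of the two pooling strategies described above. The tensor names and shapes are assumptions (a random tensor stands in for BERT's actual last hidden state):

```python
import torch

# Hypothetical stand-in for BERT's last hidden state: [batch, seq_len, hidden].
batch, seq_len, hidden = 2, 128, 768
last_hidden = torch.randn(batch, seq_len, hidden)

# Global feature: the [CLS] token at position 0.
cls_feat = last_hidden[:, 0, :]       # [batch, hidden]

# Finer-grained alternative: mean over all token positions.
mean_feat = last_hidden.mean(dim=1)   # [batch, hidden]

print(cls_feat.shape, mean_feat.shape)  # both [batch, hidden]
```

Both pooled tensors have shape `[batch, hidden]`, so either can be dropped into downstream code interchangeably.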
Thank you very much for your reply. In the meantime, I have another question I would like your help with. In your paper, Figures 4(a) and 4(b) show the impact of the number of GCA and LCA layers on performance, respectively. I am curious how the GCA and LCA layers are stacked, and at which points in the model they are inserted.
Generally speaking, LCA should be applied first. In fact, GCA is sometimes unnecessary, since the fine-grained token embeddings already provide sufficient information.
Hello, I still have some doubts about using CLIP to extract question features. By modifying CLIP's original code, we can obtain local question features of shape [bs, 77, 512], but it is not clear how to obtain the global question features you mention in the paper. Could you give me some advice?
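For context, OpenAI's CLIP obtains its global text feature in `encode_text` by taking the hidden state at the end-of-text (EOT) token position, located via `text.argmax(dim=-1)` since EOT has the largest token id in the vocabulary. A minimal sketch of that pooling step, with assumed shapes and a random tensor standing in for the local features (in real CLIP, `ln_final` and `text_projection` are also applied):

```python
import torch

# Assumed shapes: local per-token text features from a modified CLIP forward.
bs, ctx_len, dim = 4, 77, 512
local_feats = torch.randn(bs, ctx_len, dim)    # [bs, 77, 512]

# Hypothetical token ids; in CLIP's vocabulary the EOT token has the max id.
text = torch.randint(0, 49407, (bs, ctx_len))
text[:, 10] = 49407                            # place an EOT token per sample

# The EOT position holds the maximum token id, so argmax locates it.
eot_idx = text.argmax(dim=-1)                  # [bs]

# Global feature: the hidden state at each sample's EOT position.
global_feat = local_feats[torch.arange(bs), eot_idx]  # [bs, 512]

print(global_feat.shape)
```

So the global question feature is simply the EOT-position slice of the same [bs, 77, 512] tensor, optionally followed by CLIP's final layer norm and text projection.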