ecoxial2007 / LGVA_VideoQA

Language-Guided Visual Aggregation for Video Question Answering

Processing of candidate words #9

Open Yuzuriha-Inori-x opened 7 months ago

Yuzuriha-Inori-x commented 7 months ago

Hi, I have some doubts about how the candidate words are processed. Could you help me?

Suppose we have 40 candidate words, which we put into CLIP's text encoder following the approach in the paper. Do we feed them into the text encoder one by one to get a feature of shape [40, 512]? Or do we concatenate all the candidate words into one sentence and get a feature of shape [1, 512]?

ecoxial2007 commented 7 months ago

For processing 40 candidate words with the CLIP text encoder, the procedure is as follows:

You start with a list of N texts (in your case, N = 40 candidate words). These are first passed through the tokenizer, which converts them into a tensor of shape [N, 77]. This tensor then goes through the transformer, producing a tensor of shape [N, 77, 768]. From this, we extract the sentence-level token representation for each text, which gives a tensor of shape [N, 768]. Finally, a linear projection maps this to the joint embedding space, yielding a final tensor of shape [N, 512].

Therefore, encoding each candidate word individually or passing all 40 together as one batch follows the same procedure and produces the same final tensor shape. Each word occupies its own row in the batch, so the individual word representations never mix.
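The shape flow above can be sketched with random stand-in weights (numpy only, not the real CLIP model; the dimensions 77/768/512 are taken from the description above):

```python
import numpy as np

N, ctx_len, width, embed_dim = 40, 77, 768, 512
rng = np.random.default_rng(0)

# 1. Tokenizer: each candidate word -> padded token ids, shape [N, 77]
tokens = rng.integers(0, 49408, size=(N, ctx_len))

# 2. Transformer (random stand-in here): per-token features, shape [N, 77, 768]
hidden = rng.standard_normal((N, ctx_len, width))

# 3. Pool one sentence-level token per text, shape [N, 768]
pooled = hidden[:, 0, :]

# 4. Linear projection into the joint embedding space, shape [N, 512]
proj = rng.standard_normal((width, embed_dim))
text_features = pooled @ proj

print(text_features.shape)  # (40, 512)
```

Running the same code with N = 1 per word and stacking the results would give the identical [40, 512] tensor layout, which is the point of the answer above.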

Yuzuriha-Inori-x commented 7 months ago

Thank you for your reply. I also wanted to ask: after getting R through the `build_graph()` operation, why is R reshaped to [bs\*4, 4, 10, 10] via `R.view(batch_size * 4, 4, region_pframe, region_pframe)`, rather than to [bs, 16, 10, 10] via `R.view(batch_size, 4 * 4, region_pframe, region_pframe)`?

ecoxial2007 commented 7 months ago

Sorry for the late reply. R is reshaped to [bs\*4, 4, 10, 10] because that is the input format for the Edge Transformer: each clip of 4 frames becomes one batch element, so attention models the relationships between objects along the temporal dimension within a clip. This approach follows VGT (ECCV 2022), https://github.com/sail-sg/VGT.
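A small numpy sketch of the two reshapes (the clip/frame counts and variable names are assumptions based on the shapes discussed above, not the repository's actual code):

```python
import numpy as np

bs, n_clips, frames_per_clip, r = 2, 4, 4, 10  # r = region_pframe
# R: one r x r relation matrix per frame, 16 frames grouped into 4 clips
R = np.zeros((bs, n_clips * frames_per_clip, r, r), dtype=np.float32)

# Folding clips into the batch dim gives the Edge Transformer a
# length-4 temporal sequence of adjacency matrices per clip:
R_clip = R.reshape(bs * n_clips, frames_per_clip, r, r)

# versus keeping all 16 frames as one long sequence per video:
R_flat = R.reshape(bs, n_clips * frames_per_clip, r, r)

print(R_clip.shape, R_flat.shape)  # (8, 4, 10, 10) (2, 16, 10, 10)
```

With the first layout, attention inside the Edge Transformer only mixes the 4 frames of one clip, which is the within-clip temporal modeling the answer describes; the second layout would let every frame attend to every other frame across the whole video.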