ecoxial2007 / LGVA_VideoQA

Language-Guided Visual Aggregation for Video Question Answering

Clarification on Feature Representations from text_features_all.h5 and text_features_clip.h5 #11


Khadgar123 commented 4 months ago

I'm currently working with the datasets stored in text_features_all.h5 and text_features_clip.h5 and have come across three specific features extracted from these files: text_query_features, text_query_token_features, and text_cands_features.

Could you provide a detailed explanation of what each of these three features represents within the context of the data? Specifically, how do text_query_features, text_query_token_features, and text_cands_features differ in their representation of the text data, and what role does each play in the overall model?

I'm also curious about the extraction process for these features. Thank you very much for your support.

ecoxial2007 commented 4 months ago

All three features are text features extracted with OpenAI's CLIP text encoder. text_query_features and text_query_token_features are the global and per-token (local) features of the query, respectively, with dimensions 1x512 and 77x768. text_cands_features are the global features of the answer candidates, with dimension 1x512. The text_query_token_features are the input to the LCA, the text_query_features are the input to the GCA, and the text_cands_features are used to compute the similarity and the loss. For more details, please refer to the paper.
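For illustration, here is a minimal sketch (not the authors' actual extraction script) of how global and token-level text features of this kind can be obtained with the Hugging Face transformers CLIP implementation. The model name, variable names, and the exact hidden sizes are assumptions; the widths depend on which CLIP variant is used.

```python
# Hypothetical extraction sketch: global vs. token-level CLIP text features.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

model_name = "openai/clip-vit-base-patch32"  # assumption; pick the variant you use
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_model = CLIPTextModelWithProjection.from_pretrained(model_name)

question = "what is the man holding?"
candidates = ["a cup", "a phone", "a book"]

# CLIP always pads/truncates text to its fixed 77-token context length.
q_inputs = tokenizer(question, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    q_out = text_model(**q_inputs)

text_query_features = q_out.text_embeds              # (1, 512): projected global feature -> GCA
text_query_token_features = q_out.last_hidden_state  # (1, 77, hidden): per-token features -> LCA

# Global features of each answer candidate, used for similarity / loss.
c_inputs = tokenizer(candidates, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_cands_features = text_model(**c_inputs).text_embeds  # (num_cands, 512)
```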

Khadgar123 commented 4 months ago

Is my understanding correct that text_query_token_features are generated by applying CLIP to extract features for each individual word within the textual query? If so, does this process involve treating each word as a separate input to CLIP, thereby generating distinct feature vectors for each word, which are then aggregated to form the text_query_token_features with dimensions of 77x768?

ecoxial2007 commented 4 months ago

If you check the source code of OpenAI's CLIP, you will find that, regardless of the length of the input sentence, it is transformed into 77 tokens after tokenization. This involves subword (byte-pair encoding) segmentation; you can refer to the vocabulary used by CLIP for details. Also, the tokens are not independent of each other: they are contextualized jointly by the transformer text encoder.
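A small example of this behavior, assuming OpenAI's clip package is installed (the sentences and model choice are just for illustration):

```python
import torch
import clip

# Every sentence, short or long, is padded/truncated to the 77-token context length.
tokens = clip.tokenize(["what is the man holding?", "a cat sat on the mat"])
print(tokens.shape)  # torch.Size([2, 77])

model, _ = clip.load("ViT-B/32")
with torch.no_grad():
    # The transformer contextualizes all 77 token positions together;
    # the global text embedding is read out after the final projection.
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([2, 512])
```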

Khadgar123 commented 4 months ago

While analyzing the codebase of the project, I ran into a point of confusion regarding the dimension-selection operation on bbox_features: rFeature = item_dict['bbox_features'][:, :, 0, :, :]. This line selects a specific slice from a higher-dimensional array. Based on my understanding of the subsequent processing, the shape of the features after selection is (B, L, M, D_in), where B is the batch size, L the sequence length, M the number of region proposals, and D_in the feature dimensionality. What I am seeking clarity on is the original shape of bbox_features before this selection takes place. Could you explain what the third dimension specifically represents and why it is indexed with [:, :, 0, :, :]?

ecoxial2007 commented 4 months ago

This comes from our initial experimental procedure: we also extracted CLIP features for each bbox label (like "cat" or "dog") and stored them at index 1 of that third dimension, while index 0 holds the visual features of the region. However, during the actual experiments we found that many of these labels were incorrect, so we did not use them and relied exclusively on the regions' visual features.
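A hedged illustration of the slicing described above; the concrete sizes (B, L, M, D_in) are made-up placeholders, not values taken from the released feature files:

```python
import torch

B, L, M, D_in = 2, 16, 10, 512  # batch, frames, region proposals, feature dim (assumed)

# bbox_features as described: the third dimension has size 2, where
#   index 0 = CLIP visual features of each region,
#   index 1 = CLIP text features of the (unreliable) detector labels.
bbox_features = torch.randn(B, L, 2, M, D_in)

rFeature = bbox_features[:, :, 0, :, :]  # keep only the region visual features
print(rFeature.shape)                    # torch.Size([2, 16, 10, 512])
```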