Hi,
Just following up on this question.
Following this question!
@Dotori-HJ @CrazyGeG @arushirai1 Sorry for the late reply. I hope this finds you well.
For video feature extraction, you can refer to the script from another one of our projects: extract_tad_feature.py. You just need to switch the model from VideoMAEv2 to InternVideo2. You can find the pretrained model links and configuration details for InternVideo2 here. We uniformly sample 8 frames for each sliding window input to InternVideo2.
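For reference, here is a minimal sketch of what the adapted extraction loop could look like. It only illustrates the sliding-window logic and the uniform 8-frame sampling: decord is assumed for decoding, `build_internvideo2()` is a hypothetical placeholder for however you load the pretrained backbone, and the exact resize/normalization should follow the InternVideo2-CLIP transforms discussed later in this thread.

```python
# Sketch of sliding-window feature extraction, loosely adapted from extract_tad_feature.py.
# `build_internvideo2()` is a hypothetical placeholder; replace it with the loading code
# and config linked in the InternVideo2 repo.
import numpy as np
import torch
import torch.nn.functional as F
from decord import VideoReader, cpu

# ImageNet statistics quoted later in this thread; confirm against the InternVideo2-CLIP config.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def preprocess(clip, size=224):
    """clip: uint8 tensor (T, H, W, C) -> float tensor (C, T, size, size)."""
    clip = clip.permute(0, 3, 1, 2).float() / 255.0                    # (T, C, H, W) in [0, 1]
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    clip = (clip - MEAN) / STD                                         # per-channel normalization
    return clip.permute(1, 0, 2, 3)                                    # (C, T, H, W)


def build_internvideo2():
    # Hypothetical placeholder: load the pretrained InternVideo2 backbone here.
    raise NotImplementedError


@torch.no_grad()
def extract_features(video_path, window_size=16, stride=16, num_frames=8, device="cuda"):
    vr = VideoReader(video_path, ctx=cpu(0))
    model = build_internvideo2().to(device).eval()

    feats = []
    for start in range(0, len(vr), stride):
        end = min(start + window_size, len(vr))
        idx = np.linspace(start, end - 1, num_frames).astype(int).tolist()  # uniform 8-frame sampling
        clip = torch.from_numpy(vr.get_batch(idx).asnumpy())                # (T, H, W, C), uint8
        clip = preprocess(clip).unsqueeze(0).to(device)                     # (1, C, T, H, W)
        feats.append(model(clip).cpu())                                     # one feature per window
    return torch.cat(feats, dim=0)                                          # (num_windows, D)
```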
For query feature extraction, we use the last hidden state of chinese_alpaca_lora_7b.
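A similarly hedged sketch of the query side, assuming a Hugging Face-style checkpoint (the path is a placeholder, and chinese_alpaca_lora_7b must first be merged with the LLaMA base weights following the Chinese-LLaMA-Alpaca instructions):

```python
# Sketch of query (text) feature extraction: take the last hidden state of the language model.
# The checkpoint path is a placeholder for a merged chinese_alpaca_lora_7b model.
import torch
from transformers import LlamaModel, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/path/to/merged_chinese_alpaca_7b")
model = LlamaModel.from_pretrained("/path/to/merged_chinese_alpaca_7b").eval()


@torch.no_grad()
def extract_query_feature(query: str) -> torch.Tensor:
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)   # (num_tokens, hidden_dim)
```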
@yinanhe Thank you for your reply. I have a question about the normalization process for the model.
It seems normalization is applied during training using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. <build.py in InternVideo2> https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/single_modality/datasets/build.py#L14-L36
However, during feature extraction, it seems the input video is only scaled to the 0 ~ 1 range.
<extract_tad_feature.py in VideoMAEv2>
https://github.com/OpenGVLab/VideoMAEv2/blob/29eab1e8a588d1b3ec0cdec7b03a86cca491b74b/extract_tad_feature.py#L16-L17

```python
def to_normalized_float_tensor(vid):
    return vid.permute(3, 0, 1, 2).to(torch.float32) / 255
```
Could you clarify why there’s a difference in the normalization process between training and feature extraction, and whether this discrepancy affects the extracted features?
@Dotori-HJ Sorry, my previous reply was not rigorous enough and caused you trouble. For the data transform, you still need to follow the transform process of InternVideo2-CLIP. For details, you can refer to https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/multi_modality/dataset/__init__.py#L133-L154
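For anyone else hitting this, the referenced transform is roughly equivalent to the following torchvision pipeline. This is only a sketch: the exact resolution and interpolation should be taken from the linked config, and the mean/std are the ImageNet values quoted above.

```python
# Sketch of a test-time transform in the spirit of the linked InternVideo2-CLIP code.
# Resolution, interpolation mode, and mean/std should be confirmed against the linked config.
from torchvision import transforms
from torchvision.transforms import InterpolationMode

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

test_transform = transforms.Compose([
    # Input: uint8 video tensor of shape (T, C, H, W).
    transforms.Lambda(lambda x: x.float().div(255.0)),                      # scale to [0, 1] first
    transforms.Resize((224, 224), interpolation=InterpolationMode.BICUBIC),
    transforms.Normalize(mean=mean, std=std),                               # then mean/std normalization
])
```

In other words, the divide-by-255 in extract_tad_feature.py is only the first step; the mean/std normalization still needs to be applied on top of it.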
Thank you for the clarification and guidance. It has been very helpful!
Thank you for the great work!
I am currently working on temporal action localization and plan to use InternVideo2-1B and -6B to extract features from raw video data that is not available on Hugging Face. However, I am unclear about the exact feature extraction process.
Could you please provide guidance or an example on how to extract features from raw video using InternVideo2?