OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

About feature extraction from raw video using InternVideo2 #182

Closed · Dotori-HJ closed this issue 3 weeks ago

Dotori-HJ commented 2 months ago

Thank you for the great work!

I am currently working on temporal action localization and plan to use InternVideo2-1B and -6B to extract features from raw video data that is not available on Hugging Face. However, I am unclear about the exact feature extraction process.

Could you please provide guidance or an example on how to extract features from raw video using InternVideo2?

CrazyGeG commented 1 month ago

Hi,

Just following up on this question.

arushirai1 commented 3 weeks ago

Following this question!

yinanhe commented 3 weeks ago

@Dotori-HJ @CrazyGeG @arushirai1 Sorry for the late reply. I hope this finds you well.

For video feature extraction, you can refer to the script from another one of our projects: extract_tad_feature.py. You just need to switch the model from VideoMAEv2 to InternVideo2. You can find the pretrained model links and configuration details for InternVideo2 here. We uniformly sample 8 frames for each sliding window input to InternVideo2.
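
In case it's useful, here is a minimal sketch of that sliding-window loop with the backbone swapped in. It assumes InternVideo2 is wrapped as a `model` callable that maps a (1, C, T, H, W) clip to a (1, D) embedding; the window/stride values below are placeholders, and extract_tad_feature.py has the exact pipeline.

```python
import numpy as np
import torch

@torch.no_grad()
def extract_clip_features(frames, model, window=16, stride=16, num_frames=8):
    """frames: (T, H, W, C) uint8 ndarray -> (num_windows, D) features."""
    feats = []
    for start in range(0, max(len(frames) - window, 0) + 1, stride):
        clip = frames[start:start + window]
        # Uniformly sample 8 frames from each sliding window, as noted above.
        idx = np.linspace(0, len(clip) - 1, num_frames).astype(int)
        clip = torch.from_numpy(clip[idx]).permute(3, 0, 1, 2).float()  # (C, T, H, W)
        # Apply the InternVideo2-CLIP transform here (see the discussion below).
        feats.append(model(clip.unsqueeze(0)))  # assumed to return a (1, D) embedding
    return torch.cat(feats, dim=0)
```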

For query feature extraction, we use the last hidden state of chinese_alpaca_lora_7b.
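
If it helps, here is a rough sketch of pulling that last hidden state with HuggingFace `transformers`, assuming the LoRA weights of chinese_alpaca_lora_7b have already been merged into a standard checkpoint (the path below is a placeholder):

```python
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "path/to/chinese_alpaca_lora_7b"  # placeholder for a merged checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

@torch.no_grad()
def encode_query(text):
    inputs = tokenizer(text, return_tensors="pt")
    # The last hidden state of the LM serves as the query feature:
    # shape (1, seq_len, hidden_dim).
    return model(**inputs).last_hidden_state
```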

Dotori-HJ commented 3 weeks ago

@yinanhe Thank you for replying. I have a question about the normalization process for the model.

It seems normalization is applied during training using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225] (build.py in InternVideo2): https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/single_modality/datasets/build.py#L14-L36

However, during feature extraction, it seems the input video is only normalized to the 0~1 range (extract_tad_feature.py in VideoMAEv2): https://github.com/OpenGVLab/VideoMAEv2/blob/29eab1e8a588d1b3ec0cdec7b03a86cca491b74b/extract_tad_feature.py#L16-L17

```python
def to_normalized_float_tensor(vid):
    return vid.permute(3, 0, 1, 2).to(torch.float32) / 255
```

Could you clarify why there’s a difference in the normalization process between training and feature extraction, and whether this discrepancy affects the extracted features?

yinanhe commented 3 weeks ago

@Dotori-HJ Sorry, my earlier reply was not rigorous enough and caused you trouble. For the data transform, you still need to follow the transform process of InternVideo2-CLIP. For details, you can refer to https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/multi_modality/dataset/__init__.py#L133-L154
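
For illustration, here is a sketch of a typical CLIP-style eval transform; the exact resize/crop sizes and mean/std should be taken from the linked dataset/__init__.py, so treat the values below as placeholders. The point is that the /255 scaling is only the first step, and the mean/std normalization is applied on top of it, matching the training transform.

```python
import torch
from torchvision import transforms

# Placeholder values; use the ones from the linked dataset/__init__.py.
mean, std = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

transform = transforms.Compose([
    transforms.Resize(224),                          # resize the shorter side
    transforms.CenterCrop(224),
    transforms.Lambda(lambda x: x.float() / 255.0),  # uint8 -> [0, 1]
    transforms.Normalize(mean=mean, std=std),        # then mean/std normalize
])

# Applied per clip: (T, C, H, W) uint8 frames in, normalized floats out.
clip = torch.randint(0, 256, (8, 3, 256, 320), dtype=torch.uint8)
clip = transform(clip)
```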

Dotori-HJ commented 3 weeks ago

Thank you for the clarification and guidance. It has been very helpful!