OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License

(Feature request) Batched feature extraction #20

Closed christian-matroid closed 3 months ago

christian-matroid commented 1 year ago

Hello, thank you for releasing the code and great work!

Is there a way to increase the batch size in the simple feature extraction examples? The current script only uses about 7 GB of VRAM during feature extraction.

congee524 commented 1 year ago

Hello! Due to the varying length of each video, parallel processing may be troublesome and require large memory for temporary features. We may not support this feature for the time being.

christian-matroid commented 1 year ago

Hello! Thank you so much for your response! I have two follow-up questions, if it won't take too much of your time.

may be troublesome and require large memory for temporary features

If I have access to a large amount of memory, could it be as simple as increasing the batch size and reshaping along the input tensor's batch dimension?

# ...
# add a "batch_size" integer argument

for start_idx in tqdm.tqdm(start_idx_range(len(vr)), position=1, leave=0, desc=vid_name):
    data = vr.get_batch(np.arange(start_idx, start_idx + 16 * args.batch_size)).asnumpy()

    tensor_data = torch.from_numpy(data).cuda()  # Size([16*bs, 566, 320, 3])
    tdq = transform(tensor_data).unsqueeze(0)  # Size([3, 16*bs, 224, 224])
    tdq = torch.reshape(tdq, (args.batch_size, tdq.shape[1], 16, *tdq.shape[3:]))

    with torch.no_grad():
        batched_feature = model.forward_features(tdq)
        feature_list.extend(feature.cpu().numpy() for feature in batched_feature)
# ...

Additionally, when I try to load the pretrained ViT-L backbone architecture, I get numerous parameter mismatches. Is there an additional parameter I need to change to use the VideoMAE (v1) model zoo models?

# ask to initialize the pretrained backbone from the VideoMAE model zoo
import torch
from timm.models import create_model
import models  # noqa: F401  (registers the repo's video ViT variants)

print(args.model)  # vit_large_patch16_224

model = create_model(
    args.model,
    img_size=224,
    pretrained=False,
    # num_classes=710,
    all_frames=16,
    tubelet_size=2,
    drop_path_rate=0.3,
    use_mean_pooling=True,
)
ckpt = torch.load(args.ckpt_path, map_location="cpu")
for model_key in ["model", "module"]:
    if model_key in ckpt:
        ckpt = ckpt[model_key]
        break
model.load_state_dict(ckpt)  # ERRORS HERE: a very long parameter-mismatch message

congee524 commented 1 year ago

There is something wrong with the code you wrote for extracting features with a larger batch size.

Specifically, after the transform, the tensor shape is [3, bs * 16, 224, 224]. This should be followed by tdq = rearrange(tdq, 'c (b t) h w -> b c t h w', b=bs, t=16). Also, your code does not handle the case where the video length is not divisible by batch_size * 16. It is recommended that if you are not sure, you still use the original code to extract the features.
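
For reference, a rough, untested sketch of the loop with both fixes applied might look like this. It reuses the names vr, transform, start_idx_range, model, vid_name and feature_list from the original extract_tad_feature.py and assumes a hypothetical args.batch_size argument:

# A minimal sketch (not tested): batched extraction with einops rearrange and
# a smaller final batch when the clip count is not divisible by batch_size.
import numpy as np
import torch
import tqdm
from einops import rearrange

clip_len = 16
start_indices = list(start_idx_range(len(vr)))

for i in tqdm.tqdm(range(0, len(start_indices), args.batch_size), desc=vid_name):
    batch_starts = start_indices[i:i + args.batch_size]  # last batch may be smaller
    frame_ids = np.concatenate([np.arange(s, s + clip_len) for s in batch_starts])

    data = vr.get_batch(frame_ids).asnumpy()            # [b * 16, H, W, 3]
    frames = transform(torch.from_numpy(data).cuda())   # [3, b * 16, 224, 224]
    frames = rearrange(frames, 'c (b t) h w -> b c t h w',
                       b=len(batch_starts), t=clip_len)

    with torch.no_grad():
        batched_feature = model.forward_features(frames)  # [b, feat_dim]
    feature_list.extend(f.cpu().numpy() for f in batched_feature)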


For the second problem, you should strip the encoder. prefix from the checkpoint keys, as done around line 587: https://github.com/OpenGVLab/VideoMAEv2/blob/9492db0047a9e30446a4093543a1a39dfe62b459/run_class_finetuning.py#L582-L591. In addition, using a model that is only pre-trained, without fine-tuning supervision from high-level semantic hard labels, can be very ineffective for TAD tasks, so please do not use this model for feature extraction.
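
For illustration, a minimal sketch of that key remapping, modeled on the linked run_class_finetuning.py snippet (variable names are placeholders), could look like:

# A minimal sketch: strip the pre-training prefix before load_state_dict.
from collections import OrderedDict

import torch

ckpt = torch.load(args.ckpt_path, map_location="cpu")
for model_key in ["model", "module"]:
    if model_key in ckpt:
        ckpt = ckpt[model_key]
        break

new_ckpt = OrderedDict()
for key, value in ckpt.items():
    if key.startswith("encoder."):
        key = key[len("encoder."):]  # pre-training checkpoints prefix the backbone weights
    new_ckpt[key] = value

# strict=False tolerates missing/extra classification-head keys
msg = model.load_state_dict(new_ckpt, strict=False)
print("missing:", msg.missing_keys, "unexpected:", msg.unexpected_keys)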

christian-matroid commented 1 year ago

Thank you for the quick response.

It is recommended that if you are not sure, you still use the original code to extract the features.

Thank you. I will keep this in mind.

In addition, using a model that is only pre-trained, without fine-tuning supervision from high-level semantic hard labels, can be very ineffective for TAD tasks

I see. I was attempting to follow the method mentioned in the Downstream: Temporal Action Localization readme in the InternVideo repository and the InternVideo Paper (section 4.3.1). The hosted features shared there worked very well with the ActionFormer head, and I am trying to replicate their performance by extracting features on my own custom data. If additional fine-tuning was used, can you explain what that might have been?

congee524 commented 1 year ago

InternVideo also uses fine-tuned models. In fact, the TAD tasks in VideoMAE v2 and InternVideo were done by the same person.

christian-matroid commented 1 year ago

InternVideo also uses fine-tuned models

So if I want to extract features on a custom dataset with VideoMAE, I should fine-tune a backbone model on that custom dataset and then perform feature extraction?

congee524 commented 1 year ago

InternVideo also uses fine-tuned models

So if I want to extract features on a specific dataset, I should be fine-tuning a backbone model on the dataset, then performing feature extraction?

I'm not sure, but the model finetuned on K710 should perform best (you need not perform extra supervision on your custom dataset)

christian-matroid commented 1 year ago

Hello @congee524. Thanks so much for your help so far! I've reiterated my remaining questions a little more concisely on this InternVideo issue as it is more relevant and visible. If you know more about the fine-tuning/TAL feature extraction process I would be incredibly grateful if you responded. Thanks again!

congee524 commented 1 year ago

hybrid pretrain -> K710 finetune -> extract TAD features -> ActionFormer finishes the task

As far as I know, this is the pipeline.

christian-matroid commented 1 year ago

@congee524 Thank you for replying. I used the ViT-Giant model fine-tuned on K710 to perform feature extraction on THUMOS, and the exact configs (with only the input dimension changed) hosted on the InternVideo GitHub to benchmark the features with ActionFormer.

I was not able to reproduce the same results as the pre-extracted features (I got 44.77% average mAP versus the reported 71.58%). Perhaps the model used for feature extraction was fine-tuned on THUMOS directly?

congee524 commented 1 year ago

The information is too limited for me to tell what went wrong. We have released our own extracted features in TAD.md; perhaps you can use them to check whether there is a problem with the features you extracted.

christian-matroid commented 1 year ago

@congee524 Thank you for pointing me to these hosted TAD features. I was able to reproduce your results with these features as well. 🎉

To check whether my feature extraction is performing as intended, I performed extraction with extract_tad_feature.py and a fresh installation of VideoMAEv2 using the vit_g_hybrid_pt_1200e_k710_ft.pth weights downloaded from the model weight links document that was shared with me.

python extract_tad_feature.py \
    --data_set THUMOS14 \
    --data_path raw_data/thumos14_videos/test_selection \
    --save_path sample_data/test_selection \
    --model vit_giant_patch14_224 \
    --ckpt_path models/vit_g_hybrid_pt_1200e_k710_ft.pth

I compared my features with the hosted TAD features, and I get different values from the hosted ones. The video I used for the comparison is video_test_0000556.mp4, downloaded directly from the official THUMOS dataset.

shape vit_g_k710 extracted: (504, 1408)
shape vit_g_k710 hosted: (504, 1408)
Difference between features for video_validation_000556.npy:
        total abs diff: 263538.15625
        mean abs diff: 0.3713729977607727
        std diff: 0.5049932599067688
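
(For context, a minimal sketch of such an element-wise comparison, with placeholder file paths, might look like:)

# A minimal sketch of the comparison above; paths are placeholders.
import numpy as np

mine = np.load("sample_data/test_selection/video_test_0000556.npy")
hosted = np.load("hosted_tad_features/video_test_0000556.npy")

print("shape vit_g_k710 extracted:", mine.shape)
print("shape vit_g_k710 hosted:", hosted.shape)

diff = mine - hosted
print("total abs diff:", np.abs(diff).sum())
print("mean abs diff:", np.abs(diff).mean())
print("std diff:", diff.std())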

Perhaps the model weights are different or your features were extracted with a different script?

congee524 commented 1 year ago

Thanks for the info! I'll recheck the extraction script in a few days.

christian-matroid commented 1 year ago

Hi @congee524, thank you so much for your help so far. Have you had a chance to look at the feature extraction script?

christian-matroid commented 1 year ago

Kindly bumping this again.

congee524 commented 1 year ago

Kindly bumping this again.

Sorry for my late reply, I have been rather busy recently. I briefly checked the features earlier and didn't see a problem. Do the features you extracted yourself and the features we released have the same shape?

christian-matroid commented 1 year ago

Hi @congee524, my features are of the same shape (both in frame number and dimension) but have different values. I've uploaded a few examples for direct comparison to this drive link, as well as the raw video data and the model weights of vit_g_hybrid_pt_1200e_k710_ft.pth I used for feature extraction. Let me know if you'd like me to remove the model weights.

JinChow commented 11 months ago

@congee524 Hello, have you successfully run the code of VideoMAE V2? I want to fine-tune it on my own dataset, but I have run into some difficulties. I would appreciate it if you could give me some advice!