MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

Could you please provide the weights of VideoMAE pre-trained on Kinetics-400? I want to use the weights to extract features from THUMOS14 #95

Closed Value-Jack closed 1 year ago

joaopaulq commented 1 year ago

You can use the VideoMAEModel class from Hugging Face. Set output_hidden_states=True to get the hidden states of all layers, or read last_hidden_state if you only need the output of the last layer.

from transformers import VideoMAEImageProcessor, VideoMAEModel
import numpy as np
import torch

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

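# Dummy video of shape (T x C x H x W) with pixel values in [0, 255].
# The processor rescales to [0, 1] and normalizes by default, so do not divide by 255 here.
num_frames = 16
video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224), dtype=np.uint8))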

pixel_values = processor(video, return_tensors="pt").pixel_values

num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

with torch.no_grad():
    out = model(pixel_values, output_hidden_states=True, bool_masked_pos=bool_masked_pos)

# Output of each layer plus the optional initial embedding outputs.
all_layers = out.hidden_states
# Output of the last layer of the model.
last_layer = out.last_hidden_state

source: https://huggingface.co/docs/transformers/model_doc/videomae#videomae
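
The token arithmetic in the snippet above can be checked without downloading the model. The following is a minimal sketch, assuming typical base-size configuration values (image_size=224, patch_size=16, tubelet_size=2, hidden_size=768; these numbers are assumptions here, so read them from model.config in practice) and assuming no bool_masked_pos is passed. A random array stands in for out.last_hidden_state to show one possible way to mean-pool the tokens into a single clip-level feature vector.

```python
import numpy as np

# Assumed config values for a base-size checkpoint (verify against model.config).
image_size, patch_size, tubelet_size, hidden_size = 224, 16, 2, 768
num_frames = 16

# Spatial patches per frame: (224 // 16) ** 2 = 14 * 14 = 196.
num_patches_per_frame = (image_size // patch_size) ** 2
# Each tubelet spans tubelet_size frames: (16 // 2) * 196 = 1568 tokens.
seq_length = (num_frames // tubelet_size) * num_patches_per_frame

# Stand-in for out.last_hidden_state without masking:
# shape (batch, seq_length, hidden_size).
rng = np.random.default_rng(0)
last_hidden_state = rng.standard_normal((1, seq_length, hidden_size)).astype(np.float32)

# Mean-pool over the token axis to get one feature vector per clip.
clip_feature = last_hidden_state.mean(axis=1)

print(seq_length)          # 1568
print(clip_feature.shape)  # (1, 768)
```

Mean pooling is only one choice; keeping the per-token outputs or pooling per temporal tubelet may suit temporal action detection better.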

yztongzhan commented 1 year ago

Hi @Value-Jack ! You can download VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from this link.

Value-Jack commented 1 year ago

> Hi @Value-Jack ! You can download VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from this link.

Could you please tell me why the channel dimension of the THUMOS14 features is 1280? From reading the VideoMAE model, I expected shape[1] to be 768. Do the 1280 channels include flow features, or only RGB features? I would appreciate it if you could answer these questions!

Value-Jack commented 1 year ago

@yztongzhan