Open svenssona opened 1 year ago
Hi @svenssona, were you able to extract the VideoMAE features? Thanks.
@svenssona I'm also curious whether you were able to solve this. It seems like a pretty crucial part of this repo, and I'm not sure why it is not included as a script.
Update: there is a script in the VideoMAE-V2 repository, but it does not support batched inference, so feature extraction is quite slow.
Hi,
any update on this issue?
I think this script is enough to extract feats:
import os
import torch
import decord
import InternVideo
import numpy as np
from tqdm import tqdm
from pathlib import Path
from torchvision import transforms
from InternVideo import video_transform


def get_video_batch(video_reader: decord.VideoReader, indices: np.ndarray):
    # With the torch bridge set, get_batch returns a (T, H, W, C) tensor.
    video = video_reader.get_batch(indices).byte()
    # Reorder to (C, T, H, W).
    video = video.permute(3, 0, 1, 2)
    input_mean = [0.48145466, 0.4578275, 0.40821073]
    input_std = [0.26862954, 0.26130258, 0.27577711]
    crop_size, scale_size = 224, 256
    trans = transforms.Compose([
        video_transform.TensorToNumpy(),
        video_transform.Resize(scale_size),
        video_transform.CenterCrop(crop_size),
        video_transform.ClipToTensor(channel_nb=3),
        video_transform.Normalize(mean=input_mean, std=input_std)
    ])
    video = trans(video)
    return video


def find_latest_divisable(number: int, division: int = 8):
    # Largest multiple of `division` that is <= number.
    while number % division != 0:
        number -= 1
    return number


def extract_feat(video_path: str):
    # Set the bridge before reading so that get_batch returns torch tensors.
    decord.bridge.set_bridge('torch')
    video_reader = decord.VideoReader(video_path, num_threads=1, ctx=decord.cpu(0))
    stride = 128
    video_features = []
    # First frame not covered by a full chunk; stays 0 when the video is
    # shorter than one stride (the loop below then never runs).
    next_start = 0
    for start_index in range(0, len(video_reader) - stride, stride):
        video = get_video_batch(video_reader, np.arange(start_index, start_index + stride)).cuda()
        with torch.no_grad():
            feat = model.encode_video(video.unsqueeze(0))
        video_features.append(feat.cpu())
        next_start = start_index + stride
    # Leftover frames, trimmed so the last index is a multiple of 8.
    remaining_indices = np.arange(next_start, find_latest_divisable(len(video_reader)))
    if len(remaining_indices) > 8:
        video = get_video_batch(video_reader, remaining_indices).cuda()
        with torch.no_grad():
            feat = model.encode_video(video.unsqueeze(0))
        video_features.append(feat.cpu())
    return torch.vstack(video_features)


if __name__ == "__main__":
    video_folder = "/tmp/videos"
    save_folder = "/tmp/features"
    # `model` is a module-level global that extract_feat reads.
    model = InternVideo.load_model("./models/InternVideo-MM-B-16.ckpt").cuda()
    video_list = list(Path(video_folder).rglob("*.mp4"))
    for video_path in tqdm(video_list, total=len(video_list), desc="extracting InternVideo features"):
        video_path = str(video_path)
        save_path = video_path.replace(video_folder, save_folder).replace(".mp4", ".pt")
        assert video_path != save_path
        Path(save_path).parent.mkdir(exist_ok=True, parents=True)
        # Skip videos whose features were already saved.
        if os.path.exists(save_path):
            continue
        feat = extract_feat(video_path)
        torch.save(feat, save_path)
One point confuses me: the paper gives the number of video frames as 16, but the encoder in the code uses 8. The code above uses 8 as the number of video frames.
The stride acts like a batch size, so it can be adjusted to fit the available GPU memory.
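For anyone tuning the stride, here is a minimal sketch of how the script above splits a video into frame ranges, written without decord or a GPU so it can be checked on its own. The helper name `chunk_ranges` is hypothetical, and treating a video shorter than one stride as a single tail chunk is my assumption about the intended behaviour.

```python
def chunk_ranges(num_frames, stride=128, division=8):
    """Return (start, stop) frame ranges mirroring the chunking in the script.

    Hypothetical helper for illustration; it is not part of InternVideo.
    """
    ranges = []
    next_start = 0
    # Full chunks of `stride` frames; the last stretch is left for the tail.
    for start in range(0, num_frames - stride, stride):
        ranges.append((start, start + stride))
        next_start = start + stride
    # Tail: trim the end to a multiple of `division` and keep the tail
    # only if it is longer than `division` frames, as the script does.
    tail_stop = num_frames - num_frames % division
    if tail_stop - next_start > division:
        ranges.append((next_start, tail_stop))
    return ranges


if __name__ == "__main__":
    for n in (100, 136, 300):
        print(n, chunk_ranges(n))
```

A smaller `stride` gives more, shorter chunks per video, which should lower peak GPU memory at the cost of more forward passes.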
Hi @EmreOzkose, where did you find the model used here, "./models/InternVideo-MM-B-16.ckpt"?
Hello, thanks for releasing the code of this cool paper!
I would like to try Temporal Action Localization on my own custom data. I have generated raw_frames for each video, but I am struggling to understand how to extract the features from the images. For example, how do I get the 1280-dimensional feature vectors from VideoMAE that you extracted for the THUMOS dataset?
Any help would be kindly appreciated!
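To make the question concrete: clip-level features for localization are commonly computed over overlapping windows of frames, with one feature vector per window. Below is a minimal sketch of the window indexing only. The helper name `sliding_windows` and the defaults `clip_len=16` and `step=4` are placeholders I chose for illustration; the values actually used for the released features are not stated in this thread.

```python
def sliding_windows(num_frames, clip_len=16, step=4):
    """Return one list of frame indices per clip.

    Hypothetical helper; clip_len and step are assumed placeholder values.
    Frames at the end that do not fill a whole clip are dropped.
    """
    return [
        list(range(start, start + clip_len))
        for start in range(0, num_frames - clip_len + 1, step)
    ]


if __name__ == "__main__":
    windows = sliding_windows(40)
    # Each window would be loaded from raw_frames and passed to the
    # video encoder to produce a single feature vector.
    print(len(windows), windows[0][0], windows[-1][-1])
```

Each index list could then be used to load the matching images from raw_frames before running the encoder on that clip.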