OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Extract features from custom data. #8

Open svenssona opened 1 year ago

svenssona commented 1 year ago

Hello, thanks for releasing the code of this cool paper!

I would like to try Temporal Action Localization on my own custom data. I have generated raw_frames for each video, but I struggle to understand how to extract features from these images. For example, how do I get the 1280-dimensional feature vectors from VideoMAE, like the ones you extracted for the THUMOS dataset?

Any help would be kindly appreciated!

abhinine4 commented 1 year ago

Hi @svenssona, were you able to extract the VideoMAE features? Thanks

christian-matroid commented 1 year ago

@svenssona Also curious if you were able to solve this. Feature extraction seems like a pretty crucial part of this repo; I'm not sure why it is not included as a script.

Update: there is a script in the VideoMAE-V2 repository, but it does not support batched inference, so feature extraction is quite slow.
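
For anyone landing here, below is a rough sketch of what batched inference around that script could look like. The `model` and `transform` callables are placeholders for whatever backbone and preprocessing the VideoMAE-V2 extraction script builds (not its actual API), and `clip_len`, `frame_stride`, and `batch_size` are illustrative values:

import decord
import numpy as np
import torch

def extract_features_batched(video_path, model, transform,
                             clip_len=16, frame_stride=4, batch_size=8,
                             device="cuda"):
    # Slide a clip_len-frame window over the video and run the clips in batches.
    vr = decord.VideoReader(video_path, num_threads=1, ctx=decord.cpu(0))
    feats, clips = [], []

    def flush():
        with torch.no_grad():
            feats.append(model(torch.stack(clips).to(device)).cpu())
        clips.clear()

    for start in range(0, max(len(vr) - clip_len + 1, 1), frame_stride):
        idx = np.arange(start, start + clip_len).clip(0, len(vr) - 1)
        frames = vr.get_batch(idx).asnumpy()  # (T, H, W, C) uint8
        clips.append(transform(torch.from_numpy(frames)))  # whatever layout the model expects, e.g. (C, T, H, W)
        if len(clips) == batch_size:
            flush()
    if clips:
        flush()
    return torch.cat(feats)  # (num_clips, D)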

EmreOzkose commented 5 months ago

Hi,

any update on this issue?

EmreOzkose commented 5 months ago

I think this script is enough to extract features:

import os
import torch
import decord
import InternVideo
import numpy as np

from tqdm import tqdm
from pathlib import Path
from torchvision import transforms
from InternVideo import video_transform

def get_video_batch(video_reader: decord.VideoReader, indices: np.ndarray):
    # Frames come back as (T, H, W, C) uint8; permute to (C, T, H, W) for the transforms.
    video = video_reader.get_batch(indices).byte()
    video = video.permute(3, 0, 1, 2)
    input_mean = [0.48145466, 0.4578275, 0.40821073]
    input_std = [0.26862954, 0.26130258, 0.27577711]
    crop_size, scale_size = 224, 256
    trans = transforms.Compose([
        video_transform.TensorToNumpy(),
        video_transform.Resize(scale_size),
        video_transform.CenterCrop(crop_size),
        video_transform.ClipToTensor(channel_nb=3),
        video_transform.Normalize(mean=input_mean, std=input_std)
    ])

    video = trans(video)
    return video

def find_latest_divisible(number: int, division: int = 8):
    # Largest value <= number that is divisible by `division` (the encoder's frame count).
    while number % division != 0:
        number -= 1
    return number

def extract_feat(video_path: str):
    # Uses the global `model` loaded in __main__.
    video_reader = decord.VideoReader(video_path, num_threads=1, ctx=decord.cpu(0))
    decord.bridge.set_bridge('torch')

    stride = 128

    video_features = []
    next_start = 0
    for start_index in range(0, len(video_reader) - stride, stride):
        video = get_video_batch(video_reader, np.arange(start_index, start_index + stride)).cuda()
        with torch.no_grad():
            feat = model.encode_video(video.unsqueeze(0))
            video_features.append(feat.cpu())
        next_start = start_index + stride

    # Handle the remaining tail, trimmed so its length is divisible by 8.
    remaining_indices = np.arange(next_start, find_latest_divisible(len(video_reader)))
    if len(remaining_indices) > 8:
        video = get_video_batch(video_reader, remaining_indices).cuda()
        with torch.no_grad():
            feat = model.encode_video(video.unsqueeze(0))
            video_features.append(feat.cpu())

    return torch.vstack(video_features)

if __name__ == "__main__":
    video_folder = "/tmp/videos"
    save_folder = "/tmp/features"

    model = InternVideo.load_model("./models/InternVideo-MM-B-16.ckpt").cuda()

    video_list = list(Path(video_folder).rglob("*.mp4"))
    for video_path in tqdm(video_list, total=len(video_list), desc="extracting InternVideo features"):
        video_path = str(video_path)
        save_path = video_path.replace(video_folder, save_folder).replace(".mp4", ".pt")
        assert video_path != save_path
        Path(save_path).parent.mkdir(exist_ok=True, parents=True)
        if os.path.exists(save_path): continue

        feat = extract_feat(video_path)
        torch.save(feat, save_path)

One point confuses me: in the paper the number of video frames is given as 16, but the encoder uses 8 in the code. The code above therefore uses 8 as the number of video frames.

The stride acts like a batch size, so it can be adjusted according to the available GPU memory.
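
As a small usage check (the path here is just an example output location following the save layout in the script above), the saved tensor has one row per processed window:

import torch

feat = torch.load("/tmp/features/example.pt")  # example path matching the save layout above
print(feat.shape)  # (num_windows, feature_dim)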

mohamedali05 commented 3 months ago

Hi @EmreOzkose. I wanted to ask where you found the model used here, "./models/InternVideo-MM-B-16.ckpt"?

EmreOzkose commented 2 months ago

Hi, it is here:

https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/pretrain/InternVideo-MM-B-16.ckpt
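
In case it is useful, a small snippet that downloads the checkpoint into the path the script above expects (adjust the destination if your layout differs):

import urllib.request
from pathlib import Path

url = "https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/pretrain/InternVideo-MM-B-16.ckpt"
dest = Path("./models/InternVideo-MM-B-16.ckpt")
dest.parent.mkdir(parents=True, exist_ok=True)
if not dest.exists():
    urllib.request.urlretrieve(url, str(dest))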