Hi @rajeevchhabra, You can replace this line https://github.com/KaiyangZhou/vsumm-reinforce/blob/master/vsum_train.py#L84 with your own features, which should be of dimension (num_frames, feature_dim). You also need to modify https://github.com/KaiyangZhou/vsumm-reinforce/blob/master/vsum_train.py#L69 to store baseline rewards according to your own datasets.
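For example, a hypothetical sketch of loading such features from an h5 file (the file name and the 'video_1' key are placeholders; only the (num_frames, feature_dim) shape matters):
import h5py

with h5py.File('my_dataset.h5', 'r') as f:  # placeholder file name
    my_features = f['video_1/features'][...]  # numpy array of shape (num_frames, feature_dim)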
Hi, is it possible to share the code you used to create the h5 dataset so that I can follow it to create my own? It doesn't even have to be runnable. Thanks!
Hi @zijunwei, You can follow the code below to create your own data:
import h5py

h5_file_name = 'blah blah blah'
f = h5py.File(h5_file_name, 'w')
# video_names is a list of strings containing the
# name of a video, e.g. 'video_1', 'video_2'
for name in video_names:
    # data_of_name is a placeholder for the array you computed for that key
    f.create_dataset(name + '/features', data=data_of_name)
    f.create_dataset(name + '/gtscore', data=data_of_name)
    f.create_dataset(name + '/user_summary', data=data_of_name)
    f.create_dataset(name + '/change_points', data=data_of_name)
    f.create_dataset(name + '/n_frame_per_seg', data=data_of_name)
    f.create_dataset(name + '/n_frames', data=data_of_name)
    f.create_dataset(name + '/picks', data=data_of_name)
    f.create_dataset(name + '/n_steps', data=data_of_name)
    f.create_dataset(name + '/gtsummary', data=data_of_name)
    f.create_dataset(name + '/video_name', data=data_of_name)
f.close()
For a detailed description of the data format, please refer to the readme.txt in the dataset which you downloaded via wget.
Instructions for h5py can be found at http://docs.h5py.org/en/latest/quick.html
Let me know if you have any problems.
Thanks! Here is the readme.txt file you referred to:
/key
/features 2D-array with shape (n_steps, feature-dimension)
/gtscore 1D-array with shape (n_steps), stores ground truth importance score
/user_summary 2D-array with shape (num_users, n_frames), each row is a binary vector
/change_points 2D-array with shape (num_segments, 2), each row stores indices of a segment
/n_frame_per_seg 1D-array with shape (num_segments), indicates number of frames in each segment
/n_frames number of frames in original video
/picks positions of subsampled frames in original video
/n_steps number of subsampled frames
/gtsummary 1D-array with shape (n_steps), ground truth summary provided by user
/video_name (optional) original video name, only available for SumMe dataset
How is gtscore computed, and how is it different from gtsummary or the average of user_summary? I didn't see you using gtscore or gtsummary in testing; just asking out of curiosity. Thanks!
gtscore and gtsummary are used for training only. I should have clarified this. gtscore is the average of multiple importance scores (used by the regression loss). gtsummary is a binary vector indicating the indices of keyframes, and is provided by the original datasets as well (this label can be used for the maximum likelihood loss). user_summary contains multiple key-clips given by human annotators, and we need to compare our machine summary with each one of the user summaries. Hope this clarifies.
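As a hedged illustration (not the repo's preprocessing code), gtscore can be thought of as the frame-wise average of the per-user importance annotations; the array below is a random stand-in for real annotations:
import numpy as np

user_scores = np.random.rand(15, 320)   # stand-in for real (num_users, n_steps) importance annotations
gtscore = user_scores.mean(axis=0)      # 1D array of shape (n_steps,)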
Thanks! It's very helpful!
@KaiyangZhou I'm trying to create an .h5 file for my own video. After reading datasets/readme.txt I understood that I need data like features, n_frames, picks, n_steps (I could only understand what n_frames is :| ).
But what exactly are features? I understand they form a numpy matrix of shape (n_steps, feature_dimension). But what are they, and how do I extract them from the frames of a given video? Could you please give me a more detailed description of them?
I've glanced through your paper, but I couldn't find details about these.
Hi @chandra-siri,
features contains feature vectors representing video frames. Each video frame can be represented by a feature vector (encoding some semantic meaning), extracted by a pretrained convolutional neural network (e.g. GoogLeNet). picks is an array storing the positions of the subsampled video frames. We do not process every video frame since adjacent frames are very similar. We can subsample a video at 2 frames per second or 1 frame per second, which results in fewer, but still informative, frames. picks is useful when we want to interpolate the subsampled frames back into the original video (say you have obtained importance scores for the subsampled frames and you want scores for the entire video; picks indicates which frames were scored, and the scores of the surrounding frames can be filled in from them).
How do I extract them from the frames of a given video?
You can use off-the-shelf feature extractors to achieve this, e.g. with PyTorch. First, load the feature extractor, e.g. a pretrained neural network. Second, loop over the video frames and use the feature extractor to extract features from them. Each frame will be represented by a long feature vector; if you use GoogLeNet, you will end up with a 1024-dimensional feature vector per frame. Third, concatenate the extracted features to form a feature matrix, and save it to the h5 file as specified in the readme.txt.
The pseudocode below might make this clearer:
features = []
for frame in video_frames:
    # frame is a numpy array of shape (channel, height, width)
    # do some preprocessing such as normalization
    frame = preprocess(frame)
    # apply the feature extractor to this frame
    feature = feature_extractor(frame)
    # save the feature
    features.append(feature)
features = concatenate(features)  # now shape is (n_steps, feature_dimension)
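A more concrete sketch (my own illustration, not a script from this repo), assuming torchvision's pretrained GoogLeNet for the 1024-dim features and roughly 2 fps subsampling of a 30 fps video; the path and sampling rate are placeholders:
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = models.googlenet(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classifier, keep the 1024-dim pooled feature
model.eval().to(device)

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, sample_rate=15):
    # keep one frame every `sample_rate` frames (~2 fps for a 30 fps video)
    cap = cv2.VideoCapture(video_path)
    features, picks, idx = [], [], 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if idx % sample_rate == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            x = preprocess(rgb).unsqueeze(0).to(device)
            with torch.no_grad():
                feat = model(x).squeeze(0).cpu().numpy()  # shape (1024,)
            features.append(feat)
            picks.append(idx)
        idx += 1
    cap.release()
    return np.stack(features), np.array(picks)  # (n_steps, 1024) and (n_steps,)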
Hope this helps.
@KaiyangZhou This is very informative and helpful. I'll try out what you've mentioned, using GoogLeNet (Inception v3), and let you know. Thanks a lot!
@KaiyangZhou As you suggested, I was able to extract the frames. But in order to get a summary I also need change_points. Could you tell me what change_points is, and also what num_segments is?
@chandra-siri change_points corresponds to shot transitions, which are obtained by temporal segmentation approaches that segment a video into disjoint shots. num_segments is the total number of segments a video is cut into. Please refer to this paper and this paper if you are unfamiliar with the pipeline.
Specifically, change_points looks like
change_points = [
    [0, 10],
    [11, 20],
    [21, 30],
]
This means the video is segmented into three parts. The first part ranges from frame 0 to frame 10, the second part ranges from frame 11 to frame 20, and so forth.
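As a small, hedged illustration of how n_frame_per_seg relates to change_points (segment boundaries are inclusive, as in the example above):
import numpy as np

change_points = np.array([[0, 10], [11, 20], [21, 30]])
n_frame_per_seg = change_points[:, 1] - change_points[:, 0] + 1  # -> array([11, 10, 10])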
How do I know which key in the dataset corresponds to which video in the SumMe dataset?
@samrat1997
SumMe: the video name is stored in video_i/video_name.
TVSum: video_1 to video_50 correspond to the same order as in ydata-tvsum50.mat, which is the original MATLAB file provided by TVSum.
@KaiyangZhou ... Thank you. I just realized that.
@KaiyangZhou Hi. I've been trying to use the code to test on my dataset. I used the pretrained Google Inception v3 PyTorch model to generate features, and it has a 1000-class output, so my feature shape is (num_frames, 1000). However, the dataset used here has 1024-dimensional features. Can you help regarding this? Will I have to modify and retrain the Inception model?
@harora the feature dimension does not matter; you can just feed (num_frames, any_feature_dim) to the algorithm, you don't need to retrain the model.
That said, it is strange to use the class logits as feature vectors; it would make more sense to use the layer before the classifier, e.g. 1024-dim for GoogLeNet or 2048-dim for ResNet.
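For instance, a hedged sketch of stripping the classifier from torchvision's Inception v3 to get the 2048-dim pooled feature instead of the 1000-class logits (the random input is just a placeholder):
import torch
import torchvision.models as models

model = models.inception_v3(pretrained=True)
model.fc = torch.nn.Identity()      # drop the 1000-class head, keep the 2048-dim feature
model.eval()

x = torch.randn(1, 3, 299, 299)     # Inception v3 expects 299x299 inputs
with torch.no_grad():
    feat = model(x)                 # shape (1, 2048)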
@KaiyangZhou Hi, were the change_points generated manually? If not, could you show me the associated code?
Is gtscore generated manually by the user? If not, could you show me the associated code?
@bersalimahmoud change_points are obtained by a temporal segmentation method. gtscore is the average of human scores, so it can be used for supervised training (you won't need this anyway).
@KaiyangZhou Regarding visualizing the summary, the readme says:
You can use summary2video.py to transform the binary machine_summary to a real summary video. You need to have a directory containing video frames. The code will automatically write summary frames to a video where the frame rate can be controlled. Use the following command to generate a .mp4 video.
Where or how can I get the frames? Can I get frames from the .h5 files, or shall I create frames from the raw videos?
Thank you very much !
@liuhaixiachina you need to decompose a video into frames before doing other things, e.g. feature extraction. You can use ffmpeg or Python to do it.
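For example, a minimal sketch with OpenCV (paths and the filename pattern are placeholders; check summary2video.py for the frame naming it expects):
import os
import cv2

cap = cv2.VideoCapture('video.mp4')     # placeholder input video
os.makedirs('frames', exist_ok=True)
idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imwrite(os.path.join('frames', '{:06d}.jpg'.format(idx)), frame)
    idx += 1
cap.release()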
@KaiyangZhou Hi, I am trying to use the code to test on my own video. I used a pretrained model to generate features and it has a 4096-dimensional output. I saw you said "the feature dimension does not matter" above, but I got "RuntimeError: input.size(-1) must be equal to input_size. Expected 1024, got 4096".
Could you please tell me how to solve this issue?
Thanks a lot!
@babyjie57 you need to change the argument input_dim=4096
@KaiyangZhou Thanks for your reply. I also added '--input-dim 4096', but I got 'While copying the parameter named "rnn.weight_ih_l0_reverse", whose dimensions in the model are torch.Size([1024, 4096]) and whose dimensions in the checkpoint are torch.Size([1024, 1024]).'
Can you please tell me how to solve this issue?
Thanks!
I presume you are loading a model which was trained with features of 1024 dimensions but initialized with feature dimension = 4096.
Can you also publish the script for the KTS you used to generate the change points?
@Mctigger you can find the code here http://lear.inrialpes.fr/people/potapov/med_summaries.php
I've got a question: if I want to use my own dataset but it has no labels, what should I do with user_summary, gtscore and gtsummary when I construct the HDF5 file?
Also, I see these three labels are only used in the evaluation process; does this mean I can just delete them from both the HDF5 file and the evaluation function? (I mean in the PyTorch implementation.)
Moreover, if I want to use result.json to generate a summary video for a raw video, can I delete these three labels?
Have you solved this problem? I want to use my own video data but I don't know how to deal with user_summary, gtscore and gtsummary.
How did you convert the video into signal for Kernel Temporal Segmentation (KTS) ?
@KaiyangZhou How did you convert the video into a signal for Kernel Temporal Segmentation (KTS)? Did you use CNN features for KTS? Are the CNN features subsampled and extracted from the video at 2 frames per second or 1 frame per second? Could you please share your code with me? Thank you very much!
How did you convert the video into signal for Kernel Temporal Segmentation (KTS) ?
You can decompose a video using either ffmpeg or OpenCV. For the latter, there is example code on the OpenCV website. You can write something like:
import numpy as np
import cv2

cap = cv2.VideoCapture('my_video.mp4')  # open the video file (placeholder path)
video_features = []
while True:
    # capture frame-by-frame
    ret, frame = cap.read()
    if not ret:  # no more frames
        break
    # maybe skip this frame for downsampling
    # feature extraction
    feature = feature_extractor(frame)  # or perform extraction on a minibatch, which leverages the GPU
    # store feature
    video_features.append(feature)
cap.release()

summary = video_summarizer(video_features)
Did you use CNN features for KTS? Are the CNN features subsampled and extracted from the video at 2 frames per second or 1 frame per second?
Yes. You can use CNN features, which capture high-level semantics. Downsampling is a common technique as neighbouring frames are redundant; 2 fps / 1 fps is good.
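A heavily hedged sketch of the KTS step (cpd_auto and its arguments are taken from the KTS demo linked earlier; verify the exact signature in that code, and the number of change points below is an arbitrary assumption):
import numpy as np
from cpd_auto import cpd_auto           # shipped with the KTS code linked above

features = np.random.rand(200, 1024)    # stand-in for (n_steps, feature_dim) subsampled CNN features
K = np.dot(features, features.T)        # frame-to-frame similarity kernel
max_ncp = features.shape[0] // 20       # rough upper bound on the number of change points (assumption)
cps, _ = cpd_auto(K, max_ncp, 1)        # change-point indices in the subsampled sequence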
Annotations are not required for training. Only frame features are required by the algorithm. You can qualitatively evaluate the results by applying the trained model to unseen videos and watching the summaries.
@KaiyangZhou Could you tell me where to download the original videos (SumMe and TVSum)? Thank you very much!
Same question as above. Could you please tell me where to download the original videos (SumMe and TVSum)? Thank you very much in advance!
@KaiyangZhou Can you please tell me what picks is and how we can calculate it? What are its dimensions?
@KaiyangZhou, how do you use KTS to generate change points? I used the official KTS code with CNN features for each frame, but I get the same number of segments for every video. Is there any problem?
@KaiyangZhou To get the change points, should the frames of a video be the input X in demo.py, or should the features of each frame be the input?
@wjb123 Hi, did you solve this problem? → "How do you use KTS to generate change points? I used the official KTS code with CNN features for each frame, but I get the same number of segments for every video. Is there any problem?"
@chenchch94 This is my forked repository. You can find generate_dataset.py in the "utils" directory. Good luck!
Hi, I could generate the .h5 file for my own dataset, however, my dataset has no annotations. Is it possible to use your code without annotated videos? If so, how? Thanks!
Hi @neda60, for training and evaluation, it's not possible. But just for testing, it is possible. To test, you need "features", "picks", "n_frames", "change_points", and "n_frame_per_seg".
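To make this concrete, a hedged sketch of a test-only h5 file containing just those keys (all arrays below are random placeholders; shapes follow readme.txt):
import h5py
import numpy as np

n_frames = 4800
picks = np.arange(0, n_frames, 15)                               # placeholder subsampled frame positions
features = np.random.rand(len(picks), 1024).astype(np.float32)   # placeholder (n_steps, feature_dim)
change_points = np.array([[0, 2399], [2400, 4799]])              # placeholder segments
n_frame_per_seg = change_points[:, 1] - change_points[:, 0] + 1

with h5py.File('my_test_dataset.h5', 'w') as f:
    f.create_dataset('video_1/features', data=features)
    f.create_dataset('video_1/picks', data=picks)
    f.create_dataset('video_1/n_frames', data=n_frames)
    f.create_dataset('video_1/change_points', data=change_points)
    f.create_dataset('video_1/n_frame_per_seg', data=n_frame_per_seg)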
@KaiyangZhou Can you please share the code for creating the .h5 file? How should I deal with gtscore, gtsummary and user_summary?
Hello, I want to use your RL code to extract key frames. I currently use a deep network to extract features and store them in an .h5 file, but I don't have the other attributes such as gtscore and gtsummary (I guess the dataset needs at least these attributes). For now I create gtscore as an all-ones numpy array, but I don't know whether that is right or wrong; if it is wrong, how can I compute gtscore? Meanwhile, I create gtsummary by randomly sampling some frames; should I sample uniformly?
@liuhaixiachina you need to decompose a video into frames before doing other things, e.g. feature extraction. You can use ffmpeg or Python to do it.
I followed the steps mentioned in the README, but it doesn't provide video frames either. Shall I create frames from the raw videos? Are there any missing steps in the README? How can I decompose a video using ffmpeg or Python when there are no videos in the datasets? I also read the code of summary2video.py. Should I decompose "result.h5"?
Hi, I have been able to run your algorithm on my machine (both training and test datasets). Now I would like to apply it to my own dataset (my videos are not compressed to .h5). How do I do that? What functions would I need to modify? Please guide me. The readme says:
Visualize summary: You can use summary2video.py to transform the binary machine_summary to a real summary video. You need to have a directory containing video frames. The code will automatically write summary frames to a video where the frame rate can be controlled. Use the following command to generate a .mp4 video.
Where or how can I get the frames? As above: I followed the steps mentioned in the README and it doesn't provide video frames either. Shall I create frames from the raw videos, or should I decompose "result.h5"?