Hi @rajeevchhabra, You can replace this line https://github.com/KaiyangZhou/vsumm-reinforce/blob/master/vsum_train.py#L84 with your own features, which should be of dimension (num_frames, feature_dim). You also need to modify https://github.com/KaiyangZhou/vsumm-reinforce/blob/master/vsum_train.py#L69 to store baseline rewards according to your own datasets.
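For example, a hypothetical sketch of loading such features from an h5 file (the file name and the 'video_1' key are placeholders; only the (num_frames, feature_dim) shape matters):
import h5py

with h5py.File('my_dataset.h5', 'r') as f:  # placeholder file name
    my_features = f['video_1/features'][...]  # numpy array of shape (num_frames, feature_dim)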
Hi, is it possible to share the code you used to create the h5 dataset so that I can follow it to create my own? It doesn't even have to be runnable. Thanks!
Hi @zijunwei, You can follow the code below to create your own data:
import h5py

h5_file_name = 'blah blah blah'
f = h5py.File(h5_file_name, 'w')
# video_names is a list of strings containing the
# name of a video, e.g. 'video_1', 'video_2'
for name in video_names:
    # data_of_name is a placeholder for the array you computed for that key
    f.create_dataset(name + '/features', data=data_of_name)
    f.create_dataset(name + '/gtscore', data=data_of_name)
    f.create_dataset(name + '/user_summary', data=data_of_name)
    f.create_dataset(name + '/change_points', data=data_of_name)
    f.create_dataset(name + '/n_frame_per_seg', data=data_of_name)
    f.create_dataset(name + '/n_frames', data=data_of_name)
    f.create_dataset(name + '/picks', data=data_of_name)
    f.create_dataset(name + '/n_steps', data=data_of_name)
    f.create_dataset(name + '/gtsummary', data=data_of_name)
    f.create_dataset(name + '/video_name', data=data_of_name)
f.close()
For a detailed description of the data format, please refer to the readme.txt in the dataset which you downloaded via wget.
Instructions for h5py can be found at http://docs.h5py.org/en/latest/quick.html
Let me know if you have any problems.
Thanks! Here is the readme.txt file you referred to:
/key
/features 2D-array with shape (n_steps, feature-dimension)
/gtscore 1D-array with shape (n_steps), stores ground truth importance score
/user_summary 2D-array with shape (num_users, n_frames), each row is a binary vector
/change_points 2D-array with shape (num_segments, 2), each row stores indices of a segment
/n_frame_per_seg 1D-array with shape (num_segments), indicates number of frames in each segment
/n_frames number of frames in original video
/picks positions of subsampled frames in original video
/n_steps number of subsampled frames
/gtsummary 1D-array with shape (n_steps), ground truth summary provided by user
/video_name (optional) original video name, only available for SumMe dataset
How is gtscore computed, and how is it different from gtsummary or the average of user_summary? I didn't see you using gtscore or gtsummary in testing; just asking out of curiosity. Thanks!
gtscore and gtsummary are used for training only. I should have clarified this. gtscore is the average of multiple importance scores (used by the regression loss). gtsummary is a binary vector indicating the indices of keyframes, and is provided by the original datasets as well (this label can be used for the maximum likelihood loss). user_summary contains multiple key-clips given by human annotators, and we need to compare our machine summary with each one of the user summaries. Hope this clarifies.
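As a hedged illustration (not the repo's preprocessing code), gtscore can be thought of as the frame-wise average of the per-user importance annotations; the array below is a random stand-in for real annotations:
import numpy as np

user_scores = np.random.rand(15, 320)   # stand-in for real (num_users, n_steps) importance annotations
gtscore = user_scores.mean(axis=0)      # 1D array of shape (n_steps,)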
Thanks! It's very helpful!
@KaiyangZhou I'm trying to create an .h5 file for my own video. After reading datasets/readme.txt I understood that I need data like features, n_frames, picks, n_steps (I could only understand what n_frames is :| ).
But what exactly are features? I understand they form a numpy matrix of shape (n_steps, feature_dimension). But what are they, and how do I extract them from the frames of a given video? Could you please give me a more detailed description of them?
I've glanced through your paper, but I couldn't find details about these.
Hi @chandra-siri,
features contains feature vectors representing video frames. Each video frame can be represented by a feature vector (encoding some semantic meaning), extracted by a pretrained convolutional neural network (e.g. GoogLeNet). picks is an array storing the positions of the subsampled video frames. We do not process every video frame since adjacent frames are very similar. We can subsample a video at 2 frames per second or 1 frame per second, which results in fewer, but still informative, frames. picks is useful when we want to interpolate the subsampled frames back into the original video (say you have obtained importance scores for the subsampled frames and you want scores for the entire video; picks indicates which frames were scored, and the scores of the surrounding frames can be filled in from them).
How do I extract them from the frames of a given video?
You can use off-the-shelf feature extractors to achieve this, e.g. with PyTorch. First, load the feature extractor, e.g. a pretrained neural network. Second, loop over the video frames and use the feature extractor to extract features from them. Each frame will be represented by a long feature vector; if you use GoogLeNet, you will end up with a 1024-dimensional feature vector per frame. Third, concatenate the extracted features to form a feature matrix, and save it to the h5 file as specified in the readme.txt.
The pseudocode below might make this clearer:
features = []
for frame in video_frames:
    # frame is a numpy array of shape (channel, height, width)
    # do some preprocessing such as normalization
    frame = preprocess(frame)
    # apply the feature extractor to this frame
    feature = feature_extractor(frame)
    # save the feature
    features.append(feature)
features = concatenate(features)  # now shape is (n_steps, feature_dimension)
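A more concrete sketch (my own illustration, not a script from this repo), assuming torchvision's pretrained GoogLeNet for the 1024-dim features and roughly 2 fps subsampling of a 30 fps video; the path and sampling rate are placeholders:
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = models.googlenet(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classifier, keep the 1024-dim pooled feature
model.eval().to(device)

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, sample_rate=15):
    # keep one frame every `sample_rate` frames (~2 fps for a 30 fps video)
    cap = cv2.VideoCapture(video_path)
    features, picks, idx = [], [], 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if idx % sample_rate == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            x = preprocess(rgb).unsqueeze(0).to(device)
            with torch.no_grad():
                feat = model(x).squeeze(0).cpu().numpy()  # shape (1024,)
            features.append(feat)
            picks.append(idx)
        idx += 1
    cap.release()
    return np.stack(features), np.array(picks)  # (n_steps, 1024) and (n_steps,)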
Hope this helps.
@KaiyangZhou This is very informative and helpful. I'll try out what you've mentioned, using GoogLeNet (Inception v3), and let you know. Thanks a lot!
@KaiyangZhou As you suggested, I was able to extract the frames. But in order to get a summary I also need change_points. Could you tell me what change_points is, and also what num_segments is?
@chandra-siri change_points corresponds to shot transitions, which are obtained by temporal segmentation approaches that segment a video into disjoint shots. num_segments is the total number of segments a video is cut into. Please refer to this paper and this paper if you are unfamiliar with the pipeline.
Specifically, change_points looks like
change_points = [
    [0, 10],
    [11, 20],
    [21, 30],
]
This means the video is segmented into three parts. The first part ranges from frame 0 to frame 10, the second part ranges from frame 11 to frame 20, and so forth.
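As a small, hedged illustration of how n_frame_per_seg relates to change_points (segment boundaries are inclusive, as in the example above):
import numpy as np

change_points = np.array([[0, 10], [11, 20], [21, 30]])
n_frame_per_seg = change_points[:, 1] - change_points[:, 0] + 1  # -> array([11, 10, 10])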
How do I know which key in the dataset corresponds to which video in the SumMe dataset?
@samrat1997
SumMe: the video name is stored in video_i/video_name.
TVSum: video_1 to video_50 correspond to the same order as in ydata-tvsum50.mat, which is the original MATLAB file provided by TVSum.
@KaiyangZhou ... Thank you. I just realized that.
@KaiyangZhou Hi. I've been trying to use the code to test on my dataset. I used the pretrained Google Inception v3 PyTorch model to generate features, and it has a 1000-class output, so my feature shape is (num_frames, 1000). However, the dataset used here has 1024-dimensional features. Can you help regarding this? Will I have to modify and retrain the Inception model?
@harora the feature dimension does not matter; you can just feed (num_frames, any_feature_dim) to the algorithm, you don't need to retrain the model.
That said, it is strange to use the class logits as feature vectors; it would make more sense to use the layer before the classifier, e.g. 1024-dim for GoogLeNet or 2048-dim for ResNet.
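For instance, a hedged sketch of stripping the classifier from torchvision's Inception v3 to get the 2048-dim pooled feature instead of the 1000-class logits (the random input is just a placeholder):
import torch
import torchvision.models as models

model = models.inception_v3(pretrained=True)
model.fc = torch.nn.Identity()      # drop the 1000-class head, keep the 2048-dim feature
model.eval()

x = torch.randn(1, 3, 299, 299)     # Inception v3 expects 299x299 inputs
with torch.no_grad():
    feat = model(x)                 # shape (1, 2048)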
@KaiyangZhou Hi, were the change_points generated manually? If not, could you show me the associated code?
Is gtscore generated manually by the user? If not, could you show me the associated code?
@bersalimahmoud change_points are obtained by a temporal segmentation method. gtscore is the average of human scores, so it can be used for supervised training (you won't need this anyway).
@KaiyangZhou Regarding visualizing the summary, the readme says:
You can use summary2video.py to transform the binary machine_summary to a real summary video. You need to have a directory containing video frames. The code will automatically write summary frames to a video where the frame rate can be controlled. Use the following command to generate a .mp4 video.
Where or how can I get the frames? Can I get frames from the .h5 files, or shall I create frames from the raw videos?
Thank you very much !
@liuhaixiachina you need to decompose a video into frames before doing other things, e.g. feature extraction. You can use ffmpeg or Python to do it.
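For example, a minimal sketch with OpenCV (paths and the filename pattern are placeholders; check summary2video.py for the frame naming it expects):
import os
import cv2

cap = cv2.VideoCapture('video.mp4')     # placeholder input video
os.makedirs('frames', exist_ok=True)
idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imwrite(os.path.join('frames', '{:06d}.jpg'.format(idx)), frame)
    idx += 1
cap.release()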
@KaiyangZhou Hi, I am trying to use the code to test on my own video. I used a pretrained model to generate features and it has a 4096-dimensional output. I saw you said "the feature dimension does not matter" above, but I got "RuntimeError: input.size(-1) must be equal to input_size. Expected 1024, got 4096".
Could you please tell me how to solve this issue?
Thanks a lot!
@babyjie57 you need to change the argument input_dim=4096
@KaiyangZhou Thanks for your reply. I also added '--input-dim 4096', but I got 'While copying the parameter named "rnn.weight_ih_l0_reverse", whose dimensions in the model are torch.Size([1024, 4096]) and whose dimensions in the checkpoint are torch.Size([1024, 1024]).'
Can you please tell me how to solve this issue?
Thanks!
I presume you are loading a model which was trained with features of 1024 dimensions but initialized with feature dimension = 4096.
Can you also publish the script for the KTS you used to generate the change points?
@Mctigger you can find the code here http://lear.inrialpes.fr/people/potapov/med_summaries.php
I've got a question: if I want to use my own dataset but it has no labels, what should I do with user_summary, gtscore and gtsummary when I construct the HDF5 file?
Also, I see these three labels are only used in the evaluation process; does this mean I can just delete them from both the HDF5 file and the evaluation function? (I mean in the PyTorch implementation.)
Moreover, if I want to use result.json to generate a summary video for a raw video, can I delete these three labels?
Have you solved this problem? I want to use my own video data but I don't know how to deal with user_summary, gtscore and gtsummary.
How did you convert the video into signal for Kernel Temporal Segmentation (KTS) ?
@KaiyangZhou How did you convert the video into a signal for Kernel Temporal Segmentation (KTS)? Did you use CNN features for KTS? Are the CNN features subsampled and extracted from the video at 2 frames per second or 1 frame per second? Could you please share your code with me? Thank you very much!
How did you convert the video into signal for Kernel Temporal Segmentation (KTS) ?
You can decompose a video using either ffmpeg or OpenCV. For the latter, there is example code on the OpenCV website. You can write something like:
import numpy as np
import cv2

cap = cv2.VideoCapture('my_video.mp4')  # open the video file (placeholder path)
video_features = []
while True:
    # capture frame-by-frame
    ret, frame = cap.read()
    if not ret:  # no more frames
        break
    # maybe skip this frame for downsampling
    # feature extraction
    feature = feature_extractor(frame)  # or perform extraction on a minibatch, which leverages the GPU
    # store feature
    video_features.append(feature)
cap.release()

summary = video_summarizer(video_features)
Did you use CNN features for KTS? Are the CNN features subsampled and extracted from the video at 2 frames per second or 1 frame per second?
Yes. You can use CNN features, which capture high-level semantics. Downsampling is a common technique as neighbouring frames are redundant; 2 fps / 1 fps is good.
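A heavily hedged sketch of the KTS step (cpd_auto and its arguments are taken from the KTS demo linked earlier; verify the exact signature in that code, and the number of change points below is an arbitrary assumption):
import numpy as np
from cpd_auto import cpd_auto           # shipped with the KTS code linked above

features = np.random.rand(200, 1024)    # stand-in for (n_steps, feature_dim) subsampled CNN features
K = np.dot(features, features.T)        # frame-to-frame similarity kernel
max_ncp = features.shape[0] // 20       # rough upper bound on the number of change points (assumption)
cps, _ = cpd_auto(K, max_ncp, 1)        # change-point indices in the subsampled sequence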
Annotations are not required for training. Only frame features are required by the algorithm. You can qualitatively evaluate the results by applying the trained model to unseen videos and watching the summaries.
@KaiyangZhou Could you tell me where to download the original videos (SumMe and TVSum)? Thank you very much!
Same question as above. Could you please tell me where to download the original videos (SumMe and TVSum)? Thank you very much in advance!
@KaiyangZhou Can you please tell me what picks is and how we can calculate it? What are its dimensions?
@KaiyangZhou, how do you use KTS to generate change points? I used the official KTS code with CNN features for each frame, but I get the same number of segments for every video. Is there any problem?
@KaiyangZhou To get the change points, should the frames of a video be the input X in demo.py, or should the features of each frame be the input?
@wjb123 Hi, did you solve this problem? → "How do you use KTS to generate change points? I used the official KTS code with CNN features for each frame, but I get the same number of segments for every video. Is there any problem?"
@chenchch94 This is my forked repository. You can find generate_dataset.py in the "utils" directory. Good luck!
Hi, I could generate the .h5 file for my own dataset, however, my dataset has no annotations. Is it possible to use your code without annotated videos? If so, how? Thanks!
Hi @neda60, for training and evaluation, it's not possible. But just for testing, it is possible. To test, you need "features", "picks", "n_frames", "change_points", and "n_frame_per_seg".
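To make this concrete, a hedged sketch of a test-only h5 file containing just those keys (all arrays below are random placeholders; shapes follow readme.txt):
import h5py
import numpy as np

n_frames = 4800
picks = np.arange(0, n_frames, 15)                               # placeholder subsampled frame positions
features = np.random.rand(len(picks), 1024).astype(np.float32)   # placeholder (n_steps, feature_dim)
change_points = np.array([[0, 2399], [2400, 4799]])              # placeholder segments
n_frame_per_seg = change_points[:, 1] - change_points[:, 0] + 1

with h5py.File('my_test_dataset.h5', 'w') as f:
    f.create_dataset('video_1/features', data=features)
    f.create_dataset('video_1/picks', data=picks)
    f.create_dataset('video_1/n_frames', data=n_frames)
    f.create_dataset('video_1/change_points', data=change_points)
    f.create_dataset('video_1/n_frame_per_seg', data=n_frame_per_seg)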
@KaiyangZhou Can you please share the code for creating the .h5 file? How should I deal with gtscore, gtsummary and user_summary?
Hello, I want to use your RL code to extract key frames. I currently use a deep network to extract features and store them in an .h5 file, but I don't have the other attributes such as gtscore and gtsummary (I guess the dataset needs at least these attributes). For now I create gtscore as an all-ones numpy array, but I don't know whether that is right or wrong; if it is wrong, how can I compute gtscore? Meanwhile, I create gtsummary by randomly sampling some frames; should I sample uniformly?
@liuhaixiachina you need to decompose a video into frames before doing other things, e.g. feature extraction. You can use ffmpeg or Python to do it.
I followed the steps mentioned in the README, but it doesn't provide video frames either. Shall I create frames from the raw videos? Are there any missing steps in the README? How can I decompose a video using ffmpeg or Python when there are no videos in the datasets? I also read the code of summary2video.py. Should I decompose "result.h5"?
Hi, I have been able to run your algorithm on my machine (both training and test datasets). Now I would like to apply it to my own dataset (my videos are not compressed to .h5). How do I do that? What functions would I need to modify? Please guide me. The readme says:
Visualize summary: You can use summary2video.py to transform the binary machine_summary to a real summary video. You need to have a directory containing video frames. The code will automatically write summary frames to a video where the frame rate can be controlled. Use the following command to generate a .mp4 video.
Where or how can I get the frames? As above: I followed the steps mentioned in the README and it doesn't provide video frames either. Shall I create frames from the raw videos, or should I decompose "result.h5"?