TIBHannover / MSVA

Deep learning model for supervised video summarization called Multi Source Visual Attention (MSVA)
MIT License
41 stars · 17 forks

extract object features #10

Open xxhiao opened 3 years ago

xxhiao commented 3 years ago

Hi, I am trying to extract the object features using your code in https://github.com/VideoAnalysis/EDUVSUM/tree/master/src.

According to your paper, you are using GoogleNet trained on ImageNet. I assume you are extracting features with the "modelInceptionV3" model, as in the code. However, the feature shape of `inceptionv3_feature = modelInceptionV3.predict(frmRz299)` is (8, 8, 2048). I tried changing the model initialization to `modelInceptionV3 = InceptionV3(weights='imagenet', pooling='avg', include_top=False)` to get a 2048-dimensional feature vector. However, the object feature vectors in the MSVA code have length 1024, and I noticed that the feature values produced by the extraction code are quite different from those in the MSVA code: the former can be larger than 1, while the latter seem to be normalized to the [0, 1] range.

Have I missed something?
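For reference, here is a minimal Keras sketch of the two initializations I tried (weights=None just to keep the snippet download-free; the actual extraction would use weights='imagenet'):

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3

# weights=None keeps this sketch self-contained; real features need weights='imagenet'
no_pool = InceptionV3(weights=None, include_top=False)                  # spatial map
avg_pool = InceptionV3(weights=None, include_top=False, pooling='avg')  # pooled vector

frame = np.zeros((1, 299, 299, 3), dtype=np.float32)  # one resized frame (frmRz299)
print(no_pool.predict(frame).shape)   # (1, 8, 8, 2048) -- the shape I got
print(avg_pool.predict(frame).shape)  # (1, 2048) -- flat, but still not length 1024
```

So even with average pooling, InceptionV3 gives a 2048-dimensional vector, not the 1024 found in the MSVA h5 files.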

StevRamos commented 3 years ago

Hi, I have the same question, but I found some code: https://github.com/KaiyangZhou/pytorch-vsumm-reinforce#readme. I think they took the processed videos from that repository. I found another repository, https://github.com/SinDongHwan/pytorch-vsumm-reinforce/blob/master/utils/generate_dataset.py, where they tried to replicate the processed dataset (h5 files). I relied on the latter to try to replicate the datasets; you can find my code here: https://github.com/StevRamos/video_summarization/tree/main/src. I also compared the shapes of the original dataset and the ones I made: they seem to be the same, but the extracted feature values are not (I also used GoogleNet from PyTorch). Let me know if there is another idea that could help us replicate the dataset.

xxhiao commented 3 years ago

Thank you very much, @StevRamos. That's very helpful.

By the way, does anyone know which optical flow algorithm the authors of MSVA used to extract the I3D features?

mpalaourg commented 3 years ago

@StevRamos Our methods for replicating the dataset are very similar (almost identical). I think the extracted features don't match the given dataset (or KaiyangZhou's, which is where I also believe they took them from), either because of different library versions (PyTorch etc.) or because the initial dataset (KaiyangZhou's) wasn't produced with the PyTorch GoogleNet.

@xxhiao I didn't deal with the optical-flow features, so I can't be of any help with the implementation. If I remember correctly, the paper for the pretrained architecture used (I3D, Inflated 3D ConvNet) was this one, where they say in Section 2.5:

We computed optical flow with a TV-L1 algorithm.

I think this is a good enough assumption for the algorithm used.

StevRamos commented 3 years ago

@mpalaourg It is good to know that I am on the right track. How did you deal with the ground-truth score? I averaged the annotations, but I think there is one more step I am missing. If you have any ideas, I would be grateful if you shared them with me. By the way, is your code public? Thanks in advance!

@xxhiao they used the TV-L1 algorithm. I think they took frames in groups of 16, but it depends on how much RAM you have. In my case, I couldn't replicate this directly, but it worked once I resampled the frames.
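As a sketch of the grouping (my assumption of how frames feed into I3D, not the authors' exact code): chunk the frame indices into fixed snippets of 16 and drop the ragged tail:

```python
import numpy as np

def snippet_indices(n_frames, snippet_len=16):
    """Group frame indices into fixed-length snippets; drop the incomplete tail."""
    n_snippets = n_frames // snippet_len
    return np.arange(n_snippets * snippet_len).reshape(n_snippets, snippet_len)

print(snippet_indices(50).shape)  # (3, 16): 48 of the 50 frames are used
```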

mpalaourg commented 3 years ago

@StevRamos the pipeline for the gtscore calculation is a bit different for each dataset.

SumMe: given you have downloaded the files from here, you should have access to a folder named GT with some *.mat files. Each of these files has a user_score matrix with shape (frames, annotators) and a gtscore vector with shape (frames, 1). Somewhere in their paper they say that they presented the videos in random order (to avoid users selecting only frames at the beginning of the video), so the different numbers in user_score are just the order of the selected frames/shots, not importance. In MATLAB, for a single video you would want something like this:

  idxs = find(user_score > 0);           % linear indices of all selected entries
  my_user_score = user_score;
  my_user_score(idxs) = 1;               % binarize: selection order -> 0/1
  my_gtscore = mean(my_user_score, 2);   % average over annotators
  all(my_gtscore == gtscore)             % should match the provided gtscore

Then, to get the right shape, you have to sub-sample each video to 2 fps; note that they wrongly assumed every video is 30 fps.
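In Python, the same binarize-and-average step looks like this (toy numbers, not real SumMe annotations):

```python
import numpy as np

# Toy user_score matrix: 6 frames x 3 annotators; nonzero values are selection order
user_score = np.array([
    [0, 1, 0],
    [2, 0, 1],
    [0, 2, 0],
    [1, 0, 2],
    [0, 0, 0],
    [3, 3, 3],
])

binary = (user_score > 0).astype(float)  # order -> selected / not selected
gtscore = binary.mean(axis=1)            # average over the annotators

# Sub-sample to 2 fps under the (wrong, but used) assumption of 30 fps video
gtscore_2fps = gtscore[::15]
```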

TVSum: given you have downloaded the files from here, you should have access to a folder ../ydata-tvsum50-data/data with an anno.tsv file. Here, the 3rd column does contain importance scores, with a maximum value of 5. Normalize this 3rd column for each video, and again take the average over annotators.
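A small sketch for TVSum (toy numbers; I read "normalize" as dividing by the maximum score 5, which is one plausible choice):

```python
import numpy as np

# Toy importance scores for one video: 3 frames x 3 annotators, values in 1..5
scores = np.array([
    [1, 5, 3],
    [2, 4, 4],
    [5, 1, 2],
], dtype=float)

normalized = scores / 5.0           # per-video normalization to [0, 1]
gtscore = normalized.mean(axis=1)   # average over annotators
```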

Our code isn't public yet; we are waiting on a double-blind review of our work, and then we will release it. Although, to be honest, the data-preparation code won't be released, because we used the same data as this repo (with different splits). The only reason I was playing with the data was to fully understand it and use it correctly!

PS. If I didn't understand your question, feel free to ask again 😅

StevRamos commented 2 years ago

@mpalaourg I almost made a big mistake with the SumMe dataset. Thanks! Oh, I get it. Did you try using other datasets? I am trying to use CoSum and VSUMM, which I have read are also widely used; however, I don't know why they aren't used in this repository.

I will follow you to see your results when you finish it. Again, thank you very much.