StevRamos / video_summarization

A deep-learning-based solution for the efficient generation of keyshot-type highlights from videos.
MIT License

Supplementary material (datasets etc.) #1

Open mpalaourg opened 3 years ago

mpalaourg commented 3 years ago

Hi Stev,

I'll open this issue to continue our discussion from here, because it isn't about the object features anymore, but rather about video summarization datasets in general.

I highly encourage you to read this paper, Video Summarization Using Deep Neural Networks: A Survey. In particular, in Table II on page 13 you can see that the most used datasets out there are TVSum and SumMe (OVP and YouTube are mainly used for augmentation purposes), and I think that's the reason every repo out there uses only those h5 files. As a more recent trend, I see some new works using the VTW dataset, but not the two you mentioned (CoSum and VSUMM).

If you want, you can keep this issue open and keep our discussion alive about video summarization in general, and not only about datasets.

George

StevRamos commented 3 years ago

Hi George!

Let's keep this issue open. VSUMM contains OVP and YouTube, so that's correct. They don't use CoSum, but I'm trying to use it for augmentation and see if it works.

Regarding feature vectors, did you normalize them? In the MSVA repo they use RGB and FLOW feature vectors, but they don't specify whether they transform them or use them in their original form. I ask because I'm trying to retrain the model with different vectors. Thanks!

Stev

mpalaourg commented 3 years ago

Regarding feature vectors, did you normalize them? In the MSVA repo they use RGB and FLOW feature vectors, but they don't specify whether they transform them or use them in their original form. I ask because I'm trying to retrain the model with different vectors. Thanks!

Every time I used a new (or additional) feature representation, it was together with the already extracted GoogleNet features from KaiyangZhou's work. In that work, the features were normalized so that each vector has an L2 norm equal to 1. To that end, and to keep consistency among the used feature vectors, I also normalized mine.

I am not sure it's necessary if you are not going to use another set of features, but it's definitely a good practice.
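
For reference, a minimal sketch of that per-vector L2 normalization (NumPy only; the array shape and values below are placeholders, not the actual extracted features):

```python
import numpy as np

def l2_normalize(features, eps=1e-8):
    """Scale every frame-level feature vector to unit L2 norm."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / (norms + eps)

# Placeholder array: (n_frames, feature_dim), e.g. GoogleNet pool5 features are 1024-d
features = np.random.rand(320, 1024).astype(np.float32)
features = l2_normalize(features)
print(np.linalg.norm(features, axis=1)[:5])  # ~1.0 for every vector
```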

George

StevRamos commented 3 years ago

Regarding feature vectors, did you normalize them? In the MSVA repo they use RGB and FLOW feature vectors, but they don't specify whether they transform them or use them in their original form. I ask because I'm trying to retrain the model with different vectors. Thanks!

Every time I used a new (or additional) feature representation, it was together with the already extracted GoogleNet features from KaiyangZhou's work. In that work, the features were normalized so that each vector has an L2 norm equal to 1. To that end, and to keep consistency among the used feature vectors, I also normalized mine.

I am not sure it's necessary if you are not going to use another set of features, but it's definitely a good practice.

George

That makes sense. Thanks, George! I noticed that in the MSVA paper they did not normalize the RGB and FLOW feature vectors. I think they should be normalized, because there are three different feature vectors and, if we want them to contribute equally, they have to be on the same scale. Then, in the training phase, the model would learn which one to assign more weight to. What do you think?

mpalaourg commented 3 years ago

The logic behind normalizing the different features is that you must convert values with different scales into a common domain. Once the values are normalized, they can be fused with each other. That being said, normalization is not standardized, and various techniques exist (e.g. Min-Max normalization, Z-score normalization, Tanh normalization, L-p normalization). You should check and validate which technique works best for your problem and problem formulation.
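
For concreteness, here are rough sketches of a few of those techniques, assuming a (n_frames, feature_dim) NumPy array; these are the generic textbook formulas, not the exact code of any of the repos mentioned here:

```python
import numpy as np

def min_max_norm(x, eps=1e-8):
    # Rescale each feature dimension to [0, 1]
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0) + eps)

def z_score_norm(x, eps=1e-8):
    # Zero mean, unit variance per feature dimension
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def tanh_norm(x, eps=1e-8):
    # Tanh (Hampel-style) normalization: squashes values into (0, 1)
    return 0.5 * (np.tanh(0.01 * (x - x.mean(axis=0)) / (x.std(axis=0) + eps)) + 1.0)

def lp_norm(x, p=2, eps=1e-8):
    # Scale each feature vector to unit L-p norm
    return x / (np.linalg.norm(x, ord=p, axis=1, keepdims=True) + eps)
```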

I didn't go over the code of MSVA in such detail, but I assumed they normalized their features before fusion. Finally, I don't think the model would learn which one to assign more weight to, but rather that the model gets a space (common to all features) where each feature contributes equally. In such a space it's easier for your model to find the (non-linear) function that describes your data.

edit: One more thing I thought of after posting: if your features have different scales, then in back propagation the gradient will produce bigger changes in one part of the network and smaller changes in another, because of that scale!

StevRamos commented 3 years ago

Thank you very much @mpalaourg! Your answer was very clear and I agree with you. That's right, and for the last two weeks I've been experimenting with different functions. At first I didn't normalize them and my results didn't improve, but then I normalized the features and got better results. So normalization helps the model learn this nonlinear function correctly. After that, I tried adding other datasets like OVP and YouTube, but my results didn't improve much. Any suggestions on this?

According to many papers, they used cross validation to validate the performance of the model, but that is only for validation. Do you know what data they train on to obtain the pre-trained weights? I don't remember if they mentioned that, but I think I should train using all the data.

mpalaourg commented 3 years ago

Thank you very much @mpalaourg! Your answer was very clear and I agree with you. That's right, and for the last two weeks I've been experimenting with different functions. At first I didn't normalize them and my results didn't improve, but then I normalized the features and got better results. So normalization helps the model learn this nonlinear function correctly.

I am glad that I can be of help.

<...> I tried adding other datasets like OVP and YouTube, but my results didn't improve much. Any suggestions on this?

I don't think that augmented data (i.e. using an extra dataset) will guarantee better performance. That being said, the underlying structure of those datasets is different compared to SumMe and TVSum. In SumMe and TVSum, gtscore contains real values between 0 and 1 (importance scores), while in YouTube and OVP, gtscore contains 0s and 1s (key-shot selection). Furthermore, gtsummary (if you train with that ground truth) contains key-shot selections for SumMe and TVSum, and key-frame selections for YouTube and OVP. I haven't experimented with augmented data (it's in my plans to do so), but I would assume that some extra work must be done for the ground truth to be equivalent between datasets!
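
If it helps, a quick way to see that difference is to inspect the gtscore arrays in the h5 files; the file names below are only examples of the commonly shared versions of these datasets, so adjust them to whatever files you actually use:

```python
import h5py
import numpy as np

# Assumed file names of the commonly shared h5 datasets (placeholders)
for path in ["eccv16_dataset_summe_google_pool5.h5",
             "eccv16_dataset_ovp_google_pool5.h5"]:
    with h5py.File(path, "r") as f:
        video = list(f.keys())[0]
        gtscore = f[video]["gtscore"][...]
        kind = ("real-valued importance scores"
                if np.unique(gtscore).size > 2
                else "binary key-frame/shot selection")
        print(f"{path}: {video} gtscore -> {kind}")
```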

According to many papers, they used cross validation to validate the performance of the model, but that is only for validation. Do you know what data they train on to obtain the pre-trained weights? I don't remember if they mentioned that, but I think I should train using all the data.

Cross validation means that you have a percentage of the data for training (normally 80%) and a percentage for testing (20%), and you split the training portion again into 80% for training and 20% for validation. All hyperparameter tuning must be done on that 80% training / 20% validation split (the validation set is used here as a test set). The final model is then evaluated on the remaining 20% test set (i.e. these are the values reported in the paper). That's the theory. In practice, the datasets here are really small, so this cannot always be applied. A great percentage of the published work does not use cross validation (us included!), and many works even use the test data for training. The former isn't so bad from a theoretical point of view, but the latter is! DO NOT use every point of the dataset for training. If you can afford it, go for cross validation, but at least keep 20% for testing!
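
As a minimal sketch of that split logic (scikit-learn, with placeholder video ids; the exact split files used by each paper differ):

```python
from sklearn.model_selection import train_test_split

video_ids = [f"video_{i}" for i in range(50)]  # placeholder, e.g. TVSum has 50 videos

# 80% train+val / 20% test, then 80/20 again inside the training portion
train_val_ids, test_ids = train_test_split(video_ids, test_size=0.2, random_state=0)
train_ids, val_ids = train_test_split(train_val_ids, test_size=0.2, random_state=0)

print(len(train_ids), len(val_ids), len(test_ids))  # 32, 8, 10
```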

fangxiaohei commented 3 years ago

I am very happy to join your discussion and I am grateful for your contributions. My question is: have you tried DSNet with custom videos? When I use my own data (videos and labels) I can't build the dataset; it tells me "UnboundLocalError: local variable 'gtscore' referenced before assignment". Do you know the reason? Thank you, I am sorry my English is not good.

StevRamos commented 3 years ago

I don't think that augmented data (i.e. using an extra dataset) will guarantee better performance. That being said, the underlying structure of those datasets is different compared to SumMe and TVSum. In SumMe and TVSum, gtscore contains real values between 0 and 1 (importance scores), while in YouTube and OVP, gtscore contains 0s and 1s (key-shot selection). Furthermore, gtsummary (if you train with that ground truth) contains key-shot selections for SumMe and TVSum, and key-frame selections for YouTube and OVP. I haven't experimented with augmented data (it's in my plans to do so), but I would assume that some extra work must be done for the ground truth to be equivalent between datasets!

I agree, @mpalaourg. According to some works I've seen, they used OVP and YouTube in their original form (as you mention). I tried to transform their original ground truth into importance scores, but, as you said, maybe I'm skipping that extra work for now. In future work I will update that part, but for now it doesn't look like correct work. Hope to hear news about your experiments with augmented data!

Cross validation means that you have a percentage of the data for training (normally 80%) and a percentage for testing (20%), and you split the training portion again into 80% for training and 20% for validation. All hyperparameter tuning must be done on that 80% training / 20% validation split (the validation set is used here as a test set). The final model is then evaluated on the remaining 20% test set (i.e. these are the values reported in the paper). That's the theory. In practice, the datasets here are really small, so this cannot always be applied. A great percentage of the published work does not use cross validation (us included!), and many works even use the test data for training. The former isn't so bad from a theoretical point of view, but the latter is! DO NOT use every point of the dataset for training. If you can afford it, go for cross validation, but at least keep 20% for testing!

No, I haven't trained on all the data. I used cross validation too (there is no official split, so that's another problem) to evaluate the performance of the model (at least in the MSVA paper they did it, and if I'm not wrong they published that result in the paper). But this doesn't output the weights of the model (or does it?), because which one would you use? The one trained on the last split? The average of all the splits? I mean, cross validation doesn't give you the weights, just the performance. But I think I get your idea: at the end, when I have my performance, I split the data and train on that 80%, and those would be my final weights.

StevRamos commented 3 years ago

I am very happy to join your discussion and I am grateful for your contributions. My question is: have you tried DSNet with custom videos? When I use my own data (videos and labels) I can't build the dataset; it tells me "UnboundLocalError: local variable 'gtscore' referenced before assignment". Do you know the reason? Thank you, I am sorry my English is not good.

Hi @fangxiaohei! I haven't tried DSNet, but the error is because the custom dataset has to include that "gtscore" key, which stands for ground truth score (the users' annotations). It's needed for training the model; otherwise, you won't be able to train the model with your custom dataset. I hope that helps!
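
As a rough illustration, a custom h5 file in the style of the commonly shared SumMe/TVSum datasets would include a gtscore entry per video, along the lines of the sketch below; the key names follow that common layout and the values are placeholders, not a DSNet-specific recipe:

```python
import h5py
import numpy as np

n_picked = 300   # number of subsampled frames you extracted features for
n_frames = 4500  # total frames in the original video

with h5py.File("custom_dataset.h5", "w") as f:
    grp = f.create_group("video_1")
    grp["features"] = np.random.rand(n_picked, 1024).astype(np.float32)  # frame features (placeholder)
    grp["gtscore"] = np.random.rand(n_picked).astype(np.float32)         # per-frame importance from your labels
    grp["n_frames"] = np.int64(n_frames)
    grp["picks"] = np.arange(0, n_frames, 15)                            # indices of the sampled frames
```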

mpalaourg commented 3 years ago

No, I haven't trained on all the data. I used cross validation too (there is no official split, so that's another problem) to evaluate the performance of the model (at least in the MSVA paper they did it, and if I'm not wrong they published that result in the paper). But this doesn't output the weights of the model (or does it?), because which one would you use? The one trained on the last split? The average of all the splits? I mean, cross validation doesn't give you the weights, just the performance. But I think I get your idea: at the end, when I have my performance, I split the data and train on that 80%, and those would be my final weights.

We made our code public here (or my fork would work too 😅). As you can see in the README, we used Zenodo to release the pretrained models of our two main experiments (paper coming soon!). We release the weights for each split, dataset and experiment, since each of these models is a separate entity trained on different videos. So, no, you don't have to take the average of the different splits. I don't even think a model obtained by averaging 5 models would be something meaningful!

StevRamos commented 3 years ago

@mpalaourg I checked your repo but I didn't find your results. Where are they? Or can you tell me if you got state-of-the-art results? :)

mpalaourg commented 2 years ago

@StevRamos, sorry for the late reply, but the deadlines this month were crazy and I didn't have any time to respond. Our work (camera-ready version) isn't published on IEEE Xplore yet. It's in my plans to upload the accepted version to my site in the following days. I'll ping you here to notify you. Until then, our results are:

Eval Protocol        SumMe   TVSum
Ours                 55.6    61.0
VASNet's / MSVA's    57.1    62.7
mpalaourg commented 2 years ago

Hi @StevRamos, the pre-print version of our work was uploaded to my site (https://mpalaourg.me/#publications). Hope you find it interesting!

StevRamos commented 2 years ago

Hi @mpalaourg, I was very busy too; I just finished college. I'll read your pre-print. It seems very interesting, and it's a topic that I would still like to explore. Let's keep in touch here or on LinkedIn :)