Question about the experimental setup with Hockey dataset in the paper "Efficient Violence Detection Using 3D Convolutional Neural Networks"

JimLee1996 / AVSS2019

Efficient Violence Detection Using 3D Convolutional Neural Networks

MIT License



Closed SonNguyen2510 closed 1 year ago

SonNguyen2510 commented 3 years ago

First of all, thank you for contributing this great work to the community. I'm very interested in your study and am trying to run your code with the same experimental setup on the Hockey dataset as in your paper. However, a few things are unclear to me:

1. The learning rate in the paper is 10^-3, but in your code it is set to 10^-2. Which learning rate should I use?
2. There are two types of 3D DenseNet (Lean and Original) in your code. Which model did you use in your paper for the Hockey dataset?
3. Are the parameters in the code, such as epoch and acc_baseline (150, 0.92), the same as the settings used in the paper?
4. For the final accuracy in the paper, 98.3±0.81%: did you calculate the mean and standard deviation of the accuracy over all models saved after 5-fold cross-validation that are higher than acc_baseline? Thank you!

JimLee1996 commented 3 years ago

Thanks for your questions. For 1, the default learning rate in the code is for faster training; for better convergence, please use the learning rate from the paper. For 2, the model in the paper is the lean one (it is mentioned in the paper, though maybe not very clearly). For 3, these hard-coded parameters are not very important hyperparameters; they are set to save storage for weight files. But yes, they were used in the experiments. For 4, yes, we use 5-fold cross-validation. We train the model 5 times and log the best validation accuracy for each sub-experiment.
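
The aggregation described in point 4 can be sketched as follows. This is a minimal illustration, not the repository's code: the function names and the per-epoch accuracy values are hypothetical placeholders (they are not results from the paper), and the thread does not say whether a population or sample standard deviation was used, so the population form here is an assumption.

```python
import statistics


def best_per_fold(histories):
    """Take the best validation accuracy reached in each fold's training run."""
    return [max(epoch_accs) for epoch_accs in histories]


def summarize_folds(best_accs):
    """Return (mean, std) over the per-fold best validation accuracies."""
    mean = statistics.mean(best_accs)
    # Assumption: population standard deviation; the thread does not specify.
    std = statistics.pstdev(best_accs)
    return mean, std


# Placeholder per-epoch validation accuracies for 5 folds (illustrative only).
fold_histories = [
    [0.90, 0.95, 0.97],
    [0.91, 0.98, 0.96],
    [0.93, 0.99, 0.98],
    [0.92, 0.97, 0.98],
    [0.94, 0.98, 0.97],
]

best = best_per_fold(fold_histories)
mean, std = summarize_folds(best)
print(f"accuracy: {mean * 100:.1f} ± {std * 100:.2f}%")
```

Swapping `statistics.pstdev` for `statistics.stdev` gives the sample standard deviation instead, which is slightly larger for five folds.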

SonNguyen2510 commented 3 years ago

Thank you very much for the quick answer! I have one more question. In the training phase, for each video sample, do you randomly slice 16 adjacent frames once or several times? For example, for a sample video of 46 frames, do you randomly slice 16 adjacent frames once or more than once before feeding it to the 3D CNN model? Also, could you tell me where the numbers in "Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])" come from? Thank you very much.

JimLee1996 commented 3 years ago

Once per epoch, which means several different slices per video over the whole training phase.
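
A rough sketch of that sampling scheme, assuming a uniformly random start offset (the thread does not state the exact distribution). The function name `random_temporal_crop` and the `rng` parameter are hypothetical and are not taken from the repository.

```python
import random

CLIP_LEN = 16  # number of adjacent frames per clip, as discussed in the thread


def random_temporal_crop(num_frames, clip_len=CLIP_LEN, rng=random):
    """Return the indices of clip_len adjacent frames at a random start offset."""
    if num_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, num_frames - clip_len)
    return list(range(start, start + clip_len))


# One crop per video per epoch: the same 46-frame video usually yields
# a different 16-frame window each time it is visited.
rng = random.Random(0)
for epoch in range(3):
    indices = random_temporal_crop(46, rng=rng)
    print(f"epoch {epoch}: frames {indices[0]}..{indices[-1]}")
```

Calling the crop inside the dataset's item lookup, rather than once up front, is what makes each epoch see a fresh window.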

The Normalize parameters could also be chosen to give N(0, 1), but I took the RGB channel parameters from the official PyTorch values, which are computed on ImageNet. Of course, it is better to calculate them on your specific dataset.
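
Computing dataset-specific statistics could look roughly like this. It is a dependency-free sketch: `channel_stats` and `normalize` are hypothetical helpers, pixels are assumed to be (r, g, b) tuples already scaled to [0, 1], and the two toy pixels stand in for all frames of a real training set.

```python
import math


def channel_stats(pixels):
    """Per-channel mean and std over an iterable of (r, g, b) tuples in [0, 1]."""
    count = 0
    sums = [0.0, 0.0, 0.0]
    square_sums = [0.0, 0.0, 0.0]
    for pixel in pixels:
        count += 1
        for c in range(3):
            sums[c] += pixel[c]
            square_sums[c] += pixel[c] * pixel[c]
    if count == 0:
        raise ValueError("no pixels given")
    means = [s / count for s in sums]
    # Var[x] = E[x^2] - E[x]^2, clamped at 0 to absorb rounding error.
    stds = [math.sqrt(max(sq / count - m * m, 0.0))
            for sq, m in zip(square_sums, means)]
    return means, stds


def normalize(pixel, means, stds):
    """Apply (value - mean) / std per channel, as a Normalize transform does."""
    return tuple((v - m) / s for v, m, s in zip(pixel, means, stds))


# Toy data for illustration only; a real run would iterate over training frames.
toy_pixels = [(0.2, 0.4, 0.6), (0.4, 0.6, 0.8)]
means, stds = channel_stats(toy_pixels)
print(means, stds)
print(normalize(toy_pixels[0], means, stds))
```

The resulting lists would replace the two ImageNet lists passed to Normalize. Statistics should be computed on the training split only, so that validation data does not leak into preprocessing.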

SonNguyen2510 commented 3 years ago

Thank you very much sir!