dingfengshi / TriDet

[CVPR2023] Code for the paper, TriDet: Temporal Action Detection with Relative Boundary Modeling
MIT License

The issue concerns the initialization of weights and biases in the SGP module. #37

Open lixueli8 opened 3 months ago

lixueli8 commented 3 months ago

Hi, this work is very interesting. However, I have some questions about the initialization of weights and biases in the SGP module. Because I have limited coding experience, I cannot understand why the weights and biases of the convolutions in the SGP block are initialized to 0. When I was debugging, the weights and biases of these convolutions were 0 in both training and testing, so my understanding is that they have no effect. How, then, does the SGP block contribute in the code? (Note: if init_conv_var is set to a non-zero value, the results drop a lot.) Thank you.

dingfengshi commented 3 months ago

Hi, for the THUMOS14 dataset, since the action lengths are relatively short, we found in our experiments that initializing from zero weights (i.e., retaining only the residual connections at the very beginning) can make training more stable. For datasets like HACS and ActivityNet, which contain many long actions, enabling the SGP layer at initialization can achieve better results.
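To illustrate what "retaining only the residual connections" means, here is a minimal sketch in plain Python. It is not the TriDet code: `conv1d_same` and `residual_block` are hypothetical helpers, and the block is reduced to a single convolution branch plus an identity path. Under those assumptions, a zero-initialized branch contributes nothing, so the block starts out as the identity.

```python
def conv1d_same(x, weights, bias=0.0):
    """1D convolution with zero padding; assumes an odd kernel length
    so the output has the same length as x."""
    pad = len(weights) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(weights)) + bias
        for i in range(len(x))
    ]


def residual_block(x, weights, bias=0.0):
    """Hypothetical block: convolution branch added to the identity path."""
    branch = conv1d_same(x, weights, bias)
    return [xi + bi for xi, bi in zip(x, branch)]


features = [0.5, -1.0, 2.0, 0.25]

# Zero weights and bias: the branch outputs zeros, only the residual survives.
zero_init = residual_block(features, weights=[0.0, 0.0, 0.0])

# Non-zero weights: the branch mixes neighbouring features from the first step.
nonzero_init = residual_block(features, weights=[0.1, 0.2, 0.1])

print(zero_init == features)     # True
print(nonzero_init == features)  # False
```

With zero weights the block passes its input through unchanged, which matches the "residual only at the very beginning" description above. Whether the branch later becomes active depends on whether it receives a non-zero gradient during training.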

lixueli8 commented 3 months ago

Thank you for your reply. However, in my experiments I found that the weights and biases of the convolutions in the SGP block remain 0 throughout training and testing. This is equivalent to the SGP convolutions always outputting 0, so only the residual connections are retained.
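One possible explanation for weights that never leave zero, offered as an assumption and not confirmed anywhere in this thread: if a block multiplies the outputs of two branches and both branches start at zero, each branch's gradient is scaled by the other branch's output, which is zero, so gradient descent has nothing to update. The scalar toy below sketches this. `gated_residual`, `grads`, and `sgd` are hypothetical names, the gradients are derived by hand, and the real SGP forward pass may differ.

```python
def gated_residual(x, w_a, w_b):
    """Hypothetical scalar block: out = x + (w_a * x) * (w_b * x)."""
    return x + (w_a * x) * (w_b * x)


def grads(x, w_a, w_b, target):
    """Hand-derived gradients of 0.5 * (out - target)**2 w.r.t. w_a and w_b."""
    upstream = gated_residual(x, w_a, w_b) - target
    grad_a = upstream * x * (w_b * x)  # scaled by the other branch's output
    grad_b = upstream * (w_a * x) * x  # scaled by the other branch's output
    return grad_a, grad_b


def sgd(x, w_a, w_b, target=3.0, lr=0.1, steps=100):
    """Plain gradient descent on the two scalar weights."""
    for _ in range(steps):
        g_a, g_b = grads(x, w_a, w_b, target)
        w_a -= lr * g_a
        w_b -= lr * g_b
    return w_a, w_b


# Both branches start at zero: every gradient is zero, so nothing moves.
zero_start = sgd(x=2.0, w_a=0.0, w_b=0.0)

# A small non-zero start gives both branches a gradient, so they can train.
small_start = sgd(x=2.0, w_a=0.01, w_b=0.01)

print(zero_start)  # stays at (0.0, 0.0)
print(small_start)
```

In this toy, the zero start is a stationary point even though the loss is not minimized, while the small non-zero start moves away from its initial values. If the SGP block has this multiplicative structure, that would be consistent with the observation above.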

dingfengshi commented 3 months ago

A characteristic of the THUMOS14 dataset is that a video contains dozens or hundreds of actions, and many actions span only a few features. For this kind of dataset, aggregating features with a large window is not necessary. For large-scale, highly varying datasets such as HACS, enabling multi-scale feature extraction is more necessary.

lixueli8 commented 3 months ago

What you said makes sense, thank you for your reply.