Thanks for your good question! As shown in our ablation, local MHRA and temporal downsampling only help on temporal-related datasets (e.g., Something-Something). On scene-related datasets (e.g., Kinetics), they bring little or no improvement. We aim for the simplest model, which makes it easier to scale up, so most of our models only add a 4-layer global MHRA. Our experiments show this design is simple yet effective.
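To make that concrete, here is a minimal sketch of the "global-only" design, with `nn.MultiheadAttention` standing in for global MHRA (the names are illustrative, not the repo's actual modules):

```python
import torch
import torch.nn as nn

class GlobalBlock(nn.Module):
    """Stand-in for one global MHRA block: attention over all space-time tokens."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, tokens):
        out, _ = self.attn(self.norm(query), tokens, tokens)
        return query + out  # residual connection

# Only 4 global blocks on top of the ViT; no local MHRA, no temporal downsampling.
global_blocks = nn.ModuleList(GlobalBlock() for _ in range(4))
```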
Thanks for your reply. I also have some questions about initialization and training. Since you introduce new parameters on top of the original ViT, I don't see anywhere that you initialize them with Xavier or normal initialization (apart from those initialized with zeros). Also, the new parameters seem to share the same learning rate as the ViT; I wonder why you don't use a larger learning rate for them.
For most of the new parameters, I just use the default initialization; PyTorch's defaults are sufficient. You can check the source code of those layers in PyTorch.
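For reference, `nn.Linear` and `nn.ConvNd` already initialize themselves in their `reset_parameters()` method (Kaiming-uniform weights, fan-in-scaled uniform bias), so simply constructing the layer is enough:

```python
import torch.nn as nn

# Constructing a layer already runs its reset_parameters():
# nn.Linear / nn.ConvNd use Kaiming-uniform weights and a
# fan-in-scaled uniform bias, so no extra init code is needed.
linear = nn.Linear(768, 768)
conv = nn.Conv3d(768, 768, kernel_size=(3, 1, 1), padding=(1, 0, 0))
print(linear.weight.std().item(), conv.weight.std().item())
```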
For some special layers, I initialize them with zeros, including:
- the last point-wise convolutions in the local temporal MHRA,
- the query tokens and output projection layers in the query-based cross MHRA,
- the last linear layers in the FFN of the global UniBlock,
- the learnable fusion weights.
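A minimal sketch of that zero initialization, with hypothetical attribute names standing in for the actual layers (the real module names in the repo may differ):

```python
import torch.nn as nn

def zero_init_(layer: nn.Module) -> None:
    """Zero the weight (and bias, if present) so the layer outputs zeros at first."""
    nn.init.zeros_(layer.weight)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)

# Hypothetical names, for illustration only:
# zero_init_(block.local_mhra.pw_conv2)   # last point-wise conv in local temporal MHRA
# zero_init_(block.cross_mhra.out_proj)   # output projection in query-based cross MHRA
# zero_init_(block.ffn.fc2)               # last linear layer in the global UniBlock FFN
# nn.init.zeros_(model.query_tokens)      # learnable query tokens
# nn.init.zeros_(model.fusion_weights)    # learnable fusion weights
```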
Such zero initialization preserves the original input, which keeps training stable at the start. More importantly, I use a relatively small learning rate (e.g., 2e-5) compared to previous work, so I don't need to decrease or increase the learning rate for the new parameters. BTW, in my experiments, changing the learning-rate scale did not bring any improvement.
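For comparison, giving the new parameters their own optimizer group with a scaled learning rate (the variant the question proposes) would look roughly like the sketch below; the module names and the 10x factor are assumptions for illustration, and in my experiments this brought no gain over a single small rate:

```python
import torch
import torch.nn as nn

# Toy stand-in: a pretrained "backbone" plus a newly added module.
model = nn.ModuleDict({
    "backbone": nn.Linear(768, 768),
    "cross_mhra": nn.MultiheadAttention(768, 12, batch_first=True),
})

base_lr = 2e-5
vit_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
new_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]

# Per-group learning rates: larger for the newly added modules.
optimizer = torch.optim.AdamW(
    [
        {"params": vit_params, "lr": base_lr},
        {"params": new_params, "lr": base_lr * 10},  # e.g., 10x for new modules
    ],
    weight_decay=0.05,
)
```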
Thanks!
I have noticed that `NO_LMHRA` is enabled in most of the experiments, such as K400/K600/K700/K710. Why should we not use local MHRA in the model, since it seems intuitive that local temporal cues should be exploited during training?