Closed CanyonWind closed 2 years ago
Sorry for the late response. The backbones used in downstream tasks with window/hybrid SABlock do not need to be separately pre-trained on ImageNet; the weights are loaded directly.
Since we train the backbone on ImageNet with 224x224 images, the global self-attention in stage 3 can be seen as window self-attention with a 14x14 window (224/16 = 14). Hence, for downstream tasks we only need to ensure the window size in stage 3 is no smaller than 14x14, and the pre-trained weights can be fully utilized.
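To make the equivalence concrete, here is a minimal sketch (not the repo's actual code) showing that the learnable parameters of a self-attention block are plain linear layers, independent of window size, so a "global" block pretrained on a 14x14 feature map loads directly into a windowed block:

```python
import torch
import torch.nn as nn

# Minimal single-head sketch. The only learnable parameters are the qkv and
# proj Linear layers, which do not depend on the window size. On a 14x14
# feature map (224/16), global attention IS window attention with window 14,
# so the ImageNet weights transfer as long as the downstream window >= 14.
dim = 64
qkv = nn.Linear(dim, dim * 3)
proj = nn.Linear(dim, dim)

def window_attention(x, win):
    # x: (H, W, C); partition into win x win windows and attend within each.
    H, W, C = x.shape
    x = x.view(H // win, win, W // win, win, C).permute(0, 2, 1, 3, 4)
    x = x.reshape(-1, win * win, C)                  # (num_windows, N, C)
    q, k, v = qkv(x).chunk(3, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * (C ** -0.5)
    return proj(attn.softmax(-1) @ v)

x = torch.randn(14, 14, dim)
# win == H == W == 14 gives a single window: identical to global attention.
y_win = window_attention(x, 14)
print(y_win.shape)  # torch.Size([1, 196, 64])
```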
I see, that makes a lot of sense.
Also, the detection README mentions that the backbones are not trained with Token Labeling or Layer Scale. Is there any particular reason for that? In other words, do you think backbones trained with these two techniques might be a better choice here? Thanks!
Thanks for your good questions. We do not use backbones trained with Token Labeling for fair comparisons, since most SOTA models are trained without it.
As for Layer Scale, it helps avoid NaNs when training on ImageNet. We have tried using backbones with Layer Scale for video classification, but training sometimes fails to converge in the later epochs. This seems strange, and it does not affect the other downstream tasks. Thus we use the same backbones (without the above two techniques) for all downstream tasks for fair comparison.
Actually, most current works use Layer Scale when training large models, so I think it is the better choice. Moreover, backbones trained with Token Labeling also help downstream tasks such as detection. If you simply want the best results, you can use those stronger models trained with Token Labeling.
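For reference, Layer Scale (introduced in CaiT) is very simple. A hedged sketch, not the exact UniFormer code: each residual branch is multiplied by a learnable per-channel vector initialized to a tiny value, so early residual updates stay small and deep models train more stably:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scaling of a residual branch (CaiT-style)."""
    def __init__(self, dim, init_value=1e-5):
        super().__init__()
        # Initializing near zero makes each block start close to identity.
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x

# Typical usage inside a block: x = x + layer_scale(attn(norm(x)))
ls = LayerScale(64)
out = ls(torch.randn(2, 196, 64))
print(out.shape)  # torch.Size([2, 196, 64])
```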
Thanks for sharing these. Please bear with my ignorance of these two techniques; I suppose I should look up the details in the two papers. But since you are very familiar with them, I would appreciate it if you could briefly walk me through two questions:
Also, when using the backbone for the detection task, the `UniFormer` module initializes a new `norm1`-`norm4` layer for each stage and discards the original `self.norm` layer (the one before the final classification head). Could you please share some insight on this implementation as well? Thanks a lot!
@CanyonWind Thanks for your detailed questions!
About the two techniques:
About `norm` for the detection task:
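A hypothetical sketch of the pattern the question describes (the names `norm1`-`norm4` follow the question; this is not the repo's actual code): a detection backbone must emit multi-scale features, so a fresh norm layer is added after each stage to normalize that stage's output before it is fed to the neck (e.g. FPN). The original final `self.norm` only fed the classification head, which detection discards, so that layer is simply unused:

```python
import torch
import torch.nn as nn

class DetBackboneWrapper(nn.Module):
    """Sketch: add one new, randomly initialized LayerNorm per stage output."""
    def __init__(self, stage_dims=(64, 128, 320, 512)):
        super().__init__()
        for i, d in enumerate(stage_dims, start=1):
            self.add_module(f"norm{i}", nn.LayerNorm(d))
        # Note: no final classifier norm here; detection never uses it.

    def forward(self, stage_feats):
        # stage_feats: per-stage token features, each of shape (B, N_i, C_i).
        return [getattr(self, f"norm{i + 1}")(f)
                for i, f in enumerate(stage_feats)]

feats = [torch.randn(1, 56 * 56, 64), torch.randn(1, 28 * 28, 128),
         torch.randn(1, 14 * 14, 320), torch.randn(1, 7 * 7, 512)]
outs = DetBackboneWrapper()(feats)
print([o.shape[-1] for o in outs])  # [64, 128, 320, 512]
```

Since these per-stage norms do not exist in the ImageNet checkpoint, they are trained from scratch during detection fine-tuning.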
Thanks for the clarifications. These are awesome!
@CanyonWind We have verified that Token Labeling can help detection. Have a try!
Hi, thank you for the contribution to this super-rad work!
I am wondering whether, in your experiments, the backbone models used for the detection task with stage-3 window/hybrid SABlock (S-h14, B-h14) need to be pretrained on ImageNet.
If so, could these backbones with window/hybrid SABlock be released? And if not, are the weights loaded directly from the regular model with global self-attention in stage 3?
Thanks!