Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0
819 stars · 111 forks

Pretrained window/hybrid SABlock backbone model for Detection task #12

Closed CanyonWind closed 2 years ago

CanyonWind commented 2 years ago

Hi, thank you for the contribution to this super-rad work!

I was wondering: in your experiments, do the backbone models used for the detection task with stage-3 window/hybrid SABlock (S-h14, B-h14) need to be pretrained on ImageNet?

If so, could these backbones with window/hybrid SABlock be released? And if not, are the weights loaded directly from the regular model with global attention in stage-3?

Thanks!

Andy1621 commented 2 years ago

Sorry for the late response. Actually, the backbones used in downstream tasks with window/hybrid SABlock do not need to be pre-trained on ImageNet. The weights are loaded directly.

Since we train the backbone with 224x224 images on ImageNet, the normal self-attention in stage-3 can be seen as window self-attention with a 14x14 window (224/16 = 14). Hence, for the downstream tasks, we only need to ensure the window size in stage-3 is no smaller than 14x14, so the pre-trained weights can be utilized well.
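The arithmetic behind this can be sketched as a small helper (the stride value of 16 is taken from the 224/16 = 14 calculation above; the function name is mine, not from the repo):

```python
def stage3_grid(img_size: int, stride: int = 16) -> int:
    """Token-grid side length at stage 3: the backbone has downsampled
    the input by an overall stride of 16 by that point, so global
    self-attention on a 224x224 image spans a 14x14 grid."""
    return img_size // stride

# Global attention during ImageNet pre-training covers a 14x14 grid,
# so any downstream window size >= 14 can reuse those weights directly.
print(stage3_grid(224))  # 14
```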

CanyonWind commented 2 years ago

I see, that makes a lot of sense.

Also, the detection README mentions that the backbones are not trained with Token Labeling and Layer Scale. Are there any particular reasons for that? In other words, do you think backbones with these two techniques might be a better choice here? Thanks!

Andy1621 commented 2 years ago

Thanks for your good questions. We do not use backbones trained with Token Labeling, for fair comparison, since most SOTA models are trained without it.

As for Layer Scale, it is good for avoiding NaN on ImageNet. We have tried using backbones with Layer Scale for video classification, but training sometimes fails to converge in the later epochs. It seems strange, but it does not affect other downstream tasks. Thus we use the same backbones (without the above two techniques) for all downstream tasks for fair comparison.

Actually, most current works use Layer Scale when training large models. I think it is the better choice. Moreover, backbones trained with Token Labeling also help downstream tasks like detection. If you only pursue better results, you can use those more powerful models with Token Labeling.
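For readers unfamiliar with the technique: Layer Scale (introduced in CaiT) multiplies each residual branch by a learnable per-channel vector initialized near zero, so deep models start close to identity mappings and train more stably. A minimal PyTorch sketch, assuming a CaiT-style init value of 1e-5 (this module and its default are my illustration, not UniFormer's exact code):

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel learnable scaling of a residual branch (CaiT-style).
    gamma starts near zero, so each block initially contributes little."""
    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x

# Typical use inside a transformer block:
#   x = x + layer_scale(attn(norm(x)))
ls = LayerScale(64)
out = ls(torch.randn(2, 196, 64))  # shape preserved: (2, 196, 64)
```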

CanyonWind commented 2 years ago

Thanks for sharing these. Please bear with my ignorance of these two techniques; I guess I am supposed to look up the details in the two papers. But since you are so proficient, it would be appreciated if you could briefly educate me on two questions:

CanyonWind commented 2 years ago

Also, when using the backbone for the detection task, the UniFormer module initializes a new norm layer for each stage (norm1-4) and discards the original self.norm layer (the one before the final classification head). Could you please share some insights on this implementation as well? Thanks a lot!
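For context on the pattern being asked about: detection backbones typically return one feature map per stage for the FPN neck, so each stage output gets its own freshly initialized norm, while the classification model's single pre-head norm has no role and is dropped. A toy sketch of that structure, assuming simple conv stages and BatchNorm (all names and layer choices here are my illustration, not UniFormer's actual code):

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical 4-stage backbone for detection: returns multi-scale
    features, one per stage, each passed through its own norm1..norm4."""
    def __init__(self, dims=(64, 128, 320, 512)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip((3,) + dims[:-1], dims)
        )
        # One norm per stage output; these are new layers, not loaded
        # from the classification checkpoint (whose self.norm is unused).
        for i, d in enumerate(dims, 1):
            setattr(self, f"norm{i}", nn.BatchNorm2d(d))

    def forward(self, x):
        outs = []
        for i, stage in enumerate(self.stages, 1):
            x = stage(x)
            outs.append(getattr(self, f"norm{i}")(x))
        return outs  # multi-scale features consumed by the FPN

feats = TinyBackbone()(torch.randn(1, 3, 64, 64))
print([f.shape[1] for f in feats])  # [64, 128, 320, 512]
```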

Andy1621 commented 2 years ago

@CanyonWind Thanks for your detailed questions!

About the two techniques:

About norm for detection task:

CanyonWind commented 2 years ago

Thanks for the clarifications. These are awesome!

Andy1621 commented 2 years ago

@CanyonWind We have verified that Token Labeling can help detection. Have a try!