Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0
819 stars · 111 forks

Pretrained window/hybrid SABlock backbone model for Detection task #12

Closed CanyonWind closed 2 years ago

CanyonWind commented 2 years ago

Hi, thank you for the contribution to this super-rad work!

I was wondering: in your experiments, do the backbone models used for the detection task with stage-3 window/hybrid SABlock (S-h14, B-h14) need to be pretrained on ImageNet?

If so, could these backbones with window/hybrid SABlock be released? And if not, are the weights loaded directly from the regular model with global attention in stage-3?

Thanks!

Andy1621 commented 2 years ago

Sorry for the late response. Actually, the backbones used in downstream tasks with window/hybrid SABlock do not need to be pre-trained on ImageNet. The weights are loaded directly.

Since we train the backbone with 224x224 images on ImageNet, the normal self-attention in stage-3 can be seen as window self-attention with a 14x14 window (224/16 = 14). Hence, for the downstream tasks, we only need to ensure the window size in stage-3 is no smaller than 14x14, so the pre-trained weights can be utilized well.
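The arithmetic behind this can be sketched as a small helper (the stride value of 16 is taken from the 224/16 = 14 calculation above; the function name is mine, not from the repo):

```python
def stage3_grid(img_size: int, stride: int = 16) -> int:
    """Token-grid side length at stage 3: the backbone has downsampled
    the input by an overall stride of 16 by that point, so global
    self-attention on a 224x224 image spans a 14x14 grid."""
    return img_size // stride

# Global attention during ImageNet pre-training covers a 14x14 grid,
# so any downstream window size >= 14 can reuse those weights directly.
print(stage3_grid(224))  # 14
```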

CanyonWind commented 2 years ago

I see, that makes a lot of sense.

Also, the detection README mentions that the backbones are not trained with Token Labeling and Layer Scale. Are there any particular reasons for that? In other words, do you think backbones with these two techniques might be a better choice here? Thanks!

Andy1621 commented 2 years ago

Thanks for your good questions. We do not use backbones trained with Token Labeling, for fair comparison, since most SOTA models are trained without it.

As for Layer Scale, it is good for avoiding NaN on ImageNet. We have tried using backbones with Layer Scale for video classification, but training sometimes fails to converge in the later epochs. It seems strange, but it does not affect other downstream tasks. Thus we use the same backbones (without the above two techniques) for all downstream tasks for fair comparison.

Actually, most current works use Layer Scale when training large models. I think it is the better choice. Moreover, backbones trained with Token Labeling also help downstream tasks like detection. If you only pursue better results, you can use those more powerful models with Token Labeling.
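For readers unfamiliar with the technique: Layer Scale (introduced in CaiT) multiplies each residual branch by a learnable per-channel vector initialized near zero, so deep models start close to identity mappings and train more stably. A minimal PyTorch sketch, assuming a CaiT-style init value of 1e-5 (this module and its default are my illustration, not UniFormer's exact code):

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel learnable scaling of a residual branch (CaiT-style).
    gamma starts near zero, so each block initially contributes little."""
    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x

# Typical use inside a transformer block:
#   x = x + layer_scale(attn(norm(x)))
ls = LayerScale(64)
out = ls(torch.randn(2, 196, 64))  # shape preserved: (2, 196, 64)
```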

CanyonWind commented 2 years ago

Thanks for sharing these. Please bear with my ignorance of these two techniques; I guess I am supposed to look up the details in the two papers. But since you are so proficient, it would be appreciated if you could briefly educate me on two questions:

CanyonWind commented 2 years ago

Also, when using the backbone for the detection task, the UniFormer module initializes a new norm layer for each stage (norm1-4) and discards the original self.norm layer (the one before the final classification head). Could you please share some insights on this implementation as well? Thanks a lot!
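For context on the pattern being asked about: detection backbones typically return one feature map per stage for the FPN neck, so each stage output gets its own freshly initialized norm, while the classification model's single pre-head norm has no role and is dropped. A toy sketch of that structure, assuming simple conv stages and BatchNorm (all names and layer choices here are my illustration, not UniFormer's actual code):

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical 4-stage backbone for detection: returns multi-scale
    features, one per stage, each passed through its own norm1..norm4."""
    def __init__(self, dims=(64, 128, 320, 512)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip((3,) + dims[:-1], dims)
        )
        # One norm per stage output; these are new layers, not loaded
        # from the classification checkpoint (whose self.norm is unused).
        for i, d in enumerate(dims, 1):
            setattr(self, f"norm{i}", nn.BatchNorm2d(d))

    def forward(self, x):
        outs = []
        for i, stage in enumerate(self.stages, 1):
            x = stage(x)
            outs.append(getattr(self, f"norm{i}")(x))
        return outs  # multi-scale features consumed by the FPN

feats = TinyBackbone()(torch.randn(1, 3, 64, 64))
print([f.shape[1] for f in feats])  # [64, 128, 320, 512]
```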

Andy1621 commented 2 years ago

@CanyonWind Thanks for your detailed questions!

About the two techniques:

About norm for detection task:

CanyonWind commented 2 years ago

Thanks for the clarifications. These are awesome!

Andy1621 commented 2 years ago

@CanyonWind We have verified that Token Labeling can help detection. Have a try!