facebookresearch / moco-v3

PyTorch implementation of MoCo v3: https://arxiv.org/abs/2104.02057

Any hyper parameter suggestions for other model architectures? #20

Closed Harick1 closed 2 years ago

Harick1 commented 2 years ago

I noticed that this repository only provides the results and experiment settings for ResNet-50 and the ViT series of models.

When I tried to reproduce the results, I found that the final linear probing accuracy is very sensitive to the hyperparameters, such as the learning rate, optimizer, and augmentations.

Are there any suggestions for training MoCo v3 on other models, such as EfficientNet or ResNet-101? And how should the hyperparameters be adjusted for different model architectures?

endernewton commented 2 years ago

Yeah, it is quite sensitive to the hyperparameters in linear probing, which we also observed. One idea we adopted in our more recent MAE work (https://arxiv.org/abs/2111.06377) is to add an additional, parameter-free BN layer ("BN0", i.e. BatchNorm without learnable affine weights) to normalize features on the fly before the linear classifier. Since the BN statistics can be absorbed into the classifier weights, it does not violate the linear probing protocol, and it helps reduce the hyperparameter search when switching to other architectures. For linear probing, the main parameter to search is the learning rate (likely due to different feature scales).
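
To make the idea concrete, here is a minimal PyTorch sketch of a parameter-free BN placed before the linear classifier (not the exact MAE code; `feat_dim` and `num_classes` are placeholders for your setup):

```python
import torch
import torch.nn as nn

# Sketch only: a stats-only BatchNorm ("BN0", affine=False so it has no learnable
# weight/bias) normalizes the frozen backbone features before the linear classifier.
# Its running mean/var can later be folded into the linear layer's weights.
feat_dim, num_classes = 2048, 1000  # placeholders

head = nn.Sequential(
    nn.BatchNorm1d(feat_dim, affine=False),
    nn.Linear(feat_dim, num_classes),
)

features = torch.randn(8, feat_dim)  # features from a frozen, pre-trained backbone
logits = head(features)
print(logits.shape)  # torch.Size([8, 1000])
```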

Harick1 commented 2 years ago

I read the MAE paper recently; it's really wonderful work!

But I'm very confused about the handling of the class token when pre-training the MAE model.

The paper said: As ViT has a class token [16], to adapt to this design, in our MAE pre-training we append an auxiliary dummy token to the encoder input. This token will be treated as the class token for training the classifier in linear probing and fine-tuning.

How can this dummy token work in linear probing, since it's not explicitly used in pre-training? Is there any other information I missed?
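
For context, this is how I understand the quoted design, as a rough PyTorch sketch (placeholder names, not the actual MAE code): a learnable dummy token is prepended to the visible patch tokens before the encoder, and that token's output slot is what the classifier later reads as the class token.

```python
import torch
import torch.nn as nn

# Hypothetical illustration only; embed_dim and the toy encoder are placeholders.
batch, num_visible, embed_dim = 8, 49, 768
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # the auxiliary "dummy" token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)

visible_patches = torch.randn(batch, num_visible, embed_dim)  # unmasked patch embeddings
tokens = torch.cat([cls_token.expand(batch, -1, -1), visible_patches], dim=1)
out = encoder(tokens)
cls_feature = out[:, 0]  # this slot is what linear probing / fine-tuning would use
print(cls_feature.shape)  # torch.Size([8, 768])
```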

farzadips commented 2 years ago

Hello, have you tried other architectures like EfficientNet?