Closed ziqipang closed 4 months ago
Hey. Did you find any configs?
@Alihjt No luck. One thing I noticed was that the per-GPU batch size seemed to influence numerical stability (accuracy improved with a smaller per-GPU batch size). Although I haven't had the chance to verify or explain the reason, using 16 GPUs x 64 images per GPU gave better performance than my previous run (8 GPUs x 128 images per GPU).
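To illustrate why the per-GPU batch size is the suspect here: both configurations above have the same global batch size, so the learning-rate scaling should be unaffected. A minimal sketch (the helper name is hypothetical, not part of the DeiT codebase):

```python
# Hypothetical helper comparing the two distributed configurations above.
# In synchronous data parallelism, the optimizer sees num_gpus * per_gpu_batch
# samples per step regardless of how they are split across devices.

def global_batch_size(num_gpus: int, per_gpu_batch: int) -> int:
    """Effective batch size seen by the optimizer per training step."""
    return num_gpus * per_gpu_batch

old_run = global_batch_size(num_gpus=8, per_gpu_batch=128)   # 1024
new_run = global_batch_size(num_gpus=16, per_gpu_batch=64)   # 1024

# Both runs use a global batch of 1024, so any accuracy gap comes from the
# per-GPU micro-batch itself (e.g., the order of mixed-precision reductions),
# not from a change in the effective learning-rate schedule.
print(old_run, new_run)
```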
Thank you for your excellent work and for sharing the code! I learned a lot from what you have described.
Recently, I have been trying to use DeiT to train a plain ViT-Base model. I could follow the documentation to reproduce the ViT-Tiny and ViT-Small results, but the same training procedure on ViT-Base only reaches 78.9% accuracy on ImageNet-1K, which is even worse than ViT-Small.
Therefore, I am wondering what could be the hidden tricks for training a good ViT-Base. Could you please share some hints? Thank you so much for the help!