What batch sizes other than 1024 have been tried when training a DeiT or ViT model? The DeiT paper (https://arxiv.org/abs/2012.12877) uses a batch size of 1024 and mentions that the learning rate should be scaled according to the batch size.
However, I was wondering whether anyone has successfully trained a DeiT model with a batch size smaller than 512. If so, what accuracy did you achieve?
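
For context, here is a minimal sketch of how I understand the scaling rule. It assumes the linear rule with a base learning rate of 5e-4 at a reference batch size of 512, which is my reading of the paper's hyperparameter table; the function name `scaled_lr` and its parameter names are only illustrative and not taken from any library.

```python
def scaled_lr(batch_size: int, base_lr: float = 5e-4, base_batch_size: int = 512) -> float:
    """Linearly scale the learning rate with the batch size.

    Assumed defaults: base_lr=5e-4 at base_batch_size=512 (my reading of the
    DeiT paper; adjust if your setup uses different reference values).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return base_lr * batch_size / base_batch_size


if __name__ == "__main__":
    # Print the learning rate this rule would give for a few batch sizes.
    for bs in (1024, 512, 256, 128):
        print(f"batch_size={bs:5d} -> lr={scaled_lr(bs):.6f}")
```

Under these assumptions a smaller batch simply gets a proportionally smaller learning rate. Whether that alone is enough to keep the accuracy is exactly what I am unsure about.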