SHI-Labs / Compact-Transformers

Escaping the Big Data Paradigm with Compact Transformers, 2021 (Train your Vision Transformers in 30 mins on CIFAR-10 with a single GPU!)
https://arxiv.org/abs/2104.05704
Apache License 2.0

Question about the batch size #64

Closed: imhgchoi closed this issue 1 year ago

imhgchoi commented 2 years ago

Hi, this work is awesome. I just have one little question. The paper says the total batch size for the CIFAR datasets is 128 and that 4 GPUs were used in parallel. That doesn't mean the total batch size is 128 * 4 = 512, does it? DDP is for ImageNet, and non-distributed training is for CIFAR, am I correct?

Thanks a ton :)

alihassanijr commented 1 year ago

Hi and thank you for your interest.

My apologies for noticing this issue so late. Yes, that is correct: CIFAR and most other small datasets were trained on a single GPU without DDP, so the batch size in the config file already reflects the final batch size. ImageNet and most of the fine-tuning experiments were distributed, so the effective batch size is the config value multiplied by 8 (the number of GPUs).
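
To make the arithmetic explicit, here is a minimal sketch (not the repository's actual training code) of how the effective batch size follows from the per-GPU config value under PyTorch DDP; `effective_batch_size` is a hypothetical helper added for illustration.

```python
# Minimal sketch, assuming standard PyTorch distributed semantics:
# each DDP process loads its own batch, so the total batch per
# optimizer step is per_gpu_batch_size * world_size.
import torch.distributed as dist


def effective_batch_size(per_gpu_batch_size: int) -> int:
    """Return the total batch size seen per optimizer step.

    Without DDP (single GPU, as for CIFAR here), the config value is
    already the final batch size. With DDP (as for ImageNet here), it
    gets multiplied by the world size (number of GPUs).
    """
    if dist.is_available() and dist.is_initialized():
        return per_gpu_batch_size * dist.get_world_size()
    return per_gpu_batch_size


# Example: a config batch size of 128 on a single GPU -> 128 total;
# the same value across 8 DDP processes -> 128 * 8 = 1024 total.
print(effective_batch_size(128))
```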

imhgchoi commented 1 year ago

No worries :) Thank you very much for your reply.