taweitang opened 1 week ago
Hello, I'm also confused about a related problem. In `ssl_default_config.yaml`, the params are `batch_size_per_gpu: 64` and `OFFICIAL_EPOCH_LENGTH: 1250`. And the README says "Run DINOv2 training on 4 A100-80GB nodes (32 GPUs) in a SLURM cluster environment with submitit".

So the global batch size is 32 × 64 = 2048, and multiplying by `OFFICIAL_EPOCH_LENGTH` gives 2,560,000 samples per "epoch", which is about twice the size of ImageNet. Could you please help check this? Thanks.
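For concreteness, here is the arithmetic as a quick sketch (the ImageNet-1k train size of 1,281,167 images is the standard figure; the other values are from the config and README quoted above):

```python
# Sanity check of the samples consumed per "official epoch" with the
# values from ssl_default_config.yaml and the README's 32-GPU setup.
batch_size_per_gpu = 64
num_gpus = 32                   # 4 nodes x 8 A100s
official_epoch_length = 1250    # iterations per "official epoch"

global_batch_size = batch_size_per_gpu * num_gpus            # 2048
samples_per_epoch = global_batch_size * official_epoch_length  # 2,560,000

imagenet_train_size = 1_281_167  # ImageNet-1k train split
print(samples_per_epoch / imagenet_train_size)  # ~2.0 passes over ImageNet
```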
I have the same question. In the released code, the training length does not depend on the number of GPUs.
Hi, I have been using this code to train on a custom dataset recently. I noticed that in `train.py`, `total_iters` is defined as `total_iters = cfg.optim["epochs"] * OFFICIAL_EPOCH_LENGTH`. This is confusing to me. To my knowledge, `total_iters` should equal epochs × (number of training samples / batch size).
Also, when training with multiple GPUs, I observed that each GPU runs `total_iters` iterations instead of `total_iters` / number of GPUs, which also confuses me (see the sketch below).
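For what it's worth, this matches the usual data-parallel accounting: every rank executes every iteration, but each iteration consumes a global batch of `world_size * batch_size_per_gpu` samples, so `total_iters` is not divided by the GPU count. A minimal sketch of that accounting (the function and variable names are illustrative, not from the repo):

```python
# Illustrative accounting for data-parallel training: each rank executes
# every iteration, and each iteration consumes one *global* batch.
def samples_consumed(total_iters: int, batch_size_per_gpu: int, world_size: int) -> int:
    # One optimizer step per iteration on every rank; the sampler shards
    # the data so ranks do not see duplicate samples within a step.
    return total_iters * batch_size_per_gpu * world_size

# Same total_iters on 1 GPU vs 32 GPUs: the 32-GPU run consumes 32x the
# data per iteration, which is why total_iters stays fixed per rank.
print(samples_consumed(125_000, 64, 1))   # 8,000,000
print(samples_consumed(125_000, 64, 32))  # 256,000,000
```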
Does anyone know the purpose of this design?
In addition, if I want to train my custom dataset for the number of iterations that matches the conventional definition of an epoch, what should I do? Just set `total_iters` to epochs × (number of training samples / global batch size)? Are there any other settings I need to modify? Thanks!
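One way to recover conventional epochs on a custom dataset, assuming the config keys behave as described in this thread (a sketch; the dataset size and GPU count below are placeholders):

```python
# Sketch: choose OFFICIAL_EPOCH_LENGTH so that one "official epoch"
# equals one pass over a custom dataset. All values are placeholders.
import math

dataset_size = 500_000       # images in the custom dataset
batch_size_per_gpu = 64      # from the training config
num_gpus = 8                 # total GPUs across all nodes

global_batch_size = batch_size_per_gpu * num_gpus
official_epoch_length = math.ceil(dataset_size / global_batch_size)
print(official_epoch_length)  # set this in the config; total_iters then
                              # becomes epochs * official_epoch_length
```

Note that any schedule lengths expressed in epochs in the config (warmup, for example) would scale with this value too, so those are worth rechecking after changing it.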