taweitang opened 1 week ago
Hello, I'm also confused about a related problem. In `ssl_default_config.yaml`, the params are `batch_size_per_gpu: 64` and `OFFICIAL_EPOCH_LENGTH: 1250`. And the README says "Run DINOv2 training on 4 A100-80GB nodes (32 GPUs) in a SLURM cluster environment with submitit".

So the global batch size is 32 × 64 = 2048, and multiplying by `OFFICIAL_EPOCH_LENGTH` gives 2,560,000 samples per "epoch", which is about twice the size of ImageNet. Could you please help check this? Thanks.
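For concreteness, here is the arithmetic as a quick sketch (the ImageNet-1k train size of 1,281,167 images is the standard figure; the other values are from the config and README quoted above):

```python
# Sanity check of the samples consumed per "official epoch" with the
# values from ssl_default_config.yaml and the README's 32-GPU setup.
batch_size_per_gpu = 64
num_gpus = 32                   # 4 nodes x 8 A100s
official_epoch_length = 1250    # iterations per "official epoch"

global_batch_size = batch_size_per_gpu * num_gpus            # 2048
samples_per_epoch = global_batch_size * official_epoch_length  # 2,560,000

imagenet_train_size = 1_281_167  # ImageNet-1k train split
print(samples_per_epoch / imagenet_train_size)  # ~2.0 passes over ImageNet
```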
I have the same question. In the released code, the training length does not depend on the number of GPUs.
Hi, I have been using this code to train on a custom dataset recently. I noticed that in `train.py`, `total_iters` is defined as `total_iters = cfg.optim["epochs"] * OFFICIAL_EPOCH_LENGTH`. This is confusing to me. To my knowledge, `total_iters` should equal epochs × (number of training samples / batch size).
Also, when training with multiple GPUs, I observed that each GPU runs `total_iters` iterations instead of `total_iters` / number of GPUs, which also confuses me (see the sketch below).
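For what it's worth, this matches the usual data-parallel accounting: every rank executes every iteration, but each iteration consumes a global batch of `world_size * batch_size_per_gpu` samples, so `total_iters` is not divided by the GPU count. A minimal sketch of that accounting (the function and variable names are illustrative, not from the repo):

```python
# Illustrative accounting for data-parallel training: each rank executes
# every iteration, and each iteration consumes one *global* batch.
def samples_consumed(total_iters: int, batch_size_per_gpu: int, world_size: int) -> int:
    # One optimizer step per iteration on every rank; the sampler shards
    # the data so ranks do not see duplicate samples within a step.
    return total_iters * batch_size_per_gpu * world_size

# Same total_iters on 1 GPU vs 32 GPUs: the 32-GPU run consumes 32x the
# data per iteration, which is why total_iters stays fixed per rank.
print(samples_consumed(125_000, 64, 1))   # 8,000,000
print(samples_consumed(125_000, 64, 32))  # 256,000,000
```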
Does anyone know the purpose of this design?
In addition, if I want to train my custom dataset for the number of iterations that matches the conventional definition of an epoch, what should I do? Just set `total_iters` to epochs × (number of training samples / global batch size)? Are there any other settings I need to modify? Thanks!
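One way to recover conventional epochs on a custom dataset, assuming the config keys behave as described in this thread (a sketch; the dataset size and GPU count below are placeholders):

```python
# Sketch: choose OFFICIAL_EPOCH_LENGTH so that one "official epoch"
# equals one pass over a custom dataset. All values are placeholders.
import math

dataset_size = 500_000       # images in the custom dataset
batch_size_per_gpu = 64      # from the training config
num_gpus = 8                 # total GPUs across all nodes

global_batch_size = batch_size_per_gpu * num_gpus
official_epoch_length = math.ceil(dataset_size / global_batch_size)
print(official_epoch_length)  # set this in the config; total_iters then
                              # becomes epochs * official_epoch_length
```

Note that any schedule lengths expressed in epochs in the config (warmup, for example) would scale with this value too, so those are worth rechecking after changing it.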