facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Performance degradation for multi node training #303

Closed GravityZL closed 11 months ago

GravityZL commented 11 months ago

Hi, I have a cluster without Slurm (an AzureML cluster with Tesla V100 GPUs). I compared training on a single GPU, on multiple GPUs within one node, and across multiple nodes. The results show a significant performance degradation when training across multiple nodes:

Training takes 1 day on one GPU and on multiple GPUs within one node, but roughly 3 days with 2 nodes, and even longer with 4 nodes.

Do you have any idea what is happening here? Any help would be much appreciated.

Thank you!
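
For reference, the timings reported above can be turned into a speedup and a scaling efficiency. This is only a rough sketch: it assumes the total workload is the same in both runs, takes "roughly 3 days" as exactly 3, and the function name `scaling_efficiency` is made up for illustration.

```python
def scaling_efficiency(t_one_node_days, t_multi_node_days, num_nodes):
    """Return (speedup, efficiency) of a multi-node run relative to one node.

    Assumes both runs process the same total workload.
    """
    speedup = t_one_node_days / t_multi_node_days
    efficiency = speedup / num_nodes  # 1.0 would be perfect linear scaling
    return speedup, efficiency


# Numbers from the report: 1 day on one node, roughly 3 days on 2 nodes.
speedup, efficiency = scaling_efficiency(1.0, 3.0, 2)
print(f"speedup={speedup:.2f}x, efficiency={efficiency:.1%}")
# prints: speedup=0.33x, efficiency=16.7%
```

A speedup below 1x means adding the second node made training slower in absolute terms, which usually points at inter-node communication rather than compute.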

usuyama commented 11 months ago

Are you using InfiniBand-connected nodes on AzureML?

GravityZL commented 11 months ago

Yes. The issue has now been fixed: training runs on multiple nodes with no performance degradation.

GravityZL commented 11 months ago

Issue solved

usuyama commented 11 months ago

What was the issue, and how did you solve it?

TumVink commented 1 month ago

@GravityZL Hey, could you share how you solved the issue? Also, if you do not mind, I think your script for training DINO on multiple nodes without SLURM would be valuable too.

Thanks!
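
The thread does not record what the fix was or which script was used. As a hedged sketch of one common way to launch multi-node training without SLURM, the snippet below builds a `torchrun` command for each node. The helper name `build_torchrun_command`, the script name `train.py`, the address `10.0.0.1`, and the 8 processes per node are all placeholders for illustration, not values from this thread. Setting the NCCL environment variable `NCCL_DEBUG=INFO` makes NCCL log which network transport it selects, which may help check whether InfiniBand is actually in use.

```python
import shlex


def build_torchrun_command(nnodes, node_rank, nproc_per_node, master_addr,
                           master_port=29500, script="train.py", script_args=()):
    """Build the torchrun argument list for one node of a multi-node job.

    `script` and `script_args` are hypothetical placeholders; substitute the
    actual training entry point and its arguments.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",          # 0 on the master node
        f"--nproc_per_node={nproc_per_node}",  # typically one process per GPU
        f"--master_addr={master_addr}",      # reachable address of node 0
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


# Print the command to run on each of two nodes (placeholder values).
for rank in range(2):
    cmd = build_torchrun_command(nnodes=2, node_rank=rank,
                                 nproc_per_node=8, master_addr="10.0.0.1")
    print("NCCL_DEBUG=INFO " + shlex.join(cmd))
```

Each printed line would be run on its own node; only `--node_rank` differs between them.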