Closed GravityZL closed 11 months ago
Are you using InfiniBand-connected nodes on AzureML?
Yes. The issue has now been fixed: training runs on multiple nodes with no performance degradation.
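For others who hit the same slowdown: a minimal sketch (assuming NCCL as the distributed backend, which PyTorch uses by default for GPUs) of how to verify that inter-node traffic actually goes over InfiniBand rather than plain Ethernet. The interface name `eth0` below is an assumption; check `ip addr` on your nodes.

```shell
# Print NCCL's transport selection at startup; in the logs, look for
# "NET/IB" (InfiniBand) vs "NET/Socket" (TCP, much slower across nodes).
export NCCL_DEBUG=INFO

# Make sure InfiniBand support is not disabled (0 = use IB if available).
export NCCL_IB_DISABLE=0

# Optionally pin the network interface NCCL uses for bootstrap traffic.
export NCCL_SOCKET_IFNAME=eth0
```

If the logs show `NET/Socket` on an InfiniBand-equipped cluster, the IB drivers or the container's RDMA libraries are usually the first thing to check.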
Issue solved
What was the issue, and how did you solve it?
@GravityZL Hey, could you share how you solved the issue? And if you don't mind, I think your script for training DINOv2 on multiple nodes without SLURM would also be valuable.
Thanks!
Hi, I have a cluster without SLURM (an AzureML cluster with Tesla V100 GPUs). I compared training on a single GPU, on multiple GPUs within one node, and across multiple nodes. The results show a significant performance degradation when training on multiple nodes:
training finishes in about one day on a single GPU or on multiple GPUs within one node, but takes roughly three days on two nodes, and even longer on four nodes.
Do you have any idea what is happening here? Any help would be much appreciated.
Thank you!
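For reference, multi-node training without SLURM is typically launched by running `torchrun` directly on each node. A minimal sketch, assuming 2 nodes with 8 GPUs each; the master address, port, and entry-point/config paths are assumptions taken from the DINOv2 README, so substitute your head node's IP and the script you actually use:

```shell
# Run the SAME command on every node; only NODE_RANK differs (0..nnodes-1).
# Replace 10.0.0.4 with the reachable IP of your rank-0 node.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank="${NODE_RANK}" \
  --master_addr=10.0.0.4 \
  --master_port=29500 \
  dinov2/train/train.py \
  --config-file dinov2/configs/train/vitl16_short.yaml
```

With slow inter-node collectives (e.g. no InfiniBand), this setup will show exactly the symptom described above: near-linear scaling within a node, but a large slowdown once gradients have to be all-reduced across nodes.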