Open huybery opened 1 year ago
Multi-node training is something we're planning to start looking into very soon (in the next week). Unfortunately our cluster is down for maintenance for the next ~5 days, so we won't be able to do any development/testing before then. Trying to secure alternative compute, but unsure if anything will come through before our cluster is back.
If you have access to a multi-node cluster, I think you could try running our code in a multi-node setup with relatively few modifications. The discussion in this issue might be a starting point for what needs to change when going from single-node to multi-node. Happy to discuss/debug if you have the time/compute to try it out yourself :)
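As a rough illustration of one thing that typically changes between single-node and multi-node launches (a general sketch, not taken from this repository): on a single node the global rank and the local GPU index coincide, while across nodes they differ. The snippet below assumes the job is started with a launcher such as `torchrun`, which sets the `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and `LOCAL_WORLD_SIZE` environment variables. The names `DistEnv` and `read_dist_env` are hypothetical helpers, and the `node_rank` calculation assumes every node runs the same number of processes with ranks assigned contiguously per node.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the launcher-provided rank information."""
    rank: int              # global rank across all nodes
    local_rank: int        # index of this process on its own node
    world_size: int        # total number of processes in the job
    local_world_size: int  # number of processes on this node

    @property
    def node_rank(self) -> int:
        # Assumption: identical process count on every node and ranks
        # assigned contiguously per node.
        return self.rank // self.local_world_size


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read rank information, defaulting to a single-process run."""
    env = os.environ if env is None else env
    dist = DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        local_world_size=int(env.get("LOCAL_WORLD_SIZE", "1")),
    )
    # Fail early on inconsistent values instead of hanging later
    # inside a collective operation.
    if not 0 <= dist.rank < dist.world_size:
        raise ValueError(f"RANK {dist.rank} outside [0, {dist.world_size})")
    if not 0 <= dist.local_rank < dist.local_world_size:
        raise ValueError(
            f"LOCAL_RANK {dist.local_rank} outside [0, {dist.local_world_size})"
        )
    if dist.local_world_size > dist.world_size:
        raise ValueError("LOCAL_WORLD_SIZE exceeds WORLD_SIZE")
    return dist
```

In a setup like this, `local_rank` would pick the GPU on the current node, while `rank` and `world_size` would be passed to the process group initialization. Using the global rank as a device index is a common single-node shortcut that stops working once a second node is added.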
Thanks for your quick response! I'm working on adapting the code for multi-node training, but I'm running into some obstacles: the job hangs when run across multiple nodes. I'd be happy to help debug the multi-node code together. Could you develop a first version for me to debug? I'm worried about missing some key points, as I'm not familiar with FSDP.
Excuse me, is multi-node training almost ready?
Sorry for the slow progress on this; the last few weeks have been much busier than expected. I don't have a clear timeline for multi-node at this point, unfortunately. I might be able to test some things this week, but with ICML prep I'm not 100% sure.
+1
+1
I haven't found any good best practices for multi-node FSDP. Have you tried it? Thank you in advance. :)