Is there a plan to support multi-node training?

eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)

Apache License 2.0

2.21k stars 185 forks

Is there a plan to support multi-node training? #5

Open huybery opened 1 year ago

huybery commented 1 year ago

I haven't found a good best practice for multi-node training with FSDP. Have you tried it? Thank you in advance. :)

eric-mitchell commented 1 year ago

Multi-node training is something we're planning to start looking into very soon (in the next week). Unfortunately our cluster is down for maintenance for the next ~5 days, so we won't be able to do any development/testing before then. Trying to secure alternative compute, but unsure if anything will come through before our cluster is back.

If you have access to a multi-node cluster, I think you could try running our code in a multi-node setup with relatively few modifications. The discussion in this issue might be a starting point for what needs to change when going from single-node to multi-node. Happy to discuss/debug if you have the time/compute to try it out yourself :)
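For readers wondering what typically changes between single-node and multi-node launches, here is a minimal sketch. It is not taken from this repository, and the names `NodeLayout`, `global_rank`, and `rendezvous_env` are hypothetical. It assumes every node runs the same number of processes (one per GPU) and that the process group is initialized through PyTorch's `env://` method, which reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment. The main point is that the rank of a process can no longer be just its local GPU index, and the master address can no longer be `localhost`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLayout:
    """Hypothetical description of a cluster with identical nodes."""
    num_nodes: int       # total number of machines
    gpus_per_node: int   # processes (GPUs) per machine
    node_rank: int       # index of this machine, 0 .. num_nodes - 1
    master_addr: str     # address of the node hosting rank 0, reachable by all nodes
    master_port: int     # free TCP port on the master node


def global_rank(layout: NodeLayout, local_rank: int) -> int:
    """Map a per-node GPU index to a rank that is unique across the cluster."""
    if not 0 <= layout.node_rank < layout.num_nodes:
        raise ValueError("node_rank out of range")
    if not 0 <= local_rank < layout.gpus_per_node:
        raise ValueError("local_rank out of range")
    return layout.node_rank * layout.gpus_per_node + local_rank


def rendezvous_env(layout: NodeLayout, local_rank: int) -> dict:
    """Environment variables one worker process would need for env:// init."""
    return {
        "MASTER_ADDR": layout.master_addr,
        "MASTER_PORT": str(layout.master_port),
        "WORLD_SIZE": str(layout.num_nodes * layout.gpus_per_node),
        "RANK": str(global_rank(layout, local_rank)),
        # LOCAL_RANK is what the worker would use to pick its CUDA device.
        "LOCAL_RANK": str(local_rank),
    }


if __name__ == "__main__":
    # Second node (node_rank=1) of a two-node cluster with 8 GPUs each.
    layout = NodeLayout(num_nodes=2, gpus_per_node=8, node_rank=1,
                        master_addr="10.0.0.1", master_port=29500)
    for key, value in rendezvous_env(layout, local_rank=3).items():
        print(f"{key}={value}")
```

In a real launch, each worker would export these values before calling `torch.distributed.init_process_group`, and would select its device from the local rank rather than the global rank. A mismatch in `WORLD_SIZE` or an unreachable `MASTER_ADDR` is a common reason for a job to hang at startup.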

huybery commented 1 year ago

Thanks for your quick response! I'm working on modifying the code for multi-node training, but at the moment I'm running into some obstacles: the job hangs when run on multiple nodes. I'd be happy to help you debug the multi-node code together. Maybe you could develop a first version for me to debug? I'm worried about missing some key points, as I'm not familiar with FSDP.

liumingzhu6060 commented 1 year ago

Excuse me, is multi-node training almost ready?

eric-mitchell commented 1 year ago

Sorry for the slow progress on this; the last few weeks have been much busier than expected. I don't have a clear timeline for multi-node at this point, unfortunately. I might be able to test some things this week, but with ICML prep I'm not 100% sure.

AltenLi commented 1 year ago

LMXKO commented 1 year ago