We have a guide on doing distributed training with Vast here: https://docs.google.com/document/d/1W_dN3qarCOcLRDdEZ75LBtkLGiwUziWWDtVTjd43Ad4/edit?usp=sharing . However, we have not yet performed full distributed training runs. This issue does not track specific problems; it collects general things we should consider and topics we may want to research.
Currently, we have a limited TPU grant and will want to validate our training on distributed TPUs. This should be configurable since we are using Hugging Face's Accelerate, but we should review the best practices first: https://huggingface.co/docs/accelerate/concept_guides/training_tpu
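As a reference point, here is a minimal sketch of a training step written against Accelerate's documented API. The model, optimizer, and dataloader are placeholders for our own objects, and mixed_precision="bf16" is one TPU-relevant option we would still want to verify against the guide above; this is not our actual trainer code.

```python
# Minimal sketch of an Accelerate training step that should run unchanged on
# CPU, multi-GPU, or TPU. model/optimizer/dataloader are placeholders.
from accelerate import Accelerator


def train(model, optimizer, dataloader, epochs=1):
    # bf16 mixed precision is commonly suggested for TPUs; confirm against
    # the Accelerate TPU concept guide before relying on it.
    accelerator = Accelerator(mixed_precision="bf16")

    # prepare() wraps everything for the current device / distributed setup.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(**batch).loss  # assumes an HF-style model output
            # Use accelerator.backward() instead of loss.backward() so
            # scaling and gradient synchronization are handled for us.
            accelerator.backward(loss)
            optimizer.step()
```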
As we develop, we should occasionally validate on any distributed hardware (e.g., rent a Vast instance for a couple of hours and verify), since most of what we need to validate or consider will transfer. For example, this code block calls wait_for_everyone, which is needed before saving in distributed runs (regardless of TPU vs. GPU):
https://github.com/ManifoldRG/gato-control/blob/c906c50f5ffeb755da8e36c84a5e14a7a2566e31/gato/training/trainer.py#L98-L104
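For reference, the general checkpointing pattern that snippet follows (and that we should mirror anywhere we save) looks roughly like the Accelerate docs' example below; the model variable and checkpoint path are placeholders, not our actual trainer state.

```python
# Rough sketch of the save-after-synchronization pattern from the Accelerate
# docs; `model` and the checkpoint path are placeholders.
from accelerate import Accelerator

accelerator = Accelerator()
# ... training loop ...

# Block until every process reaches this point, so no worker is still
# updating weights or optimizer state when rank 0 starts saving.
accelerator.wait_for_everyone()

# Unwrap the model from its distributed wrapper before saving, and only
# write the checkpoint from the main process to avoid clobbered files.
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    accelerator.save(unwrapped_model.state_dict(), "checkpoint.pt")
```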