ManifoldRG / NEKO

In-progress implementation of a GATO-style generalist multimodal model capable of image, text, RL, and robotics tasks

Distributed Training #11

Open · daniellawson9999 opened this issue 1 year ago

daniellawson9999 commented 1 year ago

We have a guide on doing distributed training with Vast here: https://docs.google.com/document/d/1W_dN3qarCOcLRDdEZ75LBtkLGiwUziWWDtVTjd43Ad4/edit?usp=sharing . However, we have not yet performed full distributed training runs. This issue does not track one specific problem, but rather general things we should consider and topics we may want to research.

Currently, we have a limited TPU grant and will want to validate our training on distributed TPUs. This can be configured since we are using Hugging Face's Accelerate, but we should review the best practices: https://huggingface.co/docs/accelerate/concept_guides/training_tpu
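
For reference, a minimal sketch of the kind of Accelerate loop that should run unchanged on multi-GPU and TPU setups; the model/optimizer/dataloader here are placeholders rather than our actual trainer code. One TPU-specific point from the guide is to keep batch shapes static so XLA does not recompile:

```python
# Minimal sketch, not the actual NEKO trainer: the same Accelerate loop runs on
# multi-GPU and (with `accelerate config`) TPU. The model, optimizer, and
# dataloader are placeholders for whatever the trainer builds.
from accelerate import Accelerator

def train(model, optimizer, dataloader, num_epochs=1):
    accelerator = Accelerator()
    # prepare() wraps the objects for the current distributed setup (DDP, TPU, ...)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(**batch).loss      # assumes an HF-style model returning .loss
            accelerator.backward(loss)      # use instead of loss.backward()
            optimizer.step()
```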

As we develop, we should occasionally validate on distributed hardware (e.g., just get a Vast instance for a couple of hours and verify), as most of the things we need to validate or consider will transfer. For example, this code block has a wait_for_everyone() call, which is needed before saving in distributed runs (regardless of TPU vs. GPU):

https://github.com/ManifoldRG/gato-control/blob/c906c50f5ffeb755da8e36c84a5e14a7a2566e31/gato/training/trainer.py#L98-L104
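
For context, a hedged sketch of that save pattern (variable names are illustrative, not necessarily the exact ones in trainer.py):

```python
# Hedged sketch of the save pattern in the linked block; `save_dir` and `model`
# are illustrative names.
import os

def save_checkpoint(accelerator, model, save_dir):
    accelerator.wait_for_everyone()               # block until all processes reach this point
    unwrapped = accelerator.unwrap_model(model)   # strip the DDP/TPU wrapper
    os.makedirs(save_dir, exist_ok=True)
    # accelerator.save only writes once per machine, avoiding clobbered checkpoint files
    accelerator.save(unwrapped.state_dict(), os.path.join(save_dir, "model.pt"))
```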

daniellawson9999 commented 1 year ago

Add a config option and test: if training_args.gradient_checkpointing: model.gradient_checkpointing_enable()

https://huggingface.co/docs/transformers/v4.18.0/en/performance
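
A minimal sketch of what that could look like, assuming a dataclass-style args object and a model that exposes the Transformers-style gradient_checkpointing_enable(); the names below are illustrative, not the project's actual config class:

```python
# Hedged sketch: a gradient_checkpointing flag in the training config, enabled
# on the model if requested. `TrainingArgs` is illustrative.
from dataclasses import dataclass

@dataclass
class TrainingArgs:
    gradient_checkpointing: bool = False

def apply_memory_options(model, training_args: TrainingArgs):
    if training_args.gradient_checkpointing:
        # Recompute activations in the backward pass instead of storing them,
        # trading extra compute for a large reduction in activation memory.
        model.gradient_checkpointing_enable()
    return model
```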