Ever wondered how to train a large neural network across a giant cluster? Look no further!
This is a comprehensive guide on best practices for distributed training, diagnosing errors, and fully utilizing all available resources. It is organized into sequential chapters, each containing a `README.md` and a `train_llm.py` script. The README discusses both the high-level concepts of distributed training and the code changes introduced in that chapter.
The guide is written entirely in minimal, standard PyTorch, using `transformers` and `datasets` for models and data, respectively. No other library is used for distributed code - the distributed parts are entirely plain PyTorch.
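For a sense of what "entirely in PyTorch" means, all of the coordination primitives come from `torch.distributed`. Here is a minimal single-process sketch (not code from the guide's chapters; the `gloo` backend and localhost address are illustrative stand-ins for what a launcher like `torchrun` would configure):

```python
import os
import torch
import torch.distributed as dist

# Launchers like torchrun normally set these env vars; hardcoded here for a demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# gloo works on CPU; nccl is the usual choice for multi-GPU training.
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t)  # sums the tensor across all ranks (a no-op with 1 rank)
print(int(t.sum()))

dist.destroy_process_group()
```

With more than one process, the same `all_reduce` call is what averages gradients across workers.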
Best practices for logging stdout/stderr and wandb are also included, as logging is vitally important for diagnosing and debugging training runs on a cluster.
Each of the training scripts is aimed at training a causal language model (e.g. GPT, Llama).
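Concretely, "causal" means each position only predicts the next token. A toy sketch of that loss with plain PyTorch (hypothetical vocabulary and sequence sizes; random logits stand in for a real model's output):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 10, 6
input_ids = torch.randint(0, vocab_size, (1, seq_len))  # a batch of token ids
logits = torch.randn(1, seq_len, vocab_size)            # stand-in model output

# Position t predicts token t+1, so shift logits and labels by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    input_ids[:, 1:].reshape(-1),
)
print(loss.item())  # a non-negative scalar
```

Models like `AutoModelForCausalLM` compute this same shifted loss internally when you pass `labels`.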
```bash
git clone https://github.com/LambdaLabsML/distributed-training-guide.git
cd distributed-training-guide
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt
```
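After installing, a quick sanity check that PyTorch is importable and can see your GPUs (the printed values depend entirely on your environment):

```python
import torch

# torch.__version__ confirms the install; cuda.is_available() checks GPU access
print(torch.__version__, torch.cuda.is_available())
```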
This tutorial uses `wandb` as an experiment tracker.

```bash
wandb login
```
🦄 Other exciting ML projects at Lambda: ML Times, Text2Video, GPU Benchmark.