
Distributed Training Guide

This guide aims to be a comprehensive resource on best practices for distributed training, diagnosing errors, and fully utilizing all available resources.

Questions this guide answers:


Best practices for logging stdout/stderr and wandb are also included, since logging is vitally important for diagnosing and debugging training runs on a cluster.
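
To give a flavor of what that looks like, here is a minimal sketch (not this guide's exact setup) of per-rank file logging using Python's standard logging module. It assumes the RANK environment variable is set by the launcher (e.g. torchrun); the rank-N.log naming scheme is purely illustrative:

# A minimal per-rank logging sketch, not this guide's exact setup.
# Assumes the launcher (e.g. torchrun) sets the RANK environment variable;
# the rank-N.log file naming is purely illustrative.
import logging
import os

rank = int(os.environ.get("RANK", 0))
logging.basicConfig(
    level=logging.INFO,
    format=f"[rank {rank}] %(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(f"rank-{rank}.log"),  # one log file per rank
        logging.StreamHandler(),  # also echo to the console
    ],
)
logging.info("training started")

Writing one file per rank keeps output from different workers from interleaving, which makes failures on a single node much easier to trace.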

How to read

This guide is organized into sequential chapters, each containing a README.md and a train_llm.py script. The README discusses the changes introduced in that chapter and goes into more detail.

Each of the training scripts is aimed at training a causal language model (i.e. GPT).
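
For orientation before diving into the chapters, below is a rough single-GPU sketch of what such a causal language model training loop looks like. The gpt2 checkpoint and the toy batch are placeholders for illustration, not the guide's actual model or data, and the real train_llm.py scripts add distributed setup, checkpointing, and logging on top:

# Bare-bones causal-LM training loop sketch (single GPU, no distribution).
# "gpt2" and the toy batch below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

texts = ["hello world"] * 8  # stand-in for a real dataset
batch = tokenizer(texts, return_tensors="pt", padding=True).to("cuda")

model.train()
for step in range(10):
    # For causal LM training, passing labels=input_ids makes the model
    # compute the shifted next-token prediction loss internally.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()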

Set up

Clone this repo

git clone https://github.com/LambdaLabsML/distributed-training-guide.git

Virtual Environment

cd distributed-training-guide
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt

wandb

This tutorial uses wandb as an experiment tracker.

wandb login
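
Once logged in, the training scripts can report metrics to wandb. As a minimal sketch of the wandb API (the project and run names below are placeholders, not the ones this guide uses):

# Minimal wandb usage sketch; project and run names are placeholders.
import wandb

wandb.init(project="distributed-training-guide", name="example-run")
for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    wandb.log({"loss": loss}, step=step)
wandb.finish()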