ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Add Option for Memory Optimized Training via Gradient Checkpointing #178

Closed klei22 closed 2 months ago

klei22 commented 2 months ago

Description:

This PR introduces gradient checkpointing to reduce memory usage during model training.
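As a rough sketch of the technique (not this PR's actual diff), gradient checkpointing in PyTorch is typically applied per block via `torch.utils.checkpoint`; the module names below (`TinyMLP`, `TinyModel`) are illustrative, not from this repo:

```python
import torch
from torch.utils.checkpoint import checkpoint

class TinyMLP(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, 4 * dim)
        self.fc2 = torch.nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class TinyModel(torch.nn.Module):
    def __init__(self, n_layers=4, dim=64, use_checkpointing=False):
        super().__init__()
        self.blocks = torch.nn.ModuleList(TinyMLP(dim) for _ in range(n_layers))
        self.use_checkpointing = use_checkpointing

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpointing and self.training:
                # Activations inside `block` are not kept for backward;
                # they are recomputed during the backward pass instead.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

model = TinyModel(use_checkpointing=True)
x = torch.randn(8, 64, requires_grad=True)
loss = model(x).sum()
loss.backward()  # gradients flow through the recomputed activations
```

The savings come from not storing intermediate activations for checkpointed blocks, at the cost of one extra forward pass per block during backward.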

Changes:

Adds a configuration option to enable gradient checkpointing during training.

Benefits:

Lower peak activation memory during training, which allows larger batch sizes or longer context lengths on the same hardware.

Trade-offs:

Longer training time, since checkpointed activations must be recomputed during the backward pass.


Checklist:

klei22 commented 2 months ago

Wanted to note that for maximum memory savings, you should not add the --compile flag.

For a context length of 1024 ('gc' stands for gradient checkpointing):
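Memory numbers like these depend on hardware and model size, but a minimal way to compare peak memory with and without gc is `torch.cuda.max_memory_allocated`; the sketch below uses illustrative layer sizes, not the repo's model, and falls back to returning None on CPU:

```python
import torch
from torch.utils.checkpoint import checkpoint

def train_step_peak_mem(use_gc, n_layers=8, dim=256, batch=32):
    """Run one forward/backward pass over a stack of linear blocks and
    return peak CUDA memory in bytes (None on CPU, where PyTorch has no
    comparable counter)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    blocks = torch.nn.ModuleList(
        torch.nn.Linear(dim, dim) for _ in range(n_layers)
    ).to(device)
    x = torch.randn(batch, dim, device=device, requires_grad=True)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
    h = x
    for block in blocks:
        # With gc, activations inside `block` are discarded after the
        # forward pass and recomputed during backward.
        h = checkpoint(block, h, use_reentrant=False) if use_gc else block(h)
    h.sum().backward()
    return torch.cuda.max_memory_allocated() if device == "cuda" else None

# On a GPU, train_step_peak_mem(True) is typically lower than
# train_step_peak_mem(False), at the cost of extra recompute time.
```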