Open le1nux opened 4 months ago
During the training of the 3.6B and 7B models with FSDP we experienced a loss spike as the model was moving towards convergence.
Things that we should check in our implementation:
Addressed in PR #143
GPT2 implementation (we could train a small model directly from Hugging Face for comparison)
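A minimal sketch of such a comparison run, assuming the Hugging Face `transformers` library: instantiate a tiny randomly initialised GPT-2 via `GPT2Config`/`GPT2LMHeadModel`, overfit a single batch, and record the loss curve so it can be checked against our own GPT2 implementation (the config sizes and dummy data here are illustrative, not the issue's actual training setup).

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)

# Tiny GPT-2 so the sanity check runs in seconds on CPU (sizes are illustrative).
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=256, n_positions=128)
model = GPT2LMHeadModel(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Overfit one fixed random batch; the loss should decrease monotonically-ish.
batch = torch.randint(0, config.vocab_size, (4, 32))

losses = []
model.train()
for step in range(30):
    # GPT2LMHeadModel shifts the labels internally for next-token prediction.
    out = model(input_ids=batch, labels=batch)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    losses.append(out.loss.item())
```

Plotting `losses` from this reference model next to the same curve from our implementation (same config, same data, same optimizer settings) would localise whether the spike comes from the model code or from the FSDP training loop.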