RogerChern opened 11 months ago
Hi! The setup you shared in your first snippet is very different from the setup in https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py#L66. Can you share all the changes you made to the repo? You can run:

git diff > changes.diff

and then post the changes.diff file here.
cc @awaelchli in case you are familiar with this
Same problem here.
I'm training TinyLlama with 8 A40s. Everything goes very smoothly until I try to increase the micro batch size for a better computation-to-communication ratio. Following the official lit-gpt tutorial, I pass

activation_checkpointing_policy={Block}

into FSDPStrategy; the modified setup is attached below. But I get some strange errors about activation checkpointing. Could someone shed some light on this? Anything informative would be a big help.
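For context, here is a minimal sketch of where that change goes, assuming lit-gpt's standard imports; the device count and precision below are placeholders, not the exact values from my run (the full modified setup is attached below):

```python
# Minimal sketch of the FSDP setup with activation checkpointing enabled.
# Assumes lit-gpt is installed; devices/precision are placeholder values.
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

from lit_gpt.model import Block

strategy = FSDPStrategy(
    auto_wrap_policy={Block},
    # the added line: checkpoint activations at the Block level
    activation_checkpointing_policy={Block},
    state_dict_type="full",
)

fabric = L.Fabric(devices=8, strategy=strategy, precision="bf16-mixed")
fabric.launch()
```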