Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.38k stars, 3.38k forks

Resume training from checkpoints #20361

Open ArkashJ opened 3 weeks ago

ArkashJ commented 3 weeks ago

📚 Documentation

There's a lot of documentation out there about passing the `resume_from_checkpoint` keyword to the PyTorch Lightning `Trainer`; however, this is wrong. In the latest PyTorch Lightning version, one needs to pass the path to the checkpoint (the `.ckpt` file) itself to the trainer's `fit` function to resume training. Here are some popular incorrect references:

1. https://stackoverflow.com/questions/71961436/pytorch-lightning-resuming-from-checkpoint-with-new-data
2. https://lightning.ai/forums/t/how-to-resume-training/432
3. https://github.com/Lightning-AI/pytorch-lightning/discussions/12845
4. https://www.youtube.com/watch?v=V5KGEzIwAxQ
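For anyone landing here, a minimal sketch of the working pattern: the checkpoint path goes to `trainer.fit` through its `ckpt_path` argument rather than to the `Trainer` constructor. The helper names `latest_checkpoint` and `resume_fit`, the `checkpoints` directory, and the `import lightning as L` spelling are assumptions for illustration (with the older package name this would be `import pytorch_lightning as pl`); `model` stands for your own `LightningModule`.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the most recently modified .ckpt file under ckpt_dir, or None.

    Hypothetical helper, not part of Lightning.
    """
    candidates = sorted(
        Path(ckpt_dir).glob("**/*.ckpt"),
        key=lambda p: p.stat().st_mtime,
    )
    return str(candidates[-1]) if candidates else None


def resume_fit(model, ckpt_dir: str = "checkpoints", max_epochs: int = 10):
    """Fit `model`, resuming from the newest checkpoint in ckpt_dir if one exists.

    Hypothetical wrapper; `model` is assumed to be a LightningModule that
    defines its own dataloaders.
    """
    # Imported lazily so latest_checkpoint() works without Lightning installed.
    import lightning as L

    # Note: no resume_from_checkpoint argument on the Trainer.
    trainer = L.Trainer(max_epochs=max_epochs, default_root_dir=ckpt_dir)

    # ckpt_path=None trains from scratch; a path to a .ckpt file resumes from it.
    trainer.fit(model, ckpt_path=latest_checkpoint(ckpt_dir))
    return trainer
```

If the checkpoint was already saved at or beyond `max_epochs`, I'd expect the resumed run to stop almost immediately, so raise `max_epochs` when continuing a finished run.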

ChatGPT and Claude also got this wrong.

I wanted this to get visibility because knowing how to resume training from checkpoints is imperative and there's a lot of wrong information out there!

cc @borda

arijit-hub commented 3 weeks ago

Hi,

The correct usage is indeed described in the official documentation: https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html#resume-training-state

I think the wrong solutions may have been popularized because of a previous version.

ArkashJ commented 2 weeks ago

Yup, I went to the docs to figure out the correct solution. I hope this gets sufficient attention, because most of the top results and AI-generated answers are wrong.
