Open ArkashJ opened 3 weeks ago
Hi,
The correct definition is indeed mentioned in the official documentation: https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html#resume-training-state
I think the wrong solutions may have become popular because they applied to a previous version.
Yup, I went to the docs to figure out the correct solution. I hope this gets sufficient attention, because most of the top search results and AI-generated answers are wrong.
📚 Documentation
There's a lot of material out there about using the `resume_from_checkpoint` keyword in a PyTorch Lightning `Trainer`; however, this is wrong. In the latest PyTorch Lightning version, one needs to pass the path to the checkpoint (the `.ckpt` file) itself to the trainer's `fit` function, via its `ckpt_path` argument, to resume training. Here are some popular incorrect references:

1. https://stackoverflow.com/questions/71961436/pytorch-lightning-resuming-from-checkpoint-with-new-data
2. https://lightning.ai/forums/t/how-to-resume-training/432
3. https://github.com/Lightning-AI/pytorch-lightning/discussions/12845
4. https://www.youtube.com/watch?v=V5KGEzIwAxQ

ChatGPT and Claude also got this wrong.
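For anyone landing here from a search, this is a minimal sketch of the pattern described in the linked docs: the checkpoint path goes to `fit` through `ckpt_path`, not to the `Trainer` constructor. The helper name `resume_fit`, the `MyModel` class, and the example path are hypothetical and only for illustration; the file-existence check is my own addition and not something Lightning requires.

```python
from pathlib import Path


def resume_fit(trainer, model, ckpt_path, **fit_kwargs):
    """Resume training by handing the .ckpt file to trainer.fit.

    Hypothetical convenience wrapper: `trainer` is expected to be a
    Lightning Trainer and `model` a LightningModule. The only Lightning
    call made here is trainer.fit(model, ckpt_path=...).
    """
    ckpt = Path(ckpt_path)
    # Fail early with a clear message instead of deep inside the trainer.
    if not ckpt.is_file():
        raise FileNotFoundError(f"Checkpoint file not found: {ckpt}")
    # The checkpoint path is an argument of fit(), not of Trainer(...).
    trainer.fit(model, ckpt_path=str(ckpt), **fit_kwargs)


# Typical use (assumes `lightning` is installed and MyModel is your own
# LightningModule; both names below are placeholders):
#
#   import lightning as L
#   trainer = L.Trainer(max_epochs=20)
#   resume_fit(trainer, MyModel(), "path/to/your/checkpoint.ckpt")
```

Without the wrapper, the equivalent call is simply `trainer.fit(model, ckpt_path="path/to/your/checkpoint.ckpt")`.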
I wanted this to get visibility because knowing how to resume training from checkpoints is essential, and there's a lot of wrong information out there!
cc @borda