Open peregilk opened 2 months ago
Thanks for reporting! A patch is in the works in https://github.com/google/maxtext/pull/895.
As an immediate workaround, you can enable async checkpointing with the config `async_checkpointing=true`, which initializes the JAX distributed client.
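For anyone unsure where that setting goes, here is a minimal sketch of passing it as a command-line override. It assumes the usual MaxText invocation of `train.py` with `base.yml` followed by `key=value` pairs. The `run_name` and `base_output_directory` values are placeholders that do not come from this thread, and the script path may differ depending on your checkout.

```shell
# Sketch only: my_run and gs://my-bucket/output are placeholder values,
# and the paths assume you are in the root of a MaxText checkout.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=my_run \
  base_output_directory=gs://my-bucket/output \
  async_checkpointing=true
```

The same key can presumably be set in the YAML config file instead, if you prefer not to pass it on every run.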
Awesome. Thanks.
I am suddenly seeing crashes after saving checkpoints. This is with code that ran perfectly earlier, but it started after a system reinstall. I wonder if anyone has seen the same issue.
The checkpoints are saved successfully. However, training does not recover afterwards and crashes with this error: