Flytekit checkpoint improvement - PyTorch

flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

Apache License 2.0

5.17k stars, 550 forks

To improve checkpointing performance in Flytekit for PyTorch, asynchronous checkpointing as described in the PyTorch blog is a viable approach. It shortens the time training is paused for a checkpoint by moving the final step of the checkpointing process off the critical path and onto CPU threads, so GPU training can continue while the checkpoint is finalized.
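As a rough illustration of the pattern only (not Flytekit's or PyTorch's actual API), the sketch below uses just the Python standard library. The `AsyncCheckpointer` class and `run_demo` function are hypothetical names. `copy.deepcopy` stands in for the GPU-to-CPU copy of the model state, and `pickle` stands in for the real serializer. The caller blocks only for the in-memory snapshot; the write to disk happens on a background thread.

```python
import copy
import os
import pickle
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


class AsyncCheckpointer:
    """Hypothetical helper: snapshot on the caller's thread, persist on a worker thread."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        # A single worker keeps checkpoint writes in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def save(self, state: dict, step: int) -> Future:
        # Wait for the previous write so at most one snapshot is in flight.
        if self._pending is not None:
            self._pending.result()
        # Blocking part: take a private copy so training can keep mutating `state`.
        # (Assumption: in real training this would be the device-to-host copy.)
        snapshot = copy.deepcopy(state)
        path = os.path.join(self.directory, f"step_{step}.pkl")
        # Non-blocking part: serialization and disk I/O run off the critical path.
        self._pending = self._pool.submit(self._write, snapshot, path)
        return self._pending

    @staticmethod
    def _write(snapshot: dict, path: str) -> str:
        # Write to a temporary file, then rename, so readers never see a partial file.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)
        return path

    def close(self) -> None:
        # Block until any outstanding write has finished.
        self._pool.shutdown(wait=True)


def run_demo() -> dict:
    """Toy training loop: checkpoint at step 2, keep training, then reload the file."""
    state = {"step": 0, "weights": [0.0, 0.0]}
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = AsyncCheckpointer(tmp)
        future = None
        for step in range(1, 4):
            state["step"] = step
            state["weights"] = [w + 1.0 for w in state["weights"]]
            if step == 2:
                future = ckpt.save(state, step)  # returns right after the snapshot
        ckpt.close()
        with open(future.result(), "rb") as f:
            return pickle.load(f)
```

The reloaded checkpoint reflects the state at the moment `save` was called, even though the loop kept updating `state` afterwards. That isolation is what makes it safe to let the write finish in the background. For real PyTorch training, `torch.distributed.checkpoint` offers an `async_save` function in recent releases that plays a similar role; how that would plug into Flytekit's checkpointing is left open here.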

Would you like more details on how to integrate this into Flytekit?

Sources:

PyTorch Blog on Reducing Checkpointing Times

This is an AI-generated response.


Flytekit checkpoint improvement - PyTorch #5488

Motivation: Why do you think this is important?

Goal: What should the final outcome look like, ideally?

Describe alternatives you've considered

Propose: Link/Inline OR Additional context
