aws-samples / amazon-eks-machine-learning-with-terraform-and-kubeflow

Distributed training using Kubeflow on Amazon EKS
Apache License 2.0

neuronx-nemo-megatron examples need checkpointing enabled #94

Closed: ajayvohra2005 closed this issue 6 months ago

ajayvohra2005 commented 6 months ago

The neuronx-nemo-megatron examples currently have checkpointing effectively disabled by default, and they do not load an existing checkpoint when one is present.

To be consistent with the neuronx-distributed examples, checkpointing should be enabled every 100 steps, and training should resume from the latest checkpoint if one exists.
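For illustration only, here is a minimal, framework-independent sketch of the requested behavior: save a checkpoint every 100 steps and resume from the newest one on restart. It is not the neuronx-nemo-megatron or neuronx-distributed code. The names `train`, `latest_checkpoint`, and `CKPT_EVERY_N_STEPS`, and the `step_<N>.json` file layout, are hypothetical and chosen just for this example.

```python
import json
import re
import tempfile
from pathlib import Path

CKPT_EVERY_N_STEPS = 100  # interval requested in this issue (hypothetical constant name)
_CKPT_PATTERN = re.compile(r"^step_(\d+)\.json$")  # hypothetical file naming scheme


def latest_checkpoint(ckpt_dir):
    """Return (step, path) for the newest checkpoint, or None if none exist."""
    found = []
    for path in Path(ckpt_dir).glob("step_*.json"):
        match = _CKPT_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None


def train(ckpt_dir, max_steps, every_n_steps=CKPT_EVERY_N_STEPS):
    """Toy loop: resume from the newest checkpoint, then save every N steps."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    # Load existing state if a checkpoint is present; otherwise start at step 0.
    state = {"step": 0, "loss": None}
    resumed = latest_checkpoint(ckpt_dir)
    if resumed is not None:
        state = json.loads(resumed[1].read_text())

    saved_steps = []
    for step in range(state["step"] + 1, max_steps + 1):
        # Stand-in for a real forward/backward/optimizer step.
        state = {"step": step, "loss": 1.0 / step}
        if step % every_n_steps == 0:
            (ckpt_dir / f"step_{step}.json").write_text(json.dumps(state))
            saved_steps.append(step)
    return state["step"], saved_steps


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(train(tmp, 250))  # first run, interrupted at step 250
        print(train(tmp, 400))  # second run resumes from the step-200 checkpoint
```

In the actual examples the equivalent change would presumably live in the training configuration rather than in a hand-written loop. The sketch only shows the two pieces the issue asks for: a periodic save and a load-if-present on startup.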