ajayvohra2005 closed 6 months ago
The neuron-nemo-megatron examples currently have checkpointing effectively disabled by default, and they do not load existing checkpoints when any are present.
To be consistent with the neuronx-distributed examples, they need to save a checkpoint every 100 steps.
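As a rough illustration of the requested behavior (save every 100 steps, resume from the newest checkpoint if one exists), here is a minimal framework-agnostic sketch. It does not use the neuron-nemo-megatron or neuronx-distributed APIs. The names `train`, `latest_checkpoint`, the `step_<N>.json` file naming, and the JSON state format are all hypothetical, chosen only for this example.

```python
import json
import os
import re

# Interval taken from the issue text; everything else here is illustrative.
CHECKPOINT_EVERY = 100


def latest_checkpoint(ckpt_dir):
    """Return (step, path) for the newest checkpoint, or None if none exist."""
    pattern = re.compile(r"^step_(\d+)\.json$")  # hypothetical naming scheme
    found = []
    for name in os.listdir(ckpt_dir):
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(ckpt_dir, name)))
    return max(found) if found else None


def train(ckpt_dir, max_steps):
    """Run a dummy training loop; return (step resumed from, final step)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    state = {"step": 0, "loss": None}

    # Resume from an existing checkpoint, if any.
    resumed = latest_checkpoint(ckpt_dir)
    if resumed is not None:
        with open(resumed[1]) as f:
            state = json.load(f)
    start_step = state["step"]

    for step in range(start_step + 1, max_steps + 1):
        state["step"] = step
        state["loss"] = 1.0 / step  # placeholder for a real training step

        # Save a checkpoint every CHECKPOINT_EVERY steps.
        if step % CHECKPOINT_EVERY == 0:
            path = os.path.join(ckpt_dir, f"step_{step}.json")
            with open(path, "w") as f:
                json.dump(state, f)

    return start_step, state["step"]
```

Under these assumptions, a first call such as `train("ckpts", 250)` starts from step 0 and writes checkpoints at steps 100 and 200. A second call with a larger `max_steps` resumes from step 200 rather than restarting. In the actual examples the equivalent behavior would presumably be set through their own configuration rather than hand-written code like this.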