aws-samples / amazon-eks-machine-learning-with-terraform-and-kubeflow

Distributed training using Kubeflow on Amazon EKS
Apache License 2.0

Neuronx distributed Llama2 examples do not load latest checkpoint if it exists #97

Closed: ajayvohra2005 closed this issue 4 months ago

ajayvohra2005 commented 4 months ago

By default, the Neuronx distributed examples should load the latest checkpoint if one exists. This applies to the Llama2-7b, Llama2-13b, and Llama2-70b examples.
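To illustrate the requested behavior, here is a minimal sketch of how a training script might locate the latest checkpoint at startup. It is not the code from the examples in this repo. The function name `find_latest_checkpoint` and the `step_<N>` directory naming are assumptions made for illustration only; the actual checkpoint layout used by the Llama2 examples may differ.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: each checkpoint is a subdirectory named "step_<N>".
# This naming is hypothetical and only serves the illustration.
_STEP_DIR = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint subdirectory with the highest step number.

    Returns None when the directory is missing or holds no checkpoints,
    so the caller can fall back to training from scratch.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None

    latest_step = -1
    latest_path = None
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        if not (child.is_dir() and match):
            continue  # skip files and unrelated directories
        step = int(match.group(1))  # compare numerically, not as strings
        if step > latest_step:
            latest_step, latest_path = step, child
    return latest_path
```

A script could call this once before training starts and resume from the returned path when it is not `None`. Comparing step numbers as integers matters here: a plain string sort would rank `step_20` above `step_100`.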