aws-neuron / neuronx-nemo-megatron


The llama2 example does not restart from the last checkpoint by default #29

Open harishvs opened 4 months ago

harishvs commented 4 months ago

By default the llama2 example in nemo/examples/nlp/language_modeling/test_llama.sh does not resume from the last checkpoint.

Can we make it resume by default? That would be a better user experience.

mrnikwaws commented 3 months ago

Thanks @harishvs, I have passed this request to the NeMo maintainers, and they are looking into it.

aws-rhsoln commented 2 months ago

Hello @harishvs, this issue should now be resolved in the latest 2.19 release. Please try it with the latest neuronx-nemo-megatron and let us know if it works for you. Thanks!