etsurin closed this issue 10 months ago
Hi @etsurin,
By default model checkpoint saving should be relative to your working directory for training. For example if you are using megatron llama under nemo_experiments, and AWS Parallel Cluster(which runs slurm) you should see checkpoint files under: nemo_experiments/megatron_llama/$SLURM_JOB_ID/checkpoints
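To make the layout concrete, here is a small sketch for locating the newest checkpoint a run has written. The `latest_ckpt` helper and the `megatron_llama` experiment name are assumptions based on the default layout described above; adjust them if your experiment name or base directory differs.

```shell
# Hypothetical helper: print the most recently written .ckpt file for a job.
# Assumes the default nemo_experiments/<experiment>/<job_id>/checkpoints layout.
latest_ckpt() {
  local base="$1" job_id="$2"
  # Newest first; suppress the error if no checkpoints exist yet.
  ls -t "${base}/megatron_llama/${job_id}/checkpoints"/*.ckpt 2>/dev/null | head -n 1
}

# Usage (inside or after a Slurm job):
#   latest_ckpt nemo_experiments "$SLURM_JOB_ID"
```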
In terms of initializing the parameters: this is a pretraining script, so it normally starts from random initialization. To resume training after a crash, modify the script like this:
```
# Note: to resume training using a checkpoint, please add the following configuration above, adjusting for your checkpoint path
# model.use_cpu_initialization=False \
# +model.load_xser=True \
# model.resume_from_checkpoint='/efs/checkpoint/megatron_gpt--step\=1085-consumed_samples\=69632.0-last.ckpt' \
```
These lines appear, commented out, at the end of the script file you linked. You can also add code to your scripts to detect failures and act accordingly.
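As a sketch of that failure-handling idea, the snippet below builds the resume overrides automatically when a `-last.ckpt` file exists, so a restarted job picks up where the previous one stopped. The `build_resume_args` helper is hypothetical; the flag names mirror the commented example above, but verify them against your version of neuronx-nemo-megatron.

```shell
# Hypothetical helper: emit resume overrides if a last checkpoint exists,
# otherwise emit nothing (so a fresh run starts from random initialization).
build_resume_args() {
  local ckpt_dir="$1" last
  last=$(ls -t "${ckpt_dir}"/*-last.ckpt 2>/dev/null | head -n 1)
  if [ -n "$last" ]; then
    printf '%s' "model.use_cpu_initialization=False +model.load_xser=True model.resume_from_checkpoint=${last}"
  fi
}

# Usage in a launch script (training command itself elided):
#   RESUME_ARGS=$(build_resume_args "$CKPT_DIR")
#   ... $RESUME_ARGS
```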
If you want to start from your own pretrained model, you should be able to modify the training script to load that model in training mode and begin training from its weights rather than from random initialization.
Model tuning (where you start with a pretrained model and tune part of the model, all of the model, or an attached adapter) is a little different. Here is some (not yet merged) code if you need some hints for that: https://github.com/aws-neuron/aws-neuron-parallelcluster-samples-staging/pull/20/files. This will appear soon in the tutorial content.
Since we haven't heard back from you, we'll assume we've answered your questions and will close this issue. Please feel free to reopen this issue or create a new ticket if you're still encountering any issues running your model.
I am running a pretraining job following https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md and the script runs smoothly. However, I don't know where the trained model is saved. I have set
in https://github.com/aws-neuron/neuronx-nemo-megatron/blob/da1fb6643838e01c9110723bb4190081b4a249b0/nemo/examples/nlp/language_modeling/test_llama.sh but I still cannot find the trained model anywhere.
What's more, I am not sure whether the script loads the parameters from the model.tokenizer.type folder to initialize the LLaMA model.
It seems like the example script uses randomly initialized parameters for training, so what should I do if I want to initialize the model with pretrained parameters? Is the parameter file restricted to the .ckpt format?