etsurin closed this issue 10 months ago
Hi @etsurin,
By default model checkpoint saving should be relative to your working directory for training. For example if you are using megatron llama under nemo_experiments, and AWS Parallel Cluster(which runs slurm) you should see checkpoint files under: nemo_experiments/megatron_llama/$SLURM_JOB_ID/checkpoints
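To make the layout concrete, here is a small sketch for locating the newest checkpoint a run has written. The `latest_ckpt` helper and the `megatron_llama` experiment name are assumptions based on the default layout described above; adjust them if your experiment name or base directory differs.

```shell
# Hypothetical helper: print the most recently written .ckpt file for a job.
# Assumes the default nemo_experiments/<experiment>/<job_id>/checkpoints layout.
latest_ckpt() {
  local base="$1" job_id="$2"
  # Newest first; suppress the error if no checkpoints exist yet.
  ls -t "${base}/megatron_llama/${job_id}/checkpoints"/*.ckpt 2>/dev/null | head -n 1
}

# Usage (inside or after a Slurm job):
#   latest_ckpt nemo_experiments "$SLURM_JOB_ID"
```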
In terms of initializing the parameters: this is a pretraining script, so it normally starts from random initialization. To resume training after a crash, modify the script like this:
```
# Note: to resume training using a checkpoint, please add the following configuration above, adjusting for your checkpoint path
# model.use_cpu_initialization=False \
# +model.load_xser=True \
# model.resume_from_checkpoint='/efs/checkpoint/megatron_gpt--step\=1085-consumed_samples\=69632.0-last.ckpt' \
```
These lines appear, commented out, at the end of the script file you linked. You can also add code to your scripts to detect failures and act accordingly.
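As a sketch of that failure-handling idea, the snippet below builds the resume overrides automatically when a `-last.ckpt` file exists, so a restarted job picks up where the previous one stopped. The `build_resume_args` helper is hypothetical; the flag names mirror the commented example above, but verify them against your version of neuronx-nemo-megatron.

```shell
# Hypothetical helper: emit resume overrides if a last checkpoint exists,
# otherwise emit nothing (so a fresh run starts from random initialization).
build_resume_args() {
  local ckpt_dir="$1" last
  last=$(ls -t "${ckpt_dir}"/*-last.ckpt 2>/dev/null | head -n 1)
  if [ -n "$last" ]; then
    printf '%s' "model.use_cpu_initialization=False +model.load_xser=True model.resume_from_checkpoint=${last}"
  fi
}

# Usage in a launch script (training command itself elided):
#   RESUME_ARGS=$(build_resume_args "$CKPT_DIR")
#   ... $RESUME_ARGS
```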
If you want to start from your own pretrained model, you should be able to modify the training script to load that model in training mode and begin training from its weights rather than from random initialization.
Model tuning (where you start with a pretrained model and tune part of the model, all of the model, or an attached adapter) is a little different. Here is some (not yet merged) code if you need some hints for that: https://github.com/aws-neuron/aws-neuron-parallelcluster-samples-staging/pull/20/files. This will appear soon in the tutorial content.
Since we haven't heard back from you, we'll assume we've answered your questions and will close this issue. Please feel free to reopen this issue or create a new ticket if you're still encountering any issues running your model.
I am running a pretraining job following https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md and the script runs smoothly. However, I don't know where the trained model is saved. I have set
in https://github.com/aws-neuron/neuronx-nemo-megatron/blob/da1fb6643838e01c9110723bb4190081b4a249b0/nemo/examples/nlp/language_modeling/test_llama.sh but I still cannot find the trained model anywhere.
What's more, I am not sure whether the script loads the parameters from the model.tokenizer.type folder to initialize the LLaMA model.
It seems like the example script uses randomly initialized parameters for training, so what should I do if I want to initialize the model with pretrained parameters? Is the parameter file restricted to the .ckpt format?