In the model_config.yaml of the .nemo checkpoint downloaded from Hugging Face, tensor_model_parallel_size is set to 4:

tensor_model_parallel_size: 4

If you untar the .nemo checkpoint, change tensor_model_parallel_size to 1, and then retar it, the merge_lora_weights/merge.py script works on a single GPU. The script appears to take its parallelism settings from the config embedded in the checkpoint rather than from the command line, which is why passing TP=1 as an argument is not enough on its own. Step by step:
Untar the .nemo checkpoint
tar -xf minitron-4b-base.nemo
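The archive has no top-level directory, so this dumps the checkpoint contents into the current working directory. If you want to keep them contained, you can instead extract into a fresh directory (the directory name is arbitrary):

mkdir extracted
tar -xf minitron-4b-base.nemo -C extracted
cd extracted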
Move the original checkpoint out of the way, both so you don't have to redownload it if something goes wrong and so it isn't clobbered when you retar later
mv minitron-4b-base.nemo ../
Modify tensor_model_parallel_size
vim model_config.yaml
mcore_gpt: true
micro_batch_size: 4
global_batch_size: 1152
-- tensor_model_parallel_size: 4
++ tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
virtual_pipeline_model_parallel_size: null
encoder_seq_length: 4096
max_position_embeddings: 4096
num_layers: 32
hidden_size: 3072
...
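If you would rather not edit the file interactively, a sed one-liner makes the same change (this assumes the key appears exactly once at the top level of model_config.yaml, which is worth verifying first):

sed -i 's/^tensor_model_parallel_size: 4$/tensor_model_parallel_size: 1/' model_config.yaml
grep tensor_model_parallel_size model_config.yaml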
Retar the .nemo checkpoint from inside the directory containing the extracted files
tar -cvf minitron-4b-base.nemo *
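As a sanity check, you can list the new archive to confirm model_config.yaml sits at its root and read back the edited value (GNU tar; -O extracts a member to stdout):

tar -tf minitron-4b-base.nemo | head
tar -xOf minitron-4b-base.nemo model_config.yaml | grep tensor_model_parallel_size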
Describe the bug
When running merge_lora_weights/merge.py with TP and PP set to 1 on a fine-tuned Minitron checkpoint, I run into the following error. The world size should be 1, because the node I'm using only has a single A100 GPU, so it is unclear why the script is trying to split the model 4 ways.

Link to parallel_state.py where the error is raised:
https://github.com/NVIDIA/Megatron-LM/blob/73e7b58e79df9da521ff31d74053579b7a060c7e/megatron/core/parallel_state.py#L531

Full Traceback
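For reference, the tensor_model_parallel_size baked into the checkpoint can be inspected without untarring it (GNU tar; the member path inside the archive may vary, hence the wildcard):

tar -xOf minitron-4b-base.nemo --wildcards '*model_config.yaml' | grep tensor_model_parallel_size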
Steps/Code to reproduce bug
Below is the shell script I'm running to merge the LoRA weights.
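(For reference, a minimal invocation in the style of the NeMo LoRA docs looks roughly like the following; the paths are placeholders and the exact argument names can differ between NeMo versions, so treat this as illustrative rather than my exact script.)

python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
    trainer.accelerator=gpu \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    gpt_model_file=minitron-4b-base.nemo \
    lora_model_path=minitron_lora_checkpoint.nemo \
    merged_model_path=minitron-4b-merged.nemo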
Here is the script to fine-tune the model:
Environment overview
I built my NeMo container based on the dev tag and then added the lm-evaluation-harness.
nemo_eval.def
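(The definition is along these lines; a minimal sketch, with the base image tag and the lm-evaluation-harness install method as assumptions rather than the exact file contents.)

Bootstrap: docker
From: nvcr.io/nvidia/nemo:dev

%post
    git clone https://github.com/EleutherAI/lm-evaluation-harness /opt/lm-evaluation-harness
    pip install -e /opt/lm-evaluation-harness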
The commands to build the container:
apptainer build nemo_eval.sif nemo_eval.def
Additional context