NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Unable to merge lora weights: "world_size (1) is not divisible by 4" #10782

Open Elan456 opened 2 weeks ago

Elan456 commented 2 weeks ago

Describe the bug

When running merge_lora_weights/merge.py with TP and PP set to 1 on a fine-tuned Minitron checkpoint, I run into the following error:

raise RuntimeError(f"world_size ({world_size}) is not divisible by {total_model_size}")
[rank0]: RuntimeError: world_size (1) is not divisible by 4

The world size should be 1 because the node I'm using has only a single A100 GPU; it is unclear why the script is trying to split the model across 4 ranks.

Link to parallel_state.py where the error is raised: https://github.com/NVIDIA/Megatron-LM/blob/73e7b58e79df9da521ff31d74053579b7a060c7e/megatron/core/parallel_state.py#L531
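
For reference, the check that fires is a plain divisibility test: Megatron multiplies the model-parallel sizes together and requires the world size to be a multiple of that product. A rough sketch of the arithmetic with this run's numbers (the TP value of 4 is not something I passed on the command line; see the fix in the comment below):

# Sketch of the divisibility check behind the error, not the actual Megatron code.
WORLD_SIZE=1                      # one rank: single A100, trainer.devices=1, trainer.num_nodes=1
TP=4                              # picked up from somewhere other than the command-line override
PP=1
TOTAL_MODEL_SIZE=$((TP * PP))     # context parallelism (1 here) would also factor into this product
if (( WORLD_SIZE % TOTAL_MODEL_SIZE != 0 )); then
    echo "world_size (${WORLD_SIZE}) is not divisible by ${TOTAL_MODEL_SIZE}"
fi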

Full Traceback

(base) [ema8@node0414 syn_data_PEFT_exp]$ sh merge_lora_weights.sh 
15:4: not a valid test operator:  
15:4: not a valid test operator: 12.5
21:4: not a valid test operator: (
21:4: not a valid test operator: 550.54.15
rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
[NeMo W 2024-10-07 10:34:46 nemo_logging:349] /opt/megatron-lm/megatron/core/tensor_parallel/layers.py:289: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(ctx, input, weight, bias, allreduce_dgrad):

[NeMo W 2024-10-07 10:34:46 nemo_logging:349] /opt/megatron-lm/megatron/core/tensor_parallel/layers.py:300: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):

[NeMo W 2024-10-07 10:34:46 nemo_logging:349] /opt/megatron-lm/megatron/core/tensor_parallel/layers.py:392: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(

[NeMo W 2024-10-07 10:34:46 nemo_logging:349] /opt/megatron-lm/megatron/core/tensor_parallel/layers.py:432: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):

[NeMo W 2024-10-07 10:34:47 nemo_logging:349] /opt/megatron-lm/megatron/core/dist_checkpointing/strategies/torch.py:17: DeprecationWarning: `torch.distributed._sharded_tensor` will be deprecated, use `torch.distributed._shard.sharded_tensor` instead
      from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor

[NeMo W 2024-10-07 10:34:48 nemo_logging:349] /opt/megatron-lm/megatron/core/transformer/attention.py:29: DeprecationWarning: The 'megatron.core.transformer.custom_layers.transformer_engine' 
        module is deprecated and will be removed in 0.10.0. Please use 
        'megatron.core.extensions.transformer_engine' instead.
      from megatron.core.transformer.custom_layers.transformer_engine import SplitAlongDim

[NeMo W 2024-10-07 10:34:49 nemo_logging:349] /opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_prompt_learning_model.py:181: DeprecationWarning: invalid escape sequence '\{'
      "prompt_template_fields": re.findall("\{(.*?)\}", task.prompt_template),

[NeMo W 2024-10-07 10:34:49 nemo_logging:349] /opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py:389: DeprecationWarning: invalid escape sequence '\.'
      return re.fullmatch("[0-9][0-9]\.[0-9][0-9].*", nvidia_torch_version)  # "YY.MM.*"

[NeMo W 2024-10-07 10:34:49 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/tensor_quant.py:168: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
      quantize_op_abstract = torch.library.impl_abstract("tensorrt::quantize_op")(

[NeMo W 2024-10-07 10:34:50 nemo_logging:349] /opt/NeMo/nemo/collections/nlp/modules/common/megatron/vocab_parallel_cross_entropy.py:88: DeprecationWarning: invalid escape sequence '\s'
      """

[NeMo W 2024-10-07 10:34:51 nemo_logging:349] /opt/NeMo/nemo/collections/asr/parts/utils/wfst_utils.py:1328: DeprecationWarning: invalid escape sequence '\d'
      width, height = re.findall('\d+', line)

[NeMo W 2024-10-07 10:34:51 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
      cm = get_cmap("Set1")

[NeMo W 2024-10-07 10:34:52 nemo_logging:349] /opt/NeMo/nemo/collections/asr/modules/rnnt.py:1550: DeprecationWarning: invalid escape sequence '\*'
      """

[NeMo W 2024-10-07 10:34:52 nemo_logging:349] /opt/NeMo/nemo/collections/common/data/lhotse/nemo_adapters.py:198: DeprecationWarning: invalid escape sequence '\d'
      """

[NeMo W 2024-10-07 10:34:52 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/nvidia/dali/_autograph/pyct/gast_util.py:79: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
      if get_gast_version() < LooseVersion("0.5"):

[NeMo W 2024-10-07 10:34:52 nemo_logging:349] /home/ema8/.local/lib/python3.10/site-packages/setuptools/_distutils/version.py:337: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
      other = LooseVersion(other)

[NeMo W 2024-10-07 10:34:53 nemo_logging:349] /opt/NeMo/nemo/collections/asr/parts/utils/vad_utils.py:1082: DeprecationWarning: invalid escape sequence '\s'
      data = pd.read_csv(path2ground_truth_label, sep="\s+", delimiter=None, header=None)

[NeMo W 2024-10-07 10:34:53 nemo_logging:349] /opt/NeMo/nemo/collections/asr/parts/utils/asr_batching.py:39: DeprecationWarning: invalid escape sequence '\m'
      """

[NeMo W 2024-10-07 10:34:53 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo W 2024-10-07 10:34:53 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:571: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!

[node0414.palmetto.clemson.edu:3567856] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
Using 16bit Automatic Mixed Precision (AMP)
[NeMo W 2024-10-07 10:34:54 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/amp.py:53: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: deterministic_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cross_entropy_loss_fusion in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_disable_qkv in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_disable_fc1 in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: overlap_p2p_comm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: batch_p2p_comm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: wgrad_deferral_limit in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-10-07 10:35:28 megatron_init:314] Rank 0 has data parallel group : [0]
[NeMo I 2024-10-07 10:35:28 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-10-07 10:35:28 megatron_init:325] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-10-07 10:35:28 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-10-07 10:35:28 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-10-07 10:35:28 megatron_init:339] All context parallel group ranks: [[0]]
[NeMo I 2024-10-07 10:35:28 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-10-07 10:35:28 megatron_init:347] Rank 0 has model parallel group: [0]
[NeMo I 2024-10-07 10:35:28 megatron_init:348] All model parallel group ranks: [[0]]
[NeMo I 2024-10-07 10:35:28 megatron_init:357] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-10-07 10:35:28 megatron_init:361] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-10-07 10:35:28 megatron_init:362] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-10-07 10:35:28 megatron_init:382] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-10-07 10:35:28 megatron_init:394] Rank 0 has embedding group: [0]
[NeMo I 2024-10-07 10:35:28 megatron_init:400] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-10-07 10:35:28 megatron_init:401] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-10-07 10:35:28 megatron_init:402] All embedding group ranks: [[0]]
[NeMo I 2024-10-07 10:35:28 megatron_init:403] Rank 0 has embedding rank: 0
setting number of microbatches to constant 288
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: deterministic_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cross_entropy_loss_fusion in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_disable_qkv in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_disable_fc1 in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: overlap_p2p_comm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: batch_p2p_comm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: wgrad_deferral_limit in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-10-07 10:35:28 tokenizer_utils:196] Getting SentencePiece with model: /local_scratch/slurm.777326/tmpkg5ll76b/b1bc02bf987043f3884c39152f183238_nemotron_2_256k.model
[NeMo I 2024-10-07 10:35:28 megatron_base_model:604] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: deterministic_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cross_entropy_loss_fusion in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_disable_qkv in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: tp_comm_overlap_disable_fc1 in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: overlap_p2p_comm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: batch_p2p_comm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: wgrad_deferral_limit in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:1189] The model: MegatronGPTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: first_pipeline_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: last_pipeline_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: activation_func_fp8_input_store in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: qk_layernorm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: test_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: calculate_per_token_loss in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: multi_latent_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: fp8_dot_product_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: fp8_multi_head_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_shared_expert_intermediate_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_shared_expert_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_router_load_balancing_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_router_topk in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_router_pre_softmax in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_grouped_gemm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_aux_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_z_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_input_jitter_eps in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_token_dropping in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_token_dispatcher_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_per_layer_logging in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_expert_capacity_factor in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_pad_expert_input_to_capacity in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_token_drop_policy in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: moe_layer_recompute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: disable_parameter_transpose_cache in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: enable_cuda_graph in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: external_cuda_graph in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_base_model:577] The model: MegatronGPTModel() does not have field.name: config_logger_dir in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-07 10:35:28 megatron_gpt_model:372] megatron_amp_O2 is enabled but transformer-engine is not.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Error executing job with overrides: ['gpt_model_file=/scratch/ema8/Minitron-4B-Base/nemo/minitron-4b-base.nemo', 'lora_model_path=/scratch/ema8/PEFT/results/minitron-4b-base/peft_1000_minitron-4b-base/checkpoints/megatron_gpt_peft_lora_tuning.nemo', 'merged_model_path=/scratch/ema8/PEFT/results/minitron-4b-base/peft_1000_minitron-4b-base_merged.nemo', 'tensor_model_parallel_size=1', 'pipeline_model_parallel_size=1', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.accelerator=gpu']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py", line 307, in <module>
[rank0]:     main()  # noqa pylint: disable=no-value-for-parameter
[rank0]:   File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank0]:     _run_hydra(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]:     _run_app(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]:     run_and_report(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]:     raise ex
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]:     return func()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank0]:     lambda: hydra.run(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]:     _ = ret.return_value
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
[rank0]:     raise self._return_value
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
[rank0]:     ret.return_value = task_function(task_cfg)
[rank0]:   File "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py", line 241, in main
[rank0]:     model = MegatronGPTModel.restore_from(
[rank0]:   File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 478, in restore_from
[rank0]:     return super().restore_from(
[rank0]:   File "/opt/NeMo/nemo/core/classes/modelPT.py", line 468, in restore_from
[rank0]:     instance = cls._save_restore_connector.restore_from(
[rank0]:   File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1322, in restore_from
[rank0]:     trainer.strategy.setup_environment()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 154, in setup_environment
[rank0]:     self.setup_distributed()
[rank0]:   File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 251, in setup_distributed
[rank0]:     init_model_parallel(
[rank0]:   File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 155, in init_model_parallel
[rank0]:     parallel_state.initialize_model_parallel(
[rank0]:   File "/opt/megatron-lm/megatron/core/parallel_state.py", line 532, in initialize_model_parallel
[rank0]:     raise RuntimeError(f"world_size ({world_size}) is not divisible by {total_model_size}")
[rank0]: RuntimeError: world_size (1) is not divisible by 4

Steps/Code to reproduce bug

Below is the shell script I'm running to merge the LoRA weights.

PATH_TO_MERGED_MODEL="/scratch/ema8/PEFT/results/minitron-4b-base/peft_1000_minitron-4b-base_merged.nemo"

MODEL="/scratch/ema8/Minitron-4B-Base/nemo/minitron-4b-base.nemo"

PATH_TO_TRAINED_MODEL="/scratch/ema8/PEFT/results/minitron-4b-base/peft_1000_minitron-4b-base/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

export HYDRA_FULL_ERROR=1

srun singularity exec --nv /home/ema8/f24-nvidia/nemo_eval.sif python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
    gpt_model_file=${MODEL} \
    lora_model_path=${PATH_TO_TRAINED_MODEL} \
    merged_model_path=${PATH_TO_MERGED_MODEL} \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    trainer.num_nodes=1 \
    trainer.devices=1 \
    trainer.accelerator=gpu

Here is the script to fine-tune the model:

#!/bin/bash

# Script taken from: https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/nemoframeworkpeft.html#nemo-framework-peft-playbook

# This is the nemo model we are finetuning
# Change this to match the model you want to finetune
MODEL="/scratch/ema8/Minitron-4B-Base/nemo/minitron-4b-base.nemo"

# These are the training datasets (in our case we only have one)
TRAIN_DS="[/home/ema8/f24-nvidia/syn_data_PEFT_exp/high_school_cs_dataset/train-1000.jsonl]"

# These are the validation datasets (in our case we only have one)
VALID_DS="[/home/ema8/f24-nvidia/syn_data_PEFT_exp/high_school_cs_dataset/val.jsonl]"

# These are the test datasets (in our case we only have one)
TEST_DS="[/home/ema8/f24-nvidia/syn_data_PEFT_exp/high_school_cs_dataset/test.jsonl]"

# These are the names of the test datasets
TEST_NAMES="[high_school_cs_dataset]"

# This is the PEFT scheme that we will be using. Set to "ptuning" for P-Tuning instead of LoRA
PEFT_SCHEME="lora"

# This is the concat sampling probability. This depends on the number of files being passed in the train set
# and the sampling probability for each file. In our case, we have one training file. Note sum of concat sampling
# probabilities should be 1.0. For example, with two entries in TRAIN_DS, CONCAT_SAMPLING_PROBS might be
# "[0.3,0.7]". For three entries, CONCAT_SAMPLING_PROBS might be "[0.3,0.1,0.6]"
# NOTE: Your entry must contain a value greater than 0.0 for each file
CONCAT_SAMPLING_PROBS="[1.0]"

# This is the tensor parallel size (splitting tensors among GPUs horizontally)
# See above matrix for proper value for the given model size
TP_SIZE=1

# This is the pipeline parallel size (splitting layers among GPUs vertically)
# See above matrix for proper value for the given model size
PP_SIZE=1

# The number of nodes to run this on
# See above matrix for proper value for the given model size
NODE_COUNT=1

# The number of total GPUs used
GPU_COUNT=1

# Where to store the finetuned model and training artifacts
OUTPUT_DIR="/scratch/ema8/PEFT/results/minitron-4b-base/peft_1000_200steps_minitron-4b-base"

# Run the PEFT command by appropriately setting the values for the parameters such as the number of steps,
# model checkpoint path, batch sizes etc. For a full reference of parameter
# settings refer to the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.log_every_n_steps=1 \
    trainer.precision=bf16 \
    trainer.devices=${GPU_COUNT} \
    trainer.num_nodes=1 \
    trainer.val_check_interval=5 \
    trainer.max_steps=200 \
    model.restore_from_path=${MODEL} \
    model.peft.peft_scheme=${PEFT_SCHEME} \
    model.peft.lora_tuning.target_modules=[attention_qkv] \
    +model.tp_comm_overlap_disable_qkv=True \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.megatron_amp_O2=True \
    model.activations_checkpoint_granularity=selective \
    model.activations_checkpoint_num_layers=null \
    model.activations_checkpoint_method=uniform \
    model.optim.name=fused_adam \
    model.optim.lr=1e-4 \
    model.answer_only_loss=True \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=4096 \
    model.data.validation_ds.max_seq_length=4096 \
    model.data.train_ds.micro_batch_size=1 \
    model.data.train_ds.global_batch_size=128 \
    model.data.validation_ds.micro_batch_size=1 \
    model.data.validation_ds.global_batch_size=128 \
    model.data.train_ds.num_workers=0 \
    model.data.validation_ds.num_workers=0 \
    model.data.test_ds.num_workers=0 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.checkpoint_callback_params.mode=min \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    ++exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    model.save_nemo_on_validation_end=False

Environment overview

I built my NeMo container from the dev tag and then added the lm-evaluation-harness.

nemo_eval.def

Bootstrap: docker
From: nvcr.io/nvidia/nemo:dev

%post
    git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
    cd lm-evaluation-harness
    pip install -e .
    cd ..

The command to build the container: apptainer build nemo_eval.sif nemo_eval.def
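
A quick, purely illustrative sanity check that the harness is importable inside the built image (adjust the image path as needed):

singularity exec nemo_eval.sif python -c "import lm_eval; print('lm-evaluation-harness import OK')"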

Additional context

Elan456 commented 2 weeks ago

Fix

In the model_config.yaml of the .nemo checkpoint downloaded from Hugging Face, the tensor_model_parallel_size is set to 4:

tensor_model_parallel_size: 4

If you untar the .nemo checkpoint, change tensor_model_parallel_size to 1, and retar it, the merge_lora_weights/merge.py script works on a single GPU.
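
You can confirm this without fully untarring the checkpoint; a .nemo file is just a tar archive, so something like the following should work (the --wildcards pattern is used because the member name may carry a ./ prefix):

tar -xOf minitron-4b-base.nemo --wildcards '*model_config.yaml' \
    | grep -E 'tensor_model_parallel_size|pipeline_model_parallel_size'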

Steps

  1. Untar the nemo checkpoint

    tar -xf minitron-4b-base.nemo
  2. Move the original checkpoint to a safe location, both to avoid re-downloading it if something goes wrong and to get it out of the way before running tar later

    mv minitron-4b-base.nemo ../

  3. Modify tensor_model_parallel_size

    vim model_config.yaml
mcore_gpt: true
micro_batch_size: 4
global_batch_size: 1152
- tensor_model_parallel_size: 4
+ tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
virtual_pipeline_model_parallel_size: null
encoder_seq_length: 4096
max_position_embeddings: 4096
num_layers: 32
hidden_size: 3072
...
  4. Rebuild the nemo checkpoint (tar everything in this directory)
    tar -cvf minitron-4b-base.nemo *
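
For convenience, the same procedure can be scripted. This is only a sketch of the manual steps above (it assumes you run it in the directory holding the .nemo file and that tensor_model_parallel_size appears once as a top-level key in model_config.yaml):

#!/bin/bash
# Scripted version of the manual fix above; the original checkpoint (moved to ../) serves as the backup.
set -euo pipefail

CKPT=minitron-4b-base.nemo

tar -xf "${CKPT}"          # 1. untar the .nemo checkpoint into the current directory
mv "${CKPT}" ../           # 2. move the original out of the way so it isn't packed into the new archive
sed -i 's/^tensor_model_parallel_size: 4/tensor_model_parallel_size: 1/' model_config.yaml   # 3. TP 4 -> 1
tar -cvf "${CKPT}" *       # 4. rebuild the checkpoint from everything in this directory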