NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

llama2 training hangs when pp_size > 1 #9146

Closed (PurvangL closed this issue 4 months ago)

PurvangL commented 6 months ago

Describe the bug

I am following the guide to fine-tune the llama2-7B model on 2 nodes (H100).

My training hangs during dataloader sanity checking.

[NeMo I 2024-05-08 21:28:45 modelPT:724] Optimizer config = MegatronDistributedFusedAdam (
    Parameter Group 0
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        is_expert: False
        lr: 5e-06
        weight_decay: 0.01
    )
[NeMo I 2024-05-08 21:28:45 lr_scheduler:772] Scheduler not initialized as no `sched` config supplied to setup_optimizer()

  | Name  | Type          | Params
----------------------------------------
0 | model | Float16Module | 437 M 
----------------------------------------
437 M     Trainable params
0         Non-trainable params
437 M     Total params
1,751.384 Total estimated model params size (MB)
Sanity Checking: |                                                                                                                                                                              | 0/? [00:00<?, ?it/s]
[NeMo W 2024-05-08 21:28:45 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=25` in the `DataLoader` to improve performance.

[NeMo W 2024-05-08 21:28:45 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.

[NeMo W 2024-05-08 21:28:45 nemo_logging:349] /opt/apex/apex/transformer/pipeline_parallel/utils.py:81: UserWarning: This function is only for unittest
      warnings.warn("This function is only for unittest")

Sanity Checking DataLoader 0:   0%|                                                                                                                                                             | 0/2 [00:00<?, ?it/s]
[rank0]:[I ProcessGroupNCCL.cpp:1970] [PG 5 Rank 0] NCCL_DEBUG: N/A
[rank4]:[I ProcessGroupNCCL.cpp:1970] [PG 5 Rank 0] NCCL_DEBUG: N/A
Sanity Checking DataLoader 0:  50%|██████████████████████████████████████████████████████████████████████████▌                                                                          | 1/2 [00:13<00:13,  0.08it/s]


Steps/Code to reproduce bug

Docker image: nvcr.io/nvidia/nemo:24.03.01.framework. I follow the guide to run llama2-7B. The command I run on each node is:

time MELLANOX_VISIBLE_DEVICES=all NCCL_IB_HCA=^mlx5 NCCL_IB_DISABLE=1 NCCL_P2P_DISABLE=1 NCCL_SOCKET_IFNAME=ens2 \
  torchrun --nproc_per_node=8 --nnodes=2 --node_rank=${RANK} --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:12312 \
  /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
    trainer.precision=bf16 \
    trainer.devices=8 \
    trainer.num_nodes=2 \
    trainer.val_check_interval=1.0 \
    trainer.max_steps=${STEPS} \
    model.restore_from_path=${MODEL} \
    model.micro_batch_size=${MBS} \
    model.global_batch_size=${BS} \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.activations_checkpoint_num_layers=1 \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.megatron_amp_O2=True \
    model.sequence_parallel=False \
    model.activations_checkpoint_granularity=full \
    model.activations_checkpoint_method=uniform \
    model.optim.name=distributed_fused_adam \
    model.optim.lr=5e-6 \
    model.answer_only_loss=True \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.test_ds.file_names=${TEST} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=512 \
    model.data.validation_ds.max_seq_length=512 \
    model.data.train_ds.micro_batch_size=${MBS} \
    model.data.train_ds.global_batch_size=${BS} \
    model.data.validation_ds.micro_batch_size=${MBS} \
    model.data.validation_ds.global_batch_size=${BS} \
    model.data.test_ds.micro_batch_size=${MBS} \
    model.data.test_ds.global_batch_size=${BS} \
    model.data.train_ds.num_workers=0 \
    model.data.validation_ds.num_workers=0 \
    model.data.test_ds.num_workers=0 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=/workspace/result \
    exp_manager.resume_if_exists=False \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=False \
    exp_manager.checkpoint_callback_params.monitor=train_loss \
    exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    ++cluster_type=BCP \
    model.data.validation_ds.file_names=${VALID}
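
Not part of the original report, but a hedged debugging suggestion: rerunning the same launch with extra NCCL and torch.distributed logging usually shows which rendezvous step or collective the ranks are stuck in. The variables below are standard NCCL/PyTorch knobs; the elided torchrun arguments are assumed to be identical to the command above.

    # Diagnostic rerun sketch (assumption: same launch as above, only extra logging added)
    export NCCL_DEBUG=INFO                  # print NCCL init/transport decisions per rank
    export NCCL_DEBUG_SUBSYS=INIT,COLL      # focus on initialization and collective calls
    export TORCH_DISTRIBUTED_DEBUG=DETAIL   # log c10d collectives with ranks involved
    torchrun --nproc_per_node=8 --nnodes=2 --node_rank=${RANK} \
      --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:12312 \
      /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
      ...   # remaining Hydra overrides unchanged from the command above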


Expected behavior

Training proceeds past the dataloader sanity check and fine-tuning runs when pipeline_model_parallel_size (pp_size) > 1, instead of hanging.

Environment overview

Environment details

NVIDIA docker image nvcr.io/nvidia/nemo:24.03.01.framework is used, so the environment is the one shipped in the container.

Additional context

GPU model: 16x H100 (2 nodes with 8 GPUs each).

Please let me know if any other information is needed. Thank you.

ericharper commented 5 months ago

@maanug-nv , could you look at this one?

PurvangL commented 5 months ago

@ericharper, @maanug-nv: I also tried running on a Slurm cluster; please find the logs below.

[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:176]                                                                                                                                                                     
    name: megatron_gpt_sft                                                                                                                                                                                            
    trainer:                                                                                                                                                                                                          
      devices: 8
      accelerator: gpu
      num_nodes: 2
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 1.0
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: /workspace/result
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_loss
        save_top_k: 2
        mode: max
        save_nemo_on_train_end: true
        filename: megatron_gpt_sft--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{consumed_samples}
        model_parallel_size: ${model.tensor_model_parallel_size}
        save_best_model: false
    model:
      seed: 1234
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 4
      global_batch_size: 128
      micro_batch_size: 1
      restore_from_path: /workspace/llama27b.nemo
      resume_from_checkpoint: null
      save_nemo_on_validation_end: true
      sync_batch_comm: false
      megatron_amp_O2: true
      sequence_parallel: true
      activations_checkpoint_granularity: selective
      activations_checkpoint_method: uniform  
      activations_checkpoint_num_layers: null
      activations_checkpoint_layers_per_pipeline: null
      answer_only_loss: true
      gradient_as_bucket_view: false
      seq_len_interpolation_factor: null
      use_flash_attention: null
      hidden_dropout: 0.0
      attention_dropout: 0.0
      ffn_dropout: 0.0
      data:
        chat: false
        chat_prompt_tokens:
          system_turn_start: <extra_id_0>
          turn_start: <extra_id_1>
          label_start: <extra_id_2>
          end_of_turn: '

            '
          end_of_name: '

            '
        train_ds:
          file_names:
          - /workspace/self_instruct_data/training.jsonl
          global_batch_size: 128
          micro_batch_size: 1
          shuffle: true
          num_workers: 0
          memmap_workers: null
          pin_memory: true
          max_seq_length: 512
          min_seq_length: 1
          drop_last: true
          concat_sampling_probabilities:
          - 1
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: false
          truncation_field: input
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          hf_dataset: false
          truncation_method: right
        validation_ds:
          file_names:
          - /workspace/self_instruct_data/validation.jsonl
          names: null
          global_batch_size: 128
          micro_batch_size: 1
          shuffle: false
          num_workers: 0
          memmap_workers: ${model.data.train_ds.memmap_workers}
          pin_memory: true
          max_seq_length: 512
          min_seq_length: 1
          drop_last: false
          label_key: ${model.data.train_ds.label_key} 
          add_eos: ${model.data.train_ds.add_eos}
          add_sep: ${model.data.train_ds.add_sep}
          add_bos: ${model.data.train_ds.add_bos}
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: ${model.data.train_ds.truncation_field}
          index_mapping_dir: null
          prompt_template: ${model.data.train_ds.prompt_template}
          tokens_to_generate: 32
          hf_dataset: false
          truncation_method: right
          metric:
            name: loss
            average: null
            num_classes: null
        test_ds:
          file_names:
          - /workspace/self_instruct_data/test.jsonl
          names: null
          global_batch_size: 256
          micro_batch_size: 1
          shuffle: false
          num_workers: 0
          memmap_workers: ${model.data.train_ds.memmap_workers}
          pin_memory: true
          max_seq_length: ${model.data.train_ds.max_seq_length}
          min_seq_length: 1
          drop_last: false
          label_key: ${model.data.train_ds.label_key} 
          add_eos: ${model.data.train_ds.add_eos}
          add_sep: ${model.data.train_ds.add_sep}
          add_bos: ${model.data.train_ds.add_bos}
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: ${model.data.train_ds.truncation_field}
          index_mapping_dir: null
          prompt_template: ${model.data.train_ds.prompt_template}
          tokens_to_generate: 32
          hf_dataset: false
          truncation_method: right
          metric:
            name: loss
            average: null
            num_classes: null
      optim:
        name: distributed_fused_adam
        lr: 5.0e-06
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
    inference:
      greedy: true
      top_k: 0
      top_p: 0.9
      temperature: 1.0
      all_probs: false
      repetition_penalty: 1.2
      min_tokens_to_generate: 0
      compute_logprob: false
      compute_attention_mask: true
    cluster_type: BCP

[NeMo W 2024-05-23 11:06:21 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:554: UserWarning: bf16 is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
      rank_zero_warn(

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo E 2024-05-23 11:06:21 exp_manager:556] You are running multi-node training without SLURM handling the processes. Please note that this is not tested in NeMo and could result in errors.
[NeMo W 2024-05-23 11:06:21 exp_manager:708] Exp_manager is logging to /workspace/result, but it already exists.
[NeMo W 2024-05-23 11:06:21 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/result/checkpoints. Training from scratch.
[NeMo I 2024-05-23 11:06:21 exp_manager:396] Experiments will be logged at /workspace/result
[NeMo I 2024-05-23 11:06:21 exp_manager:856] TensorboardLogger has been set up
[NeMo W 2024-05-23 11:06:21 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:213] Resuming training from checkpoint: None
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[NeMo I 2024-05-23 11:06:28 megatron_init:253] Rank 0 has data parallel group : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:259] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:264] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:267] Ranks 0 has data parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:284] Rank 0 has context parallel group: [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:287] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:288] Ranks 0 has context parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:299] Rank 0 has model parallel group: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[NeMo I 2024-05-23 11:06:28 megatron_init:300] All model parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:310] Rank 0 has tensor model parallel group: [0, 1, 2, 3]
[NeMo I 2024-05-23 11:06:28 megatron_init:314] All tensor model parallel group ranks: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:315] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:344] Rank 0 has pipeline model parallel group: [0, 4, 8, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:356] Rank 0 has embedding group: [0, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:362] All pipeline model parallel group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:363] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-05-23 11:06:28 megatron_init:364] All embedding group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:365] Rank 0 has embedding rank: 0
24-05-23 11:06:28 - PID:154683 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 128
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-05-23 11:06:28 tokenizer_utils:185] Getting SentencePiece with model: /tmp/tmpyuitwp3o/a290efe8ded54b8da6a27eb8ecea4895_tokenizer.model
[NeMo I 2024-05-23 11:06:28 megatron_base_model:574] Padded vocab_size: 32256, original vocab_size: 32000, dummy tokens: 256.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:489] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: add_qkv_bias in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: rotary_interleaved in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:12312.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:12312.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 12312.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/16
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/16
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/16
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
Matplotlib created a temporary cache directory at /tmp/matplotlib-518pojrm because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-qze86_xq because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-rrzai_1n because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-449boyjo because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-9w_wgl4h because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-d5pwia0k because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-w0euwkph because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-wgywjpl6 because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
… (the "No socket … is listening yet, will retry" messages continue repeating; the job never gets past this point.)
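
For context (not stated in the original report): with 2 nodes of 8 GPUs, tensor_model_parallel_size: 4 and pipeline_model_parallel_size: 4 consume all 16 ranks for model parallelism, leaving a data-parallel size of 1, which is why the log reports "setting number of micro-batches to constant 128" (global_batch_size 128 with micro_batch_size 1). A minimal sketch of that arithmetic, using values copied from the config above:

    # Parallelism arithmetic implied by the Slurm config above
    WORLD_SIZE=16   # 2 nodes x 8 GPUs
    TP=4            # tensor_model_parallel_size
    PP=4            # pipeline_model_parallel_size
    GBS=128         # global_batch_size
    MBS=1           # micro_batch_size
    DP=$(( WORLD_SIZE / (TP * PP) ))            # data-parallel size = 1
    NUM_MICROBATCHES=$(( GBS / (MBS * DP) ))    # 128, matching the microbatches.py log line
    echo "DP=${DP}, micro-batches per global step=${NUM_MICROBATCHES}"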
PurvangL commented 5 months ago

@ericharper @maanug-nv, following up on the issue posted above. Please let me know if any other information is needed.

maanug-nv commented 4 months ago

Hi @PurvangL, I see you've closed this issue; were you able to resolve it? I haven't had time to reproduce this issue with SFT, but I've encountered long init times with pretraining that might look like hangs, yet eventually start training. Sorry for the lack of response; if I can get around to reproducing this specific case, I'll let you know. We are also looking into these long init times.

PurvangL commented 4 months ago

Hi @maanug-nv, removing the NCCL_P2P_LEVEL=NVL (or PIX) environment variable and increasing the per-process memory limit to unlimited helped.
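
A hedged sketch of that workaround (the exact mechanism is not spelled out in the thread, so the reading below is an assumption): leave NCCL's peer-to-peer level at its default instead of pinning it, and raise the per-process locked-memory limit, either in the job script or when starting the container.

    # Assumed reading of the workaround above; names and values are illustrative, not confirmed
    unset NCCL_P2P_LEVEL        # do not force NVL or PIX; let NCCL choose the P2P level
    ulimit -l unlimited         # "per-process memory to infinity", read here as the memlock limit
    # When launching via Docker, the equivalent is usually:
    #   docker run --ulimit memlock=-1 --ulimit stack=67108864 ... nvcr.io/nvidia/nemo:24.03.01.framework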