Closed: Lauler closed this issue 1 year ago.
Seems possibly related to this Lightning issue: https://github.com/Lightning-AI/lightning/issues/10098
Have you tried the following config:
...
#SBATCH --nodes=<n>
#SBATCH --tasks-per-node=<m>
#SBATCH --gpus-per-node=<m>
...
trainer.devices=-1 \
trainer.num_nodes=$SLURM_JOB_NUM_NODES \
This will start SLURM_TASKS_PER_NODE tasks on each node. Each task will have access to SLURM_GPUS_PER_NODE GPUs, but each Slurm task will use only one GPU.
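To confirm what each task actually sees under this layout, a quick sanity check (just a sketch, not part of any particular job script) is to run something like this inside the allocation before the training command:
# Sketch: print each task's Slurm IDs and the GPUs it can see.
# srun -l prefixes every output line with the task ID, so the mapping is easy to read.
srun -l bash -c 'echo "PROCID=$SLURM_PROCID LOCALID=$SLURM_LOCALID NODEID=$SLURM_NODEID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'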
Other potential causes may depend on the specifics of your setup; to investigate those, you may wish to enable the following:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
This helped me track down an issue I had with --gpus-per-task; see https://github.com/NVIDIA/pyxis/issues/73.
Thanks for sharing and suggesting fixes. When I try your suggested config of
...
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
...
trainer.devices=-1 \
trainer.num_nodes=$SLURM_JOB_NUM_NODES \
it still has problems setting the global ranks correctly. All the processes now report GLOBAL_RANK: 0. Additionally, it doesn't seem to have a proper sense of the world size:
0: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
1: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
2: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
...
...
7: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
In all the different configurations of sbatch parameters I've tried, PyTorch Lightning seems to have issues setting GLOBAL_RANK correctly. I think it gets set and calculated here: https://github.com/Lightning-AI/lightning/blob/cc56539cd3b3875d9d374d55004b4b86e07b47a9/src/pytorch_lightning/strategies/ddp.py#L194, but I honestly have trouble understanding where the values come from with all those levels of inheritance and imports.
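Under Slurm, Lightning's cluster environment is supposed to pick these values up directly from Slurm's per-task variables, so a rough sketch of what each task ought to report (not taken from the thread, just the standard Slurm variables) is:
# Sketch: the values a SLURM-based cluster environment would be expected to derive per task.
# With --nodes=2 and --ntasks-per-node=4, SLURM_PROCID should run 0..7 and SLURM_NTASKS should be 8.
echo "global rank (SLURM_PROCID):  $SLURM_PROCID"
echo "local rank  (SLURM_LOCALID): $SLURM_LOCALID"
echo "node rank   (SLURM_NODEID):  $SLURM_NODEID"
echo "world size  (SLURM_NTASKS):  $SLURM_NTASKS"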
The misconfigured ranks result in all processes except one crashing when they try to connect to the same address:
0: Error executing job with overrides: ['trainer.devices=-1', 'trainer.num_nodes=2', 'trainer.max_epochs=null', 'trainer.max_steps=300000', 'trainer.val_check_interval=300', 'trainer.log_every_n_steps=50', 'trainer.limit_val_batches=50', 'trainer.limit_test_batches=50', 'trainer.accumulate_grad_batches=1', 'trainer.precision=16', 'model.micro_batch_size=6', 'model.global_batch_size=192', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.max_position_embeddings=1024', 'model.encoder_seq_length=1024', 'model.hidden_size=768', 'model.ffn_hidden_size=3072', 'model.num_layers=12', 'model.num_attention_heads=12', 'model.init_method_std=0.021', 'model.hidden_dropout=0.1', 'model.layernorm_epsilon=1e-5', 'model.tokenizer.vocab_file=gpt2-vocab.json', 'model.tokenizer.merge_file=gpt2-merges.txt', 'model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document]', 'model.data.num_workers=64', 'model.data.seq_length=1024', "model.data.splits_string='980,10,10'", 'model.optim.name=fused_a
0: dam', 'model.optim.lr=6e
0: -4', 'model.optim.betas=[0.9,0.95]', 'model.optim.weight_decay=0.1', 'model.optim.sched.name=CosineAnnealing', 'model.optim.sched.warmup_steps=750', 'model.optim.sched.constant_steps=80000', 'model.optim.sched.min_lr=6e-5', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.monitor=val_loss', 'exp_manager.checkpoint_callback_params.save_top_k=3', 'exp_manager.checkpoint_callback_params.mode=min', 'exp_manager.checkpoint_callback_params.always_save_nemo=False']
0: Traceback (most recent call last):
0: File "/workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 88, in main
0: trainer.fit(model)
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
0: self._call_and_handle_interrupt(
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
0: return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
0: return function(*args, **kwargs)
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
0: results = self._run(model, ckpt_path=self.ckpt_path)
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1102, in _run
0: self.strategy.setup_environment()
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 157, in setup_environment
0: self.setup_distributed()
0: File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 81, in setup_distributed
0: super().setup_distributed()
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 210, in setup_distributed
0: init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
0: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 374, in init_dist_connection
0: torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
0: File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
0: store, rank, world_size = next(rendezvous_iterator)
0: File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
0: store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
0: File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
0: return TCPStore(
0: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53394 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:53394 (errno: 98 - Address already in use).
0:
0: Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
If I change trainer.devices=-1 to trainer.devices=4, I instead get the following error:
4: Error executing job with overrides: ['trainer.devices=4', 'trainer.num_nodes=2', 'trainer.max_epochs=null', 'trainer.max_steps=300000', 'trainer.val_check_interval=300', 'trainer.log_every_n_steps=50', 'trainer.limit_val_batches=50', 'trainer.limit_test_batches=50', 'trainer.accumulate_grad_batches=1', 'trainer.precision=16', 'model.micro_batch_size=6', 'model.global_batch_size=192', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.max_position_embeddings=1024', 'model.encoder_seq_length=1024', 'model.hidden_size=768', 'model.ffn_hidden_size=3072', 'model.num_layers=12', 'model.num_attention_heads=12', 'model.init_method_std=0.021', 'model.hidden_dropout=0.1', 'model.layernorm_epsilon=1e-5', 'model.tokenizer.vocab_file=gpt2-vocab.json', 'model.tokenizer.merge_file=gpt2-merges.txt', 'model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document]', 'model.data.num_workers=64', 'model.data.seq_length=1024', "model.data.splits_string='980,10,10'", 'model.optim.name=fused_ad
4: am', 'model.optim.lr=6e-
4: 4', 'model.optim.betas=[0.9,0.95]', 'model.optim.weight_decay=0.1', 'model.optim.sched.name=CosineAnnealing', 'model.optim.sched.warmup_steps=750', 'model.optim.sched.constant_steps=80000', 'model.optim.sched.min_lr=6e-5', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.monitor=val_loss', 'exp_manager.checkpoint_callback_params.save_top_k=3', 'exp_manager.checkpoint_callback_params.mode=min', 'exp_manager.checkpoint_callback_params.always_save_nemo=False']
4: Traceback (most recent call last):
4: File "/workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 64, in main
4: trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer)
4: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py", line 345, in insert_env_defaults
4: return fn(self, **kwargs)
4: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 433, in __init__
4: self._accelerator_connector = AcceleratorConnector(
4: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 214, in __init__
4: self._set_parallel_devices_and_init_accelerator()
4: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 545, in _set_parallel_devices_and_init_accelerator
4: self._devices_flag = self.accelerator.parse_devices(self._devices_flag)
4: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/cuda.py", line 77, in parse_devices
4: return device_parser.parse_gpu_ids(devices, include_cuda=True)
4: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 125, in parse_gpu_ids
4: return _sanitize_gpu_ids(gpus, include_cuda=include_cuda, include_mps=include_mps)
4: File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 209, in _sanitize_gpu_ids
4: raise MisconfigurationException(
4: pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1, 2, 3]
4: But your machine only has: [0]
We haven't previously used this pattern of SLURM_TASKS_PER_NODE being equal to the number of GPUs per node when training with Megatron-LM and launching jobs with torch.distributed.launch. There we launch with one process per node.
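For reference, that Megatron-LM launch pattern looks roughly like the sketch below (script name and arguments are placeholders); each node runs a single launcher process, which then spawns one worker per GPU:
# Rough sketch of the one-launcher-per-node pattern (arguments abbreviated, script name is a placeholder).
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    pretrain_gpt.py ...bunch-of-args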
The likely culprit in our case seems to be that only 1 GPU appears to be available per process when running torch.cuda.device_count(). However, all 4 GPU devices show up in each of the individual processes when running nvidia-smi or nvidia-smi -L.
*edit: Although echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES" shows only a single visible device for each task :unamused:.
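That is consistent with torch.cuda.device_count() respecting CUDA_VISIBLE_DEVICES, while nvidia-smi queries the driver directly and ignores it. A per-task check along these lines (a sketch, not part of our scripts) makes the mismatch visible:
# Sketch: compare the GPUs physically present on the node with the GPUs each task may actually use.
srun -l bash -c '
  echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
  nvidia-smi -L | wc -l                                        # GPUs on the node (ignores CUDA_VISIBLE_DEVICES)
  python -c "import torch; print(torch.cuda.device_count())"   # GPUs this task can use
'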
We found a rather hacky solution to make training work. To anyone reading this in the future who runs into the same issue:
Our problem was that not all GPUs in a node were visible to a process whenever we started Slurm jobs with the recommended setting of --ntasks-per-node equal to trainer.devices. This resulted in only 1 GPU being visible per process, and we weren't able to rectify that.
Our workaround: request only --nodes=2 and --gres=gpu:4 (without --ntasks-per-node). Export the MASTER_ADDR as an environment variable. Export the GLOBAL_RANK of the current process in the bash script of the file that launches the training. Do NOT export the LOCAL_RANK as an environment variable (Lightning/PyTorch DDP will get stuck without initializing the rest of the GPUs if you do). For reference, here's our sbatch script:
#!/bin/bash -l
#SBATCH --partition=gpu
#SBATCH --qos=test
#SBATCH --account=p200097
#SBATCH --job-name=gpt_nemo
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --time=0-00:30:00
#SBATCH --output=logs/gpt_nemo.log
# Modules
pwd
module purge
module load Singularity-CE
## Create needed distributed env variables
addr=$(/bin/hostname -s)
export MASTER_ADDR=$addr
export MASTER_PORT=16783 # Meluxina overwrites this variable after srun
export GPUS_PER_NODE=4
export NCCL_CROSS_NIC=1
# debugging flags (optional)
export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=ALL
export PYTHONFAULTHANDLER=1
export HYDRA_FULL_ERROR=1
# Logfile
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
PROJECT=/project/home/p200097/faton/nemo_test # Use abs path, not symbolic link
CONTAINER_PATH=/project/home/p200097/faton/nemo_test/nemo2209b.sif
LOGGING=$PROJECT/logs
LOGFILE="${LOGGING}/%x_${DATETIME}.log"
echo $LOGFILE
ls -lh
cmd="srun -l --output=$LOGGING/gpt_nemo_$DATETIME.log \
singularity exec --nv --bind $PROJECT:$PROJECT --bind /project/scratch/p200097/data/nemo_test:/mnt $CONTAINER_PATH \
bash $PROJECT/training_args.sh"
$cmd
And here's our training_args.sh that launches the training for each process (1 process per node):
/bin/hostname -s
export MASTER_PORT=16783
export NODE_RANK=$SLURM_NODEID
# export LOCAL_RANK=$SLURM_LOCALID # Local rank needs to be uninitialized for Lightning to work properly with DDP and 1 process per node
# export GLOBAL_RANK=$SLURM_PROCID # if --ntasks-per-node == devices, then PROCID is the global_rank. But training with --ntasks-per-node doesn't work.
export GLOBAL_RANK=$((SLURM_NODEID * GPUS_PER_NODE + LOCAL_RANK)) # When only 1 process per node, this calculates global_rank
echo "----------"
echo "NODE_RANK" $NODE_RANK
echo "LOCAL_RANK" $LOCAL_RANK
echo "GLOBAL_RANK" $GLOBAL_RANK
echo "WORLD_SIZE" $WORLD_SIZE
echo "MASTER_PORT" $MASTER_PORT
echo "---------------------"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
nvidia-smi -L
python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=/workspace/nemo/examples/nlp/language_modeling/conf \
--config-name=megatron_gpt_config \
trainer.devices=$GPUS_PER_NODE \
...bunch-of-args \
...
Hope this helps someone in the future trying to train multi-node with NeMo and Slurm.
We use Slurm for all our clusters, and none of the above is needed. We follow the PTL guidelines, and the only thing we normally do is set the CUDA_VISIBLE_DEVICES variable with all the GPUs in the list. That seems to work fine without resorting to these steps.
So if there are 8 GPUs per node, we do CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python nemo_script.py ... trainer.num_nodes=x trainer.devices=-1
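Putting that together with a Slurm allocation of 4 GPUs per node, a minimal sketch of such a launch (the script name and remaining overrides are placeholders) would be:
# Minimal sketch assuming 4 GPUs per node; CUDA device indices are 0-based.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python nemo_script.py \
    trainer.num_nodes=$SLURM_JOB_NUM_NODES \
    trainer.devices=-1 \
    ...bunch-of-args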
Thank you very much @titu1994. It had never occurred to me that CUDA_VISIBLE_DEVICES would be something a user would manually want or need to edit; I always assumed it would be set correctly by the system and not be something for a user to touch.
It would probably be helpful if you posted an example sbatch script in the documentation, to save others from future headaches.
Thanks again for the tip about setting CUDA_VISIBLE_DEVICES; it works perfectly with the PTL guidelines.
@SeanNaren we should note this in your AWS tutorial (though I dunno if that uses Slurm directly or AWS SageMaker). Maybe also let's comment on the PTL Slack to add this info to the end of https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_advanced.html
Describe the bug
We are trying to get multi-node training to work with NeMo Megatron by following the quick start steps in your GPT model training docs. We're using Slurm on an HPC, and are able to successfully train using Megatron-LM, but not with NeMo.
NeMo keeps insisting we are running multi-node training without SLURM handling the processes:
and the global ranks of our GPUs seem to be incorrectly initialised as a result:
Steps/Code to reproduce bug
%environment export LC_ALL=C
And here are the settings and launch script in training_args.sh:
Expected behavior
NeMo/PyTorch Lightning recognizing that the job is run through Slurm and starting it successfully.
Environment overview (please complete the following information)
Additional context
One node in our case consists of 4 A100 GPUs.
We saw that you referred to the PyTorch Lightning documentation when asked about multi-node training in this previous issue. However, the PyTorch Lightning docs' example sbatch script has a setting that makes no sense to us:
If we set --ntasks-per-node=4, this creates 4 separate processes on a node consisting of 4 GPUs; each GPU is placed in a separate process, with only a single GPU being available per process. We tried the above method, and it only resulted in training crashing because NeMo/Lightning expected 4 devices (0, 1, 2, 3) but only saw one device (0) per process.
In the GitHub issue thread we referenced, you write that you use Slurm internally. Could you provide a working example of launching a multi-node job with NeMo Megatron using sbatch and the example in your docs?
Log outputs: