NVIDIA / Megatron-LM

Ongoing research training transformer models at scale

[BUG] 'NoneType' object has no attribute 'shape' error raised when saving model state with the pretrain_gpt.py #1134

Open hwang2006 opened 2 months ago

hwang2006 commented 2 months ago

Hi, the same code works fine with the Megatron-LM that I git-cloned in April. With the latest Megatron-LM, the following error is raised by pretrain_gpt.py, so it seems the Megatron core code has been updated since April. Note that, for testing purposes, the --save-interval argument is set to 50 below.

$ CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 2 pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 16 \
    --lr 0.00015 \
    --train-iters 200 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --data-path my-gpt2_text_document \
    --vocab-file vocab.json \
    --merge-file merges.txt \
    --split 949,50,1 \
    --log-interval 10 \
    --save-interval 50 \
    --eval-interval 100 \
    --eval-iters 10 \
    --distributed-backend nccl \
    --save checkpoints/gpt2_345m_dist_mp \
    --load checkpoints/gpt2_345m_dist_mp \
    --attention-softmax-in-fp32 \
    --sequence-parallel
W0913 13:50:04.400000 47065146630080 torch/distributed/run.py:779]
W0913 13:50:04.400000 47065146630080 torch/distributed/run.py:779] *****************************************
W0913 13:50:04.400000 47065146630080 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0913 13:50:04.400000 47065146630080 torch/distributed/run.py:779] *****************************************
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:280: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:280: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:381: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:420: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:381: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:420: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
using world size: 2, data-parallel size: 1, context-parallel size: 1, tensor-model-parallel size: 2, encoder-tensor-model-parallel size: 0, pipeline-model-parallel size: 1, encoder-pipeline-model-parallel size: 0
WARNING: Setting args.overlap_p2p_comm and args.align_param_gather to False since non-interleaved schedule does not support overlapping p2p communication and aligned param AG
WARNING: Setting args.check_for_nan_in_loss_and_grad to False since dynamic loss scaling is being used
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  align_grad_reduce ............................... True
  align_param_gather .............................. False
  app_tag_run_name ................................ None
  app_tag_run_version ............................. 0.0.0
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_save ...................................... None
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... True
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  calculate_per_token_loss ........................ False
  check_for_nan_in_loss_and_grad .................. False
  check_weight_hash_across_dp_replicas_interval ... None
  ckpt_assume_constant_structure .................. False
  ckpt_convert_format ............................. None
  ckpt_convert_save ............................... None
  ckpt_convert_update_legacy_dist_opt_format ...... False
  ckpt_format ..................................... torch_dist
  ckpt_fully_parallel_load ........................ False
  ckpt_fully_parallel_save ........................ True
  ckpt_fully_parallel_save_deprecated ............. False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  config_logger_dir ...............................
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  create_attention_mask_in_dataloader ............. True
  cross_entropy_loss_fusion ....................... False
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['my-gpt2_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  ddp_average_in_collective ....................... False
  ddp_bucket_size ................................. None
  decoder_first_pipeline_num_layers ............... None
  decoder_last_pipeline_num_layers ................ None
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  decrease_batch_size_if_needed ................... False
  defer_embedding_wgrad_compute ................... False
  deprecated_use_mcore_models ..................... False
  deterministic_mode .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  disable_straggler_on_startup .................... False
  dist_ckpt_format_deprecated ..................... None
  dist_ckpt_strictness ............................ assume_ok_unexpected
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_ft_package ............................... False
  enable_one_logger ............................... True
  encoder_num_layers .............................. 24
  encoder_pipeline_model_parallel_size ............ 0
  encoder_seq_length .............................. 1024
  encoder_tensor_model_parallel_size .............. 0
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 100
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 4096
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_param_gather ................................ False
  fp8_wgrad ....................................... True
  global_batch_size ............................... 16
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1024
  hybrid_attention_ratio .......................... 0.0
  hybrid_mlp_ratio ................................ 0.0
  hybrid_override_pattern ......................... None
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 64
  lazy_mpu_init ................................... None
  load ............................................ checkpoints/gpt2_345m_dist_mp
  local_rank ...................................... 0
  log_interval .................................... 10
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... False
  log_straggler ................................... False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  logging_level ................................... None
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.00015
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. 0.01
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  lr_wsd_decay_iters .............................. None
  lr_wsd_decay_samples ............................ None
  lr_wsd_decay_style .............................. exponential
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 1024
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... merges.txt
  micro_batch_size ................................ 4
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_bin_files .................................. True
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.0
  moe_expert_capacity_factor ...................... None
  moe_extended_tp ................................. False
  moe_grouped_gemm ................................ False
  moe_input_jitter_eps ............................ None
  moe_layer_recompute ............................. False
  moe_pad_expert_input_to_capacity ................ False
  moe_per_layer_logging ........................... False
  moe_router_load_balancing_type .................. aux_loss
  moe_router_pre_softmax .......................... False
  moe_router_topk ................................. 2
  moe_token_dispatcher_type ....................... allgather
  moe_token_drop_policy ........................... probs
  moe_use_upcycling ............................... False
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  non_persistent_ckpt_type ........................ None
  non_persistent_global_ckpt_dir .................. None
  non_persistent_local_ckpt_algo .................. fully_parallel
  non_persistent_local_ckpt_dir ................... None
  non_persistent_save_interval .................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 16
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_dataset_builder_threads ..................... 1
  num_experts ..................................... None
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  one_logger_async ................................ False
  one_logger_project .............................. megatron-lm
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  overlap_param_gather_with_optimizer_step ........ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.float16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... learned_absolute
  pretrained_checkpoint ........................... None
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  qk_layernorm .................................... False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  renormalize_blend_weights ....................... False
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rotary_base ..................................... 10000
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  s3_cache_path ................................... None
  sample_rate ..................................... 1.0
  save ............................................ checkpoints/gpt2_345m_dist_mp
  save_interval ................................... 50
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 1024
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  skipped_train_samples ........................... 0
  spec ............................................ None
  split ........................................... 949,50,1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  straggler_ctrlr_port ............................ 65535
  straggler_minmax_count .......................... 1
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 2
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  tiktoken_num_special_tokens ..................... 1000
  tiktoken_pattern ................................ None
  tiktoken_special_tokens ......................... None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. GPT2BPETokenizer
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_overlap_rs_dgrad ........................ False
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 200
  train_samples ................................... None
  train_sync_interval ............................. None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_dist_ckpt ................................... True
  use_dist_ckpt_deprecated ........................ False
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_legacy_models ............................... False
  use_one_sent_docs ............................... False
  use_pytorch_profiler ............................ False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  use_tp_pp_dp_mapping ............................ False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... vocab.json
  vocab_size ...................................... None
  wandb_exp_name ..................................
  wandb_project ...................................
  wandb_save_dir ..................................
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  wgrad_deferral_limit ............................ 0
  world_size ...................................... 2
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 4
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
> initializing torch distributed ...
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory `/scratch/qualis/test/Megatron-LM/megatron/core/datasets'
make: Nothing to be done for `default'.
make: Leaving directory `/scratch/qualis/test/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.079 seconds
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 0.187 seconds
[rank1]:[W913 13:50:25.428850695 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank0]:[W913 13:50:25.428869340 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
time to initialize megatron (seconds): 13.838
[after megatron is initialized] datetime: 2024-09-13 13:50:34
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 178100224
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 178100224
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=False, overlap_grad_reduce=False, overlap_param_gather=False, align_param_gather=False, use_distributed_optimizer=False, check_for_nan_in_grad=False, bucket_size=None, average_in_collective=False, fp8_param_gather=False)
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
Params for bucket 1 (178100224 elements):
        module.decoder.layers.5.self_attention.linear_proj.weight
        module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.2.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.0.mlp.linear_fc1.bias
        module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.16.mlp.linear_fc2.weight
        module.decoder.layers.5.mlp.linear_fc2.bias
        module.decoder.layers.4.mlp.linear_fc2.bias
        module.decoder.layers.3.mlp.linear_fc1.weight
        module.decoder.layers.2.mlp.linear_fc2.bias
        module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.0.self_attention.linear_qkv.bias
        module.decoder.layers.7.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.7.mlp.linear_fc1.weight
        module.decoder.layers.23.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.10.mlp.linear_fc2.weight
        module.decoder.layers.9.mlp.linear_fc2.weight
        module.decoder.layers.4.self_attention.linear_proj.weight
        module.decoder.layers.1.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.23.mlp.linear_fc1.bias
        module.decoder.layers.22.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.21.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.20.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.5.self_attention.linear_qkv.weight
        module.decoder.layers.2.mlp.linear_fc1.weight
        module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.6.mlp.linear_fc1.bias
        module.decoder.layers.3.mlp.linear_fc1.bias
        module.decoder.layers.18.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.23.self_attention.linear_proj.bias
        module.decoder.layers.21.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.20.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.19.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.17.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.15.mlp.linear_fc2.bias
        module.decoder.layers.10.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.2.mlp.linear_fc2.weight
        module.decoder.layers.0.mlp.linear_fc2.weight
        module.decoder.layers.14.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.13.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.12.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.11.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.2.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.1.self_attention.linear_qkv.bias
        module.embedding.word_embeddings.weight
        module.decoder.layers.6.mlp.linear_fc2.bias
        module.decoder.layers.8.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.7.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.6.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.8.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.9.mlp.linear_fc1.bias
        module.decoder.layers.8.mlp.linear_fc1.weight
        module.decoder.layers.4.mlp.linear_fc2.weight
        module.decoder.layers.23.self_attention.linear_qkv.bias
        module.decoder.layers.22.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.3.self_attention.linear_proj.bias
        module.decoder.layers.1.mlp.linear_fc1.bias
        module.embedding.position_embeddings.weight
        module.decoder.layers.23.mlp.linear_fc1.weight
        module.decoder.layers.22.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.21.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.20.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.19.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.18.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.17.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.5.mlp.linear_fc2.weight
        module.decoder.layers.0.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.0.self_attention.linear_proj.weight
        module.decoder.layers.23.self_attention.linear_proj.weight
        module.decoder.layers.21.mlp.linear_fc1.bias
        module.decoder.layers.20.mlp.linear_fc1.bias
        module.decoder.layers.19.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.18.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.17.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.16.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.15.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.11.self_attention.linear_qkv.bias
        module.decoder.layers.2.self_attention.linear_proj.weight
        module.decoder.layers.23.mlp.linear_fc2.bias
        module.decoder.layers.21.self_attention.linear_proj.bias
        module.decoder.layers.20.self_attention.linear_proj.bias
        module.decoder.layers.14.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.13.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.12.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.11.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.10.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.9.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.8.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.7.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.6.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.3.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.0.self_attention.linear_qkv.weight
        module.decoder.layers.2.self_attention.linear_qkv.weight
        module.decoder.layers.23.self_attention.linear_qkv.weight
        module.decoder.layers.22.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.3.self_attention.linear_qkv.weight
        module.decoder.final_layernorm.weight
        module.decoder.layers.22.mlp.linear_fc1.bias
        module.decoder.layers.21.self_attention.linear_qkv.bias
        module.decoder.layers.20.self_attention.linear_qkv.bias
        module.decoder.layers.19.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.18.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.17.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.16.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.9.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.4.self_attention.linear_proj.bias
        module.decoder.layers.3.mlp.linear_fc2.weight
        module.decoder.layers.3.mlp.linear_fc2.bias
        module.decoder.layers.21.mlp.linear_fc1.weight
        module.decoder.layers.20.mlp.linear_fc1.weight
        module.decoder.layers.19.mlp.linear_fc1.bias
        module.decoder.layers.18.mlp.linear_fc1.bias
        module.decoder.layers.17.mlp.linear_fc1.bias
        module.decoder.layers.16.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.15.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.15.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.14.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.13.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.19.self_attention.linear_proj.bias
        module.decoder.layers.18.self_attention.linear_proj.bias
        module.decoder.layers.21.self_attention.linear_proj.weight
        module.decoder.layers.20.self_attention.linear_proj.weight
        module.decoder.layers.17.self_attention.linear_proj.bias
        module.decoder.layers.14.mlp.linear_fc1.bias
        module.decoder.layers.13.mlp.linear_fc1.bias
        module.decoder.layers.12.mlp.linear_fc1.bias
        module.decoder.layers.11.mlp.linear_fc1.bias
        module.decoder.layers.10.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.1.self_attention.linear_proj.bias
        module.decoder.layers.1.mlp.linear_fc2.weight
        module.decoder.layers.21.mlp.linear_fc2.bias
        module.decoder.layers.20.mlp.linear_fc2.bias
        module.decoder.layers.15.self_attention.linear_proj.bias
        module.decoder.layers.14.self_attention.linear_proj.bias
        module.decoder.layers.13.self_attention.linear_proj.bias
        module.decoder.layers.12.self_attention.linear_proj.bias
        module.decoder.layers.11.self_attention.linear_proj.bias
        module.decoder.layers.8.mlp.linear_fc1.bias
        module.decoder.layers.11.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.7.mlp.linear_fc1.bias
        module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.23.mlp.linear_fc2.weight
        module.decoder.layers.22.self_attention.linear_qkv.bias
        module.decoder.layers.9.self_attention.linear_proj.bias
        module.decoder.layers.8.self_attention.linear_proj.bias
        module.decoder.layers.7.self_attention.linear_proj.bias
        module.decoder.layers.6.self_attention.linear_proj.bias
        module.decoder.layers.4.self_attention.linear_qkv.bias
        module.decoder.layers.22.mlp.linear_fc1.weight
        module.decoder.layers.21.self_attention.linear_qkv.weight
        module.decoder.layers.20.self_attention.linear_qkv.weight
        module.decoder.layers.19.self_attention.linear_qkv.bias
        module.decoder.layers.18.self_attention.linear_qkv.bias
        module.decoder.layers.17.self_attention.linear_qkv.bias
        module.decoder.layers.16.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.4.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.2.self_attention.linear_qkv.bias
        module.decoder.layers.4.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.22.self_attention.linear_proj.bias
        module.decoder.layers.19.mlp.linear_fc1.weight
        module.decoder.layers.18.mlp.linear_fc1.weight
        module.decoder.layers.17.mlp.linear_fc1.weight
        module.decoder.layers.16.mlp.linear_fc1.bias
        module.decoder.layers.15.mlp.linear_fc2.weight
        module.decoder.layers.15.self_attention.linear_qkv.bias
        module.decoder.layers.14.self_attention.linear_qkv.bias
        module.decoder.layers.13.self_attention.linear_qkv.bias
        module.decoder.layers.12.self_attention.linear_qkv.bias
        module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.18.self_attention.linear_proj.weight
        module.decoder.layers.22.mlp.linear_fc2.bias
        module.decoder.layers.19.self_attention.linear_proj.weight
        module.decoder.layers.17.self_attention.linear_proj.weight
        module.decoder.layers.16.self_attention.linear_proj.bias
        module.decoder.layers.14.mlp.linear_fc1.weight
        module.decoder.layers.13.mlp.linear_fc1.weight
        module.decoder.layers.12.mlp.linear_fc1.weight
        module.decoder.layers.11.mlp.linear_fc1.weight
        module.decoder.layers.10.mlp.linear_fc1.bias
        module.decoder.layers.0.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.19.mlp.linear_fc2.bias
        module.decoder.layers.18.mlp.linear_fc2.bias
        module.decoder.layers.17.mlp.linear_fc2.bias
        module.decoder.layers.15.self_attention.linear_proj.weight
        module.decoder.layers.14.self_attention.linear_proj.weight
        module.decoder.layers.13.self_attention.linear_proj.weight
        module.decoder.layers.12.self_attention.linear_proj.weight
        module.decoder.layers.11.self_attention.linear_proj.weight
        module.decoder.layers.12.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.10.self_attention.linear_proj.bias
        module.decoder.layers.6.self_attention.linear_qkv.bias
        module.decoder.layers.22.self_attention.linear_qkv.weight
        module.decoder.layers.14.mlp.linear_fc2.bias
        module.decoder.layers.13.mlp.linear_fc2.bias
        module.decoder.layers.12.mlp.linear_fc2.bias
        module.decoder.layers.11.mlp.linear_fc2.bias
        module.decoder.layers.9.self_attention.linear_proj.weight
        module.decoder.layers.8.self_attention.linear_proj.weight
        module.decoder.layers.7.self_attention.linear_proj.weight
        module.decoder.layers.6.self_attention.linear_proj.weight
        module.decoder.layers.3.self_attention.linear_qkv.bias
        module.decoder.layers.1.mlp.linear_fc2.bias
        module.decoder.layers.18.self_attention.linear_qkv.weight
        module.decoder.final_layernorm.bias
        module.decoder.layers.21.mlp.linear_fc2.weight
        module.decoder.layers.20.mlp.linear_fc2.weight
        module.decoder.layers.19.self_attention.linear_qkv.weight
        module.decoder.layers.17.self_attention.linear_qkv.weight
        module.decoder.layers.16.self_attention.linear_qkv.bias
        module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.8.mlp.linear_fc2.bias
        module.decoder.layers.7.mlp.linear_fc2.bias
        module.decoder.layers.5.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.22.self_attention.linear_proj.weight
        module.decoder.layers.16.mlp.linear_fc1.weight
        module.decoder.layers.15.mlp.linear_fc1.bias
        module.decoder.layers.15.self_attention.linear_qkv.weight
        module.decoder.layers.14.self_attention.linear_qkv.weight
        module.decoder.layers.13.self_attention.linear_qkv.weight
        module.decoder.layers.12.self_attention.linear_qkv.weight
        module.decoder.layers.11.self_attention.linear_qkv.weight
        module.decoder.layers.10.self_attention.linear_qkv.bias
        module.decoder.layers.1.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.0.mlp.linear_fc2.bias
        module.decoder.layers.23.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.16.self_attention.linear_proj.weight
        module.decoder.layers.10.mlp.linear_fc1.weight
        module.decoder.layers.9.mlp.linear_fc1.weight
        module.decoder.layers.8.self_attention.linear_qkv.weight
        module.decoder.layers.7.self_attention.linear_qkv.weight
        module.decoder.layers.6.self_attention.linear_qkv.weight
        module.decoder.layers.5.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.4.self_attention.linear_qkv.weight
        module.decoder.layers.4.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.1.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.1.self_attention.linear_proj.weight
        module.decoder.layers.16.mlp.linear_fc2.bias
        module.decoder.layers.10.self_attention.linear_proj.weight
        module.decoder.layers.5.mlp.linear_fc1.bias
        module.decoder.layers.4.mlp.linear_fc1.bias
        module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.7.self_attention.linear_qkv.bias
        module.decoder.layers.2.mlp.linear_fc1.bias
        module.decoder.layers.22.mlp.linear_fc2.weight
        module.decoder.layers.10.mlp.linear_fc2.bias
        module.decoder.layers.9.mlp.linear_fc2.bias
        module.decoder.layers.9.self_attention.linear_qkv.weight
        module.decoder.layers.5.self_attention.linear_proj.bias
        module.decoder.layers.3.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.1.mlp.linear_fc1.weight
        module.decoder.layers.0.self_attention.linear_proj.bias
        module.decoder.layers.18.mlp.linear_fc2.weight
        module.decoder.layers.19.mlp.linear_fc2.weight
        module.decoder.layers.17.mlp.linear_fc2.weight
        module.decoder.layers.16.self_attention.linear_qkv.weight
        module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.6.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.6.mlp.linear_fc1.weight
        module.decoder.layers.23.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.15.mlp.linear_fc1.weight
        module.decoder.layers.14.mlp.linear_fc2.weight
        module.decoder.layers.13.mlp.linear_fc2.weight
        module.decoder.layers.12.mlp.linear_fc2.weight
        module.decoder.layers.11.mlp.linear_fc2.weight
        module.decoder.layers.10.self_attention.linear_qkv.weight
        module.decoder.layers.9.self_attention.linear_qkv.bias
        module.decoder.layers.3.self_attention.linear_proj.weight
        module.decoder.layers.2.self_attention.linear_proj.bias
        module.decoder.layers.1.self_attention.linear_qkv.weight
        module.decoder.layers.23.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.8.mlp.linear_fc2.weight
        module.decoder.layers.7.mlp.linear_fc2.weight
        module.decoder.layers.6.mlp.linear_fc2.weight
        module.decoder.layers.5.self_attention.linear_qkv.bias
        module.decoder.layers.5.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.21.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.20.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.5.mlp.linear_fc1.weight
        module.decoder.layers.4.mlp.linear_fc1.weight
        module.decoder.layers.3.mlp.linear_fc1.layer_norm_bias
        module.decoder.layers.0.mlp.linear_fc1.weight
        module.decoder.layers.2.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.9.self_attention.linear_qkv.layer_norm_bias
        module.decoder.layers.8.self_attention.linear_qkv.bias
        module.decoder.layers.0.mlp.linear_fc1.layer_norm_bias
INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.00015, min_lr=1e-05, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.01, fp16=True, bf16=False, params_dtype=torch.float16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=False, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b35f3457d90>, config_logger_dir='')
INFO:megatron.core.optimizer_param_scheduler:> learning rate decay style: cosine
WARNING: could not find the metadata file checkpoints/gpt2_345m_dist_mp/latest_checkpointed_iteration.txt
    will not load any checkpoints and will start from random
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
(min, max) time across ranks (ms):
    load-checkpoint ................................: (0.80, 0.81)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-09-13 13:50:36
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      3200
    validation: 480
    test:       160
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(3200, 480, 160), and config=GPTDatasetConfig(random_seed=1234, sequence_length=1024, blend=(['my-gpt2_text_document'], None), blend_per_split=[None, None, None], renormalize_blend_weights=False, split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._GPT2BPETokenizer object at 0x2b35f34574f0>, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True, s3_cache_path=None)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from my-gpt2_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:    Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:    Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:    Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 10000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 10000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:        Load the document index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the sample index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the shuffle index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 3570
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset:        Load the document index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the sample index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the shuffle index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 530
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset:        Load the document index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the sample index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the shuffle index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-09-13 13:50:37
done with setup ...
training ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (1990.13, 1990.64)
    train/valid/test-data-iterators-setup ..........: (54.45, 431.36)
[before the start of training step] datetime: 2024-09-13 13:50:37
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:609: UserWarning: async_grad_allreduce is deprecated, not in use anymore and will be fully removed with 0.10.0. Please use allreduce_dgrad instead.
  warnings.warn(
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:609: UserWarning: async_grad_allreduce is deprecated, not in use anymore and will be fully removed with 0.10.0. Please use allreduce_dgrad instead.
  warnings.warn(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:475: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:475: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:609: UserWarning: async_grad_allreduce is deprecated, not in use anymore and will be fully removed with 0.10.0. Please use allreduce_dgrad instead.
  warnings.warn(
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:609: UserWarning: async_grad_allreduce is deprecated, not in use anymore and will be fully removed with 0.10.0. Please use allreduce_dgrad instead.
  warnings.warn(
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:475: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/test/Megatron-LM/megatron/core/tensor_parallel/layers.py:475: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
 [2024-09-13 13:50:58] iteration       10/     200 | consumed samples:          160 | elapsed time per iteration (ms): 2159.6 | learning rate: 0.000000E+00 | global batch size:    16 | loss scale: 8388608.0 | number of skipped iterations:  10 | number of nan iterations:   0 |
 [2024-09-13 13:51:03] iteration       20/     200 | consumed samples:          320 | elapsed time per iteration (ms): 523.6 | learning rate: 2.343750E-07 | global batch size:    16 | lm loss: 1.095584E+01 | loss scale: 262144.0 | grad norm: 24.684 | number of skipped iterations:   5 | number of nan iterations:   0 |
Number of parameters in transformer layers in billions:  0.30
Number of parameters in embedding layers in billions: 0.05
Total number of parameters in billions: 0.35
Number of parameters in most loaded shard in billions: 0.1769
Theoretical memory footprints: weight and optimizer=3036.11 MB
[Rank 1] (after 20 iterations) memory (MB) | allocated: 3431.298828125 | max allocated: 5090.46728515625 | reserved: 5104.0 | max reserved: 5104.0
[Rank 0] (after 20 iterations) memory (MB) | allocated: 3431.298828125 | max allocated: 5090.46728515625 | reserved: 5104.0 | max reserved: 5104.0
 [2024-09-13 13:51:08] iteration       30/     200 | consumed samples:          480 | elapsed time per iteration (ms): 486.1 | learning rate: 7.031250E-07 | global batch size:    16 | lm loss: 1.066842E+01 | loss scale: 262144.0 | grad norm: 16.948 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 13:51:13] iteration       40/     200 | consumed samples:          640 | elapsed time per iteration (ms): 463.6 | learning rate: 1.171875E-06 | global batch size:    16 | lm loss: 1.002888E+01 | loss scale: 262144.0 | grad norm: 7.755 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 13:51:18] iteration       50/     200 | consumed samples:          800 | elapsed time per iteration (ms): 456.1 | learning rate: 1.640625E-06 | global batch size:    16 | lm loss: 9.578815E+00 | loss scale: 262144.0 | grad norm: 4.169 | number of skipped iterations:   0 | number of nan iterations:   0 |
saving checkpoint at iteration      50 to checkpoints/gpt2_345m_dist_mp in torch_dist format
[rank0]: Traceback (most recent call last):
[rank0]:   File "/scratch/qualis/test/Megatron-LM/pretrain_gpt.py", line 264, in <module>
[rank0]:     pretrain(
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/training.py", line 348, in pretrain
[rank0]:     iteration, num_floating_point_operations_so_far = train(
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/training.py", line 1361, in train
[rank0]:     save_checkpoint_and_time(iteration, model, optimizer,
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/training.py", line 1065, in save_checkpoint_and_time
[rank0]:     save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/checkpointing.py", line 401, in save_checkpoint
[rank0]:     state_dict = generate_state_dict(
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/checkpointing.py", line 613, in generate_state_dict
[rank0]:     state_dict['optimizer'] = (optimizer.sharded_state_dict(state_dict, **(optim_sd_kwargs or {}))
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/core/optimizer/optimizer.py", line 654, in sharded_state_dict
[rank0]:     optim_state_to_sharding_state(
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 123, in optim_state_to_sharding_state
[rank0]:     sharded_state[param_id][state_key] = make_sharded_optimizer_tensor(
[rank0]:   File "/scratch/qualis/test/Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 86, in make_sharded_optimizer_tensor
[rank0]:     tuple(optim_param.shape) == model_param.local_shape
[rank0]: AttributeError: 'NoneType' object has no attribute 'shape'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/scratch/qualis/test/Megatron-LM/pretrain_gpt.py", line 264, in <module>
[rank1]:     pretrain(
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/training.py", line 348, in pretrain
[rank1]:     iteration, num_floating_point_operations_so_far = train(
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/training.py", line 1361, in train
[rank1]:     save_checkpoint_and_time(iteration, model, optimizer,
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/training.py", line 1065, in save_checkpoint_and_time
[rank1]:     save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/checkpointing.py", line 401, in save_checkpoint
[rank1]:     state_dict = generate_state_dict(
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/training/checkpointing.py", line 613, in generate_state_dict
[rank1]:     state_dict['optimizer'] = (optimizer.sharded_state_dict(state_dict, **(optim_sd_kwargs or {}))
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/core/optimizer/optimizer.py", line 654, in sharded_state_dict
[rank1]:     optim_state_to_sharding_state(
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 123, in optim_state_to_sharding_state
[rank1]:     sharded_state[param_id][state_key] = make_sharded_optimizer_tensor(
[rank1]:   File "/scratch/qualis/test/Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 86, in make_sharded_optimizer_tensor
[rank1]:     tuple(optim_param.shape) == model_param.local_shape
[rank1]: AttributeError: 'NoneType' object has no attribute 'shape'
[rank1]:[W913 13:51:18.226165487 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0913 13:51:27.911000 47065146630080 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 52065 closing signal SIGTERM
E0913 13:51:28.128000 47065146630080 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 52066) of binary: /scratch/qualis/miniconda3/envs/megatron/bin/python
Traceback (most recent call last):
  File "/scratch/qualis/miniconda3/envs/megatron/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-13_13:51:27
  host      : gpu36.eth
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 52066)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
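For what it's worth, the tracebacks above fail inside make_sharded_optimizer_tensor (megatron/core/dist_checkpointing/optimizer.py, line 86) on `tuple(optim_param.shape) == model_param.local_shape`, i.e. the optimizer-state tensor handed to that helper is None at checkpoint-save time. Here is a minimal, self-contained sketch of that failure mode with a guard that would surface a clearer error; every class and function name below is an illustrative stand-in, not Megatron-LM's actual API.

```python
# Minimal, self-contained sketch of the failure mode in the traceback above.
# All names are illustrative stand-ins, not Megatron-LM's actual API.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeShardedParam:
    """Stand-in for a sharded model parameter exposing local_shape."""
    local_shape: Tuple[int, ...]


@dataclass
class FakeOptimState:
    """Stand-in for a per-parameter optimizer state tensor (e.g. exp_avg)."""
    shape: Tuple[int, ...]


def make_sharded_optimizer_tensor_sketch(
    model_param: FakeShardedParam,
    optim_param: Optional[FakeOptimState],
) -> FakeOptimState:
    # The real helper checks tuple(optim_param.shape) == model_param.local_shape.
    # When optim_param is None (no optimizer state mapped to that parameter),
    # the attribute access itself is what raises
    # "AttributeError: 'NoneType' object has no attribute 'shape'".
    if optim_param is None:
        raise RuntimeError(
            "optimizer state is missing for a model parameter; "
            "cannot build the sharded optimizer state dict"
        )
    assert tuple(optim_param.shape) == model_param.local_shape
    return optim_param


if __name__ == "__main__":
    param = FakeShardedParam(local_shape=(4, 8))
    # Passing None mimics what happens at iteration 50 above: the mapping from
    # model parameter to optimizer state yields None at checkpoint-save time.
    try:
        make_sharded_optimizer_tensor_sketch(param, None)
    except RuntimeError as err:
        print(f"clearer failure instead of the AttributeError: {err}")
```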

Any comments or suggestions would be appreciated.

hwang2006 commented 2 months ago

Hi, for reference, here is the same command running fine against the previous Megatron-LM checkout.

$  CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 2  pretrain_gpt.py     --tensor-model-parallel-size 2     --pipeline-model-parallel-size 1         --num-layers 24     --hidden-size 1024     --num-attention-heads 16     --seq-length 1024     --max-position-embeddings 1024     --micro-batch-size 4     --global-batch-size 16     --lr 0.00015     --train-iters 200     --lr-decay-iters 320000     --lr-decay-style cosine     --min-lr 1.0e-5     --weight-decay 1e-2     --lr-warmup-fraction .01     --clip-grad 1.0     --fp16 --data-path  my-gpt2_text_document     --vocab-file vocab.json     --merge-file merges.txt     --split 949,50,1 --log-interval 10     --save-interval 50    --eval-interval 100     --eval-iters 10 --distributed-backend nccl  --save checkpoints/gpt2_345m_dist_mp     --load  checkpoints/gpt2_345m_dist_mp --attention-softmax-in-fp32 --sequence-parallel
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779]
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779] *****************************************
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779] *****************************************
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:260: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:271: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:341: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:378: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:260: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:271: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:341: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:378: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
using world size: 2, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
WARNING: Setting args.check_for_nan_in_loss_and_grad to False since dynamic loss scaling is being used
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... True
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. False
  check_weight_hash_across_dp_replicas_interval ... None
  ckpt_fully_parallel_save ........................ False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  create_attention_mask_in_dataloader ............. True
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['my-gpt2_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  ddp_bucket_size ................................. None
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  disable_straggler_on_startup .................... False
  dist_ckpt_format ................................ torch_dist
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_one_logger ............................... False
  encoder_num_layers .............................. 24
  encoder_seq_length .............................. 1024
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 100
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 4096
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 16
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1024
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 64
  lazy_mpu_init ................................... None
  load ............................................ checkpoints/gpt2_345m_dist_mp
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 10
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... False
  log_straggler ................................... False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.00015
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. 0.01
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 1024
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... merges.txt
  micro_batch_size ................................ 4
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_bin_files .................................. True
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.0
  moe_grouped_gemm ................................ False
  moe_input_jitter_eps ............................ None
  moe_per_layer_logging ........................... False
  moe_router_load_balancing_type .................. aux_loss
  moe_router_topk ................................. 2
  moe_token_dispatcher_type ....................... allgather
  moe_token_dropping .............................. False
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 16
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  one_logger_entity ............................... hwinf_dcm
  one_logger_project .............................. e2e-tracking
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.float16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... learned_absolute
  pretrained_checkpoint ........................... None
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  qk_layernorm .................................... False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ checkpoints/gpt2_345m_dist_mp
  save_interval ................................... 50
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 1024
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 949,50,1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  straggler_ctrlr_port ............................ 65535
  straggler_minmax_count .......................... 1
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 2
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. GPT2BPETokenizer
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_overlap_rs_dgrad ........................ False
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 200
  train_samples ................................... None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_dist_ckpt ................................... False
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  use_tp_pp_dp_mapping ............................ False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... vocab.json
  vocab_size ...................................... None
  wandb_exp_name ..................................
  wandb_project ...................................
  wandb_save_dir ..................................
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 2
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 4
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
> initializing torch distributed ...
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory `/scratch/qualis/test/Megatron-LM.bak/megatron/core/datasets'
make: Nothing to be done for `default'.
make: Leaving directory `/scratch/qualis/test/Megatron-LM.bak/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.063 seconds
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 3.407 seconds
[rank1]:[W913 14:10:04.146531972 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank0]:[W913 14:10:04.146585101 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
time to initialize megatron (seconds): 5.393
[after megatron is initialized] datetime: 2024-09-13 14:10:06
building GPT model ...
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 178100224
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with DistributedDataParallelConfig: DistributedDataParallelConfig(grad_reduce_in_fp32=False, overlap_grad_reduce=False, use_distributed_optimizer=False, check_for_nan_in_grad=False, bucket_size=None)
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (178100224 elements):
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.6.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.final_norm.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.embedding.word_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.0.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.2.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.14.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.9.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.7.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.1.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.final_norm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.8.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.12.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.5.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.3.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.23.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.17.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.embedding.position_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.18.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.22.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.21.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.19.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.16.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.15.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.13.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.20.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.11.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.10.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.language_model.encoder.layers.4.layernorm_mlp.layer_norm_bias
INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig: OptimizerConfig(optimizer='adam', lr=0.00015, min_lr=1e-05, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.01, fp16=True, bf16=False, params_dtype=torch.float16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=False, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b8d69ceb820>)
> learning rate decay style: cosine
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 178100224
WARNING: could not find the metadata file checkpoints/gpt2_345m_dist_mp/latest_checkpointed_iteration.txt
    will not load any checkpoints and will start from random
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
(min, max) time across ranks (ms):
    load-checkpoint ................................: (0.49, 0.52)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-09-13 14:10:10
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      3200
    validation: 480
    test:       160
INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = False
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)]
> building train, validation, and test datasets for GPT ...
WARNING:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(3200, 480, 160), and config=GPTDatasetConfig(random_seed=1234, sequence_length=1024, blend=(['my-gpt2_text_document'], None), blend_per_split=[None, None, None], split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._GPT2BPETokenizer object at 0x2b8d69d5dc30>, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from my-gpt2_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:    Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:    Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:    Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 10000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 10000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:        Load the document index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the sample index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the shuffle index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 3570
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset:        Load the document index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the sample index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the shuffle index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 530
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset:        Load the document index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the sample index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:        Load the shuffle index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-09-13 14:10:10
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (4529.59, 4538.06)
    train/valid/test-data-iterators-setup ..........: (34.88, 351.35)
training ...
[before the start of training step] datetime: 2024-09-13 14:10:10
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
 [2024-09-13 14:10:23] iteration       10/     200 | consumed samples:          160 | elapsed time per iteration (ms): 1216.5 | learning rate: 0.000000E+00 | global batch size:    16 | loss scale: 8388608.0 | number of skipped iterations:  10 | number of nan iterations:   0 |
Number of parameters in transformer layers in billions:  0.30
 [2024-09-13 14:10:28] iteration       20/     200 | consumed samples:          320 | elapsed time per iteration (ms): 488.6 | learning rate: 2.343750E-07 | global batch size:    16 | lm loss: 1.105440E+01 | loss scale: 262144.0 | grad norm: 24.377 | number of skipped iterations:   5 | number of nan iterations:   0 |
Number of parameters in embedding layers in billions: 0.05
Total number of parameters in billions: 0.35
Number of parameters in most loaded shard in billions: 0.1769

Theoretical memory footprints: weight and optimizer=3036.11 MB
[Rank 1] (after 20 iterations) memory (MB) | allocated: 3431.298828125 | max allocated: 5484.498046875 | reserved: 5512.0 | max reserved: 5512.0
[Rank 0] (after 20 iterations) memory (MB) | allocated: 3431.298828125 | max allocated: 5484.498046875 | reserved: 5512.0 | max reserved: 5512.0
 [2024-09-13 14:10:32] iteration       30/     200 | consumed samples:          480 | elapsed time per iteration (ms): 438.3 | learning rate: 7.031250E-07 | global batch size:    16 | lm loss: 1.076019E+01 | loss scale: 262144.0 | grad norm: 18.259 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:10:36] iteration       40/     200 | consumed samples:          640 | elapsed time per iteration (ms): 429.6 | learning rate: 1.171875E-06 | global batch size:    16 | lm loss: 1.007059E+01 | loss scale: 262144.0 | grad norm: 8.208 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:10:41] iteration       50/     200 | consumed samples:          800 | elapsed time per iteration (ms): 435.3 | learning rate: 1.640625E-06 | global batch size:    16 | lm loss: 9.590485E+00 | loss scale: 262144.0 | grad norm: 4.108 | number of skipped iterations:   0 | number of nan iterations:   0 |
saving checkpoint at iteration      50 to checkpoints/gpt2_345m_dist_mp in torch format
  successfully saved checkpoint at iteration      50 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
    save-checkpoint ................................: (2592.46, 2592.53)
 [2024-09-13 14:10:47] iteration       60/     200 | consumed samples:          960 | elapsed time per iteration (ms): 433.3 | learning rate: 2.109375E-06 | global batch size:    16 | lm loss: 9.360849E+00 | loss scale: 262144.0 | grad norm: 3.096 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:10:52] iteration       70/     200 | consumed samples:         1120 | elapsed time per iteration (ms): 435.1 | learning rate: 2.578125E-06 | global batch size:    16 | lm loss: 9.242426E+00 | loss scale: 262144.0 | grad norm: 2.803 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:10:56] iteration       80/     200 | consumed samples:         1280 | elapsed time per iteration (ms): 424.8 | learning rate: 3.046875E-06 | global batch size:    16 | lm loss: 9.108851E+00 | loss scale: 262144.0 | grad norm: 3.376 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:01] iteration       90/     200 | consumed samples:         1440 | elapsed time per iteration (ms): 446.8 | learning rate: 3.515625E-06 | global batch size:    16 | lm loss: 8.897541E+00 | loss scale: 262144.0 | grad norm: 2.991 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:05] iteration      100/     200 | consumed samples:         1600 | elapsed time per iteration (ms): 432.7 | learning rate: 3.984375E-06 | global batch size:    16 | lm loss: 8.763078E+00 | loss scale: 262144.0 | grad norm: 2.993 | number of skipped iterations:   0 | number of nan iterations:   0 |
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
  return func(*args, **kwargs)
(min, max) time across ranks (ms):
    evaluate .......................................: (1718.39, 1718.52)
-----------------------------------------------------------------------------------------------
 validation loss at iteration 100 | lm loss value: 8.691130E+00 | lm loss PPL: 5.949900E+03 |
-----------------------------------------------------------------------------------------------
saving checkpoint at iteration     100 to checkpoints/gpt2_345m_dist_mp in torch format
  successfully saved checkpoint at iteration     100 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
    save-checkpoint ................................: (2515.99, 2516.06)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
  handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
 [2024-09-13 14:11:13] iteration      110/     200 | consumed samples:         1760 | elapsed time per iteration (ms): 433.8 | learning rate: 4.453125E-06 | global batch size:    16 | lm loss: 8.633624E+00 | loss scale: 262144.0 | grad norm: 2.537 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:18] iteration      120/     200 | consumed samples:         1920 | elapsed time per iteration (ms): 439.2 | learning rate: 4.921875E-06 | global batch size:    16 | lm loss: 8.542423E+00 | loss scale: 262144.0 | grad norm: 2.307 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:22] iteration      130/     200 | consumed samples:         2080 | elapsed time per iteration (ms): 438.9 | learning rate: 5.390625E-06 | global batch size:    16 | lm loss: 8.467690E+00 | loss scale: 262144.0 | grad norm: 2.970 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:27] iteration      140/     200 | consumed samples:         2240 | elapsed time per iteration (ms): 431.8 | learning rate: 5.859375E-06 | global batch size:    16 | lm loss: 8.388003E+00 | loss scale: 262144.0 | grad norm: 2.117 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:31] iteration      150/     200 | consumed samples:         2400 | elapsed time per iteration (ms): 423.6 | learning rate: 6.328125E-06 | global batch size:    16 | lm loss: 8.318639E+00 | loss scale: 262144.0 | grad norm: 2.781 | number of skipped iterations:   0 | number of nan iterations:   0 |
saving checkpoint at iteration     150 to checkpoints/gpt2_345m_dist_mp in torch format
  successfully saved checkpoint at iteration     150 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
    save-checkpoint ................................: (2453.37, 2453.39)
 [2024-09-13 14:11:37] iteration      160/     200 | consumed samples:         2560 | elapsed time per iteration (ms): 422.0 | learning rate: 6.796875E-06 | global batch size:    16 | lm loss: 8.229613E+00 | loss scale: 262144.0 | grad norm: 1.963 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:42] iteration      170/     200 | consumed samples:         2720 | elapsed time per iteration (ms): 422.5 | learning rate: 7.265625E-06 | global batch size:    16 | lm loss: 8.162241E+00 | loss scale: 262144.0 | grad norm: 2.579 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:46] iteration      180/     200 | consumed samples:         2880 | elapsed time per iteration (ms): 436.7 | learning rate: 7.734375E-06 | global batch size:    16 | lm loss: 8.066425E+00 | loss scale: 262144.0 | grad norm: 2.065 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:50] iteration      190/     200 | consumed samples:         3040 | elapsed time per iteration (ms): 421.9 | learning rate: 8.203125E-06 | global batch size:    16 | lm loss: 8.001675E+00 | loss scale: 262144.0 | grad norm: 2.096 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-09-13 14:11:54] iteration      200/     200 | consumed samples:         3200 | elapsed time per iteration (ms): 422.7 | learning rate: 8.671875E-06 | global batch size:    16 | lm loss: 7.900609E+00 | loss scale: 262144.0 | grad norm: 2.095 | number of skipped iterations:   0 | number of nan iterations:   0 |
(min, max) time across ranks (ms):
    evaluate .......................................: (1599.28, 1599.36)
-----------------------------------------------------------------------------------------------
 validation loss at iteration 200 | lm loss value: 7.888445E+00 | lm loss PPL: 2.666296E+03 |
-----------------------------------------------------------------------------------------------
saving checkpoint at iteration     200 to checkpoints/gpt2_345m_dist_mp in torch format
  successfully saved checkpoint at iteration     200 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
    save-checkpoint ................................: (2607.55, 2607.58)
[after training is done] datetime: 2024-09-13 14:11:59
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
    evaluate .......................................: (1603.27, 1603.32)
-----------------------------------------------------------------------------------------------------------------
 validation loss at iteration 200 on validation set | lm loss value: 7.889119E+00 | lm loss PPL: 2.668093E+03 |
-----------------------------------------------------------------------------------------------------------------
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
    evaluate .......................................: (1639.27, 1639.28)
-----------------------------------------------------------------------------------------------------------
 validation loss at iteration 200 on test set | lm loss value: 7.695062E+00 | lm loss PPL: 2.197469E+03 |
-----------------------------------------------------------------------------------------------------------
[rank1]:[W913 14:12:02.636544709 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank0]:[W913 14:12:02.670127115 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
hwang2006 commented 2 months ago

My simple workaround was checking out an older branch and running it again. It worked! I don't know why it works, though. Any comment would be appreciated.

(megatron) $ git checkout core_r0.5.0
Branch core_r0.5.0 set up to track remote branch core_r0.5.0 from origin.
Switched to a new branch 'core_r0.5.0'
(megatron) $ git branch 
* core_r0.5.0
  main
(megatron) $  CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 2  pretrain_gpt.py     --tensor-model-parallel-size 2     --pipeline-model-parallel-size 1         --num-layers 24     --hidden-size 1024     --num-attention-heads 16     --seq-length 1024     --max-position-embeddings 1024     --micro-batch-size 4     --global-batch-size 16     --lr 0.00015     --train-iters 200     --lr-decay-iters 320000     --lr-decay-style cosine     --min-lr 1.0e-5     --weight-decay 1e-2     --lr-warmup-fraction .01     --clip-grad 1.0     --fp16 --data-path  my-gpt2_text_document     --vocab-file vocab.json     --merge-file merges.txt     --split 949,50,1 --log-interval 10     --save-interval 50    --eval-interval 100     --eval-iters 10 --distributed-backend nccl  --save checkpoints/gpt2_345m_dist_mp     --load  checkpoints/gpt2_345m_dist_mp --attention-softmax-in-fp32 --sequence-parallel
jgcb00 commented 2 months ago

Same issue here. It failed with:

[rank21]: Traceback (most recent call last):
[rank21]:   File "Megatron-LM/pretrain_gpt.py", line 264, in <module>
[rank21]:     pretrain(
[rank21]:   File "Megatron-LM/megatron/training/training.py", line 355, in pretrain
[rank21]:     iteration, num_floating_point_operations_so_far = train(
[rank21]:                                                       ^^^^^^
[rank21]:   File "Megatron-LM/megatron/training/training.py", line 1368, in train
[rank21]:     save_checkpoint_and_time(iteration, model, optimizer,
[rank21]:   File "Megatron-LM/megatron/training/training.py", line 1072, in save_checkpoint_and_time
[rank21]:     save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
[rank21]:   File "Megatron-LM/megatron/training/checkpointing.py", line 401, in save_checkpoint
[rank21]:     state_dict = generate_state_dict(
[rank21]:                  ^^^^^^^^^^^^^^^^^^^^
[rank21]:   File "Megatron-LM/megatron/training/checkpointing.py", line 613, in generate_state_dict
[rank21]:     state_dict['optimizer'] = (optimizer.sharded_state_dict(state_dict, **(optim_sd_kwargs or {}))
[rank21]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank21]:   File "Megatron-LM/megatron/core/optimizer/optimizer.py", line 654, in sharded_state_dict
[rank21]:     optim_state_to_sharding_state(
[rank21]:   File "Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 120, in optim_state_to_sharding_state
[rank21]:     sharded_state[param_id][state_key] = make_sharded_optimizer_tensor(
[rank21]:                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank21]:   File "Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 83, in make_sharded_optimizer_tensor
[rank21]:     tuple(optim_param.shape) == model_param.local_shape
[rank21]:           ^^^^^^^^^^^^^^^^^
[rank21]: AttributeError: 'NoneType' object has no attribute 'shape'

Any help? I can try to fix it, but I would like some insight to get started. I don't want to downgrade, as I want to benchmark against Mamba.
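For anyone who wants to dig in: the check that blows up compares the optimizer state tensor's shape against the model parameter's local shape, so any optimizer state entry that is still None at save time triggers this AttributeError. Below is a hypothetical diagnostic helper (not part of Megatron, just a sketch) that prints which optimizer state entries are None before saving:

# Hypothetical helper (not part of Megatron): print optimizer state entries
# that are still None, which is what makes the `optim_param.shape` access fail.
def find_none_optimizer_state(optimizer):
    for group in optimizer.param_groups:
        for param in group["params"]:
            state = optimizer.state.get(param, {})
            for key, value in state.items():
                if value is None:
                    print(f"state '{key}' is None for a param of shape {tuple(param.shape)}")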

jgcb00 commented 2 months ago

I tried using --use-distributed-optimizer, but it also failed with an error:

[rank21]:   File "Megatron-LM/megatron/core/optimizer/distrib_optimizer.py", line 1159, in sharded_param_state_fs_model_space
[rank21]:     dtype=state_ten.dtype,
[rank21]:           ^^^^^^^^^^^^^^^
[rank21]: AttributeError: 'NoneType' object has no attribute 'dtype'

Looks like the two errors are linked!

jgcb00 commented 2 months ago

I changed the checkpoint format from torch_dist to torch and that seems to do the trick. I haven't tried restarting training from a saved checkpoint yet, but no error is thrown during model saving.

hwang2006 commented 2 months ago

Yes, it seems to work for me as well when I set the --ckpt-format argument to torch explicitly. BTW, the default checkpoint format is torch_dist.

(megatron) $ CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 2 --master_port 12345 pretrain_gpt.py --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 4 --global-batch-size 16 --lr 0.00015 --train-iters 200 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path my-gpt2_text_document --vocab-file vocab.json --merge-file merges.txt --split 949,50,1 --log-interval 10 --save-interval 50 --eval-interval 100 --eval-iters 10 --distributed-backend nccl --save checkpoints/gpt2_345m_dist_mp --load checkpoints/gpt2_345m_dist_mp --attention-softmax-in-fp32 --sequence-parallel --ckpt-format torch
. . .
 [2024-09-23 08:51:58] iteration 50/ 200 | consumed samples: 800 | elapsed time per iteration (ms): 1133.0 | learning rate: 1.640625E-06 | global batch size: 16 | lm loss: 9.557187E+00 | loss scale: 262144.0 | grad norm: 3.954 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 50 to checkpoints/gpt2_345m_dist_mp in torch format
  successfully saved checkpoint from iteration 50 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
    save-checkpoint ................................: (3445.21, 3445.43)
. . .
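In case it helps others confirm which format was actually written, here is a small, hypothetical Python snippet; the paths are assumptions based on the --save directory above, and the iter_XXXXXXX directory naming is what Megatron's legacy torch save appears to use:

# Hypothetical check: list what the latest checkpoint actually wrote to disk.
# Assumes the --save directory used in the command above.
from pathlib import Path

save_dir = Path("checkpoints/gpt2_345m_dist_mp")
latest = (save_dir / "latest_checkpointed_iteration.txt").read_text().strip()
ckpt_dir = save_dir / f"iter_{int(latest):07d}"
for path in sorted(p for p in ckpt_dir.rglob("*") if p.is_file()):
    print(path.relative_to(save_dir), path.stat().st_size, "bytes")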

lifeiteng commented 1 month ago

I tried using --use-distributed-optimizer, but it also failed with an error:

[rank21]:   File "Megatron-LM/megatron/core/optimizer/distrib_optimizer.py", line 1159, in sharded_param_state_fs_model_space
[rank21]:     dtype=state_ten.dtype,
[rank21]:           ^^^^^^^^^^^^^^^
[rank21]: AttributeError: 'NoneType' object has no attribute 'dtype'

Looks like the two errors are linked!

Install a newer TE:

https://github.com/NVIDIA/TransformerEngine/pull/1130

jgcb00 commented 1 month ago

Hi, I tried installing the latest version with:

pip install git+https://github.com/NVIDIA/TransformerEngine.git@main

It might work, but now the training just doesn't start and I have a new error:

TypeError: flash_attn_func() got an unexpected keyword argument 'block_table'

I will wait for a new stable release to try again.

haolin-nju commented 2 weeks ago

Hi, I tried installing the latest version with:

pip install git+https://github.com/NVIDIA/TransformerEngine.git@main

It might work, but now the training just doesn't start and I have a new error:

TypeError: flash_attn_func() got an unexpected keyword argument 'block_table'

I will wait for a new stable release to try again.

Same issue. I think it originates from Transformer Engine (TE). I am able to reproduce the bug with Megatron-Core r0.9.0 and TE v1.10 when training a Mixtral model. I found a PR that resolves the issue, TE#1130, and it is included in TE v1.11. Perhaps you could try upgrading TE to v1.11.
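As a quick sanity check before re-running (assuming transformer_engine exposes __version__, as recent releases do, and that the packaging package is available), you can verify the TE version installed in the training environment:

# Hypothetical sanity check: confirm the installed Transformer Engine version
# is at least 1.11, which is reported to include the fix from TE PR #1130.
import transformer_engine as te
from packaging.version import Version

installed = Version(te.__version__)
if installed < Version("1.11"):
    print(f"TE {installed} found; consider upgrading to v1.11 or newer.")
else:
    print(f"TE {installed} found; the fix should be included.")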

jgcb00 commented 2 weeks ago

Indeed, it works with Transformer Engine v1.11, thanks!