hwang2006 opened this issue 2 months ago
Hi, here is the same code running fine against the previous Megatron-LM:
$ CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 2 pretrain_gpt.py \
    --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 \
    --num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
    --seq-length 1024 --max-position-embeddings 1024 \
    --micro-batch-size 4 --global-batch-size 16 \
    --lr 0.00015 --train-iters 200 --lr-decay-iters 320000 --lr-decay-style cosine \
    --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 \
    --fp16 --data-path my-gpt2_text_document --vocab-file vocab.json --merge-file merges.txt \
    --split 949,50,1 --log-interval 10 --save-interval 50 --eval-interval 100 --eval-iters 10 \
    --distributed-backend nccl \
    --save checkpoints/gpt2_345m_dist_mp --load checkpoints/gpt2_345m_dist_mp \
    --attention-softmax-in-fp32 --sequence-parallel
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779]
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779] *****************************************
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0913 14:09:55.827000 47463259365312 torch/distributed/run.py:779] *****************************************
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:260: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:271: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:341: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:378: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
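These FutureWarnings, printed once per rank, are only deprecation notices from megatron/core/tensor_parallel/layers.py when running under a newer PyTorch; nothing fails here. For reference, a minimal sketch of the decorator migration the warning suggests, assuming PyTorch 2.4+ (the `_Matmul` class is a made-up illustration, not the Megatron code):

```python
import torch

class _Matmul(torch.autograd.Function):
    # Previously spelled @torch.cuda.amp.custom_fwd (now deprecated).
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a @ b

    # Previously spelled @torch.cuda.amp.custom_bwd (now deprecated).
    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad_output):
        a, b = ctx.saved_tensors
        return grad_output @ b.t(), a.t() @ grad_output
```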
using world size: 2, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
WARNING: Setting args.check_for_nan_in_loss_and_grad to False since dynamic loss scaling is being used
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... True
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
check_for_nan_in_loss_and_grad .................. False
check_weight_hash_across_dp_replicas_interval ... None
ckpt_fully_parallel_save ........................ False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
create_attention_mask_in_dataloader ............. True
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... ['my-gpt2_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_bucket_size ................................. None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
delay_grad_reduce ............................... True
delay_param_gather .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_straggler_on_startup .................... False
dist_ckpt_format ................................ torch_dist
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_one_logger ............................... False
encoder_num_layers .............................. 24
encoder_seq_length .............................. 1024
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 100
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_model_parallel_size ...................... 1
ffn_hidden_size ................................. 4096
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 16
gradient_accumulation_fusion .................... True
group_query_attention ........................... False
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 64
lazy_mpu_init ................................... None
load ............................................ checkpoints/gpt2_345m_dist_mp
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. 0.01
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
max_tokens_to_oom ............................... 12000
merge_file ...................................... merges.txt
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_bin_files .................................. True
mock_data ....................................... False
moe_aux_loss_coeff .............................. 0.0
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_per_layer_logging ........................... False
moe_router_load_balancing_type .................. aux_loss
moe_router_topk ................................. 2
moe_token_dispatcher_type ....................... allgather
moe_token_dropping .............................. False
moe_z_loss_coeff ................................ None
nccl_communicator_config_path ................... None
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
norm_epsilon .................................... 1e-05
normalization ................................... LayerNorm
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 1
num_workers ..................................... 2
one_logger_entity ............................... hwinf_dcm
one_logger_project .............................. e2e-tracking
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... learned_absolute
pretrained_checkpoint ........................... None
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
qk_layernorm .................................... False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ checkpoints/gpt2_345m_dist_mp
save_interval ................................... 50
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 1024
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
spec ............................................ None
split ........................................... 949,50,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 2
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. GPT2BPETokenizer
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 200
train_samples ................................... None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_mcore_models ................................ False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
use_tp_pp_dp_mapping ............................ False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... vocab.json
vocab_size ...................................... None
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 2
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 4
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
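Side note on the padded-vocab line: Megatron appears to round the GPT-2 vocab (50257) up to a multiple of make_vocab_size_divisible_by (128) times tensor-model-parallel-size (2) so the embedding splits evenly across TP ranks. A quick check of the arithmetic (`pad_vocab` is just an illustrative helper, not the Megatron function):

```python
# Round the original vocab size up to a multiple of divisible_by * tp_size.
def pad_vocab(orig_vocab_size: int, divisible_by: int = 128, tp_size: int = 2) -> int:
    multiple = divisible_by * tp_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple

padded = pad_vocab(50257)
print(padded, padded - 50257)  # 50432 175, matching the log line above
```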
> initializing torch distributed ...
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory `/scratch/qualis/test/Megatron-LM.bak/megatron/core/datasets'
make: Nothing to be done for `default'.
make: Leaving directory `/scratch/qualis/test/Megatron-LM.bak/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.063 seconds
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 3.407 seconds
[rank1]:[W913 14:10:04.146531972 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank0]:[W913 14:10:04.146585101 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
time to initialize megatron (seconds): 5.393
[after megatron is initialized] datetime: 2024-09-13 14:10:06
building GPT model ...
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
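The FutureWarnings above, repeated many times from Transformer Engine's jit.py, are again only deprecation notices. A tiny sketch of the replacement the warning points at (illustrative only, not TE's code):

```python
import torch

x = torch.randn(8, 8)

# Previously spelled: with torch.cuda.amp.autocast(enabled=False): ...
with torch.amp.autocast("cuda", enabled=False):
    y = x @ x  # autocast stays disabled, so the matmul keeps its original dtype
```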
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 178100224
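For what it's worth, the per-rank parameter count of 178100224 with tensor-model-parallel-size 2 checks out for this 345M-class config. A rough back-of-the-envelope sketch, assuming the classic GPT-2 layout with tied embeddings, the padded vocab above, and replicated biases on the row-parallel layers:

```python
hidden, ffn, layers, seq, padded_vocab, tp = 1024, 4096, 24, 1024, 50432, 2

per_layer = (
    2 * hidden                   # input LayerNorm weight + bias (replicated)
    + 3 * hidden * hidden // tp  # fused QKV weight (column-parallel)
    + 3 * hidden // tp           # fused QKV bias
    + hidden * hidden // tp      # attention output projection weight (row-parallel)
    + hidden                     # attention output projection bias (replicated)
    + 2 * hidden                 # pre-MLP LayerNorm weight + bias
    + ffn * hidden // tp         # MLP fc1 weight (column-parallel)
    + ffn // tp                  # MLP fc1 bias
    + ffn * hidden // tp         # MLP fc2 weight (row-parallel)
    + hidden                     # MLP fc2 bias (replicated)
)

total = (
    padded_vocab * hidden // tp  # word embeddings (vocab-parallel)
    + seq * hidden               # learned absolute position embeddings
    + layers * per_layer
    + 2 * hidden                 # final LayerNorm weight + bias
)
print(total)  # 178100224
```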
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with DistributedDataParallelConfig: DistributedDataParallelConfig(grad_reduce_in_fp32=False, overlap_grad_reduce=False, use_distributed_optimizer=False, check_for_nan_in_grad=False, bucket_size=None)
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (178100224 elements):
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.6.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.final_norm.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.embedding.word_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.0.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.2.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.14.self_attention.layernorm_qkv.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.layernorm_mlp.fc1_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.9.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.7.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.1.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.final_norm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.layernorm_mlp.fc1_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.layernorm_mlp.layer_norm_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.8.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.12.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.self_attention.layernorm_qkv.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.5.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.3.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.23.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.17.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.embedding.position_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.18.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.22.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.21.self_attention.layernorm_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.19.self_attention.proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.16.layernorm_mlp.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.15.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.13.self_attention.proj.bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.20.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.layernorm_mlp.fc2_bias
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.11.layernorm_mlp.fc2_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.10.self_attention.layernorm_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.language_model.encoder.layers.4.layernorm_mlp.layer_norm_bias
INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig: OptimizerConfig(optimizer='adam', lr=0.00015, min_lr=1e-05, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.01, fp16=True, bf16=False, params_dtype=torch.float16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=False, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b8d69ceb820>)
> learning rate decay style: cosine
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 178100224
WARNING: could not find the metadata file checkpoints/gpt2_345m_dist_mp/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
(min, max) time across ranks (ms):
load-checkpoint ................................: (0.49, 0.52)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-09-13 14:10:10
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 3200
validation: 480
test: 160
INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = False
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)]
> building train, validation, and test datasets for GPT ...
WARNING:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(3200, 480, 160), and config=GPTDatasetConfig(random_seed=1234, sequence_length=1024, blend=(['my-gpt2_text_document'], None), blend_per_split=[None, None, None], split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._GPT2BPETokenizer object at 0x2b8d69d5dc30>, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from my-gpt2_text_document.idx
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset: Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 10000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 10000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from a375193ee7bd0ffbfa0d131aff630f11-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 3570
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from e9d835bb86bcaca4cd08b4977794f406-GPTDataset-valid-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 530
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from fceb23beb87bf7aa15e27415203c9636-GPTDataset-test-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-09-13 14:10:10
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (4529.59, 4538.06)
train/valid/test-data-iterators-setup ..........: (34.88, 351.35)
training ...
[before the start of training step] datetime: 2024-09-13 14:10:10
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
handle = torch.distributed._reduce_scatter_base(
[2024-09-13 14:10:23] iteration 10/ 200 | consumed samples: 160 | elapsed time per iteration (ms): 1216.5 | learning rate: 0.000000E+00 | global batch size: 16 | loss scale: 8388608.0 | number of skipped iterations: 10 | number of nan iterations: 0 |
Number of parameters in transformer layers in billions: 0.30
[2024-09-13 14:10:28] iteration 20/ 200 | consumed samples: 320 | elapsed time per iteration (ms): 488.6 | learning rate: 2.343750E-07 | global batch size: 16 | lm loss: 1.105440E+01 | loss scale: 262144.0 | grad norm: 24.377 | number of skipped iterations: 5 | number of nan iterations: 0 |
Number of parameters in embedding layers in billions: 0.05
Total number of parameters in billions: 0.35
Number of parameters in most loaded shard in billions: 0.1769
Theoretical memory footprints: weight and optimizer=3036.11 MB
[Rank 1] (after 20 iterations) memory (MB) | allocated: 3431.298828125 | max allocated: 5484.498046875 | reserved: 5512.0 | max reserved: 5512.0
[Rank 0] (after 20 iterations) memory (MB) | allocated: 3431.298828125 | max allocated: 5484.498046875 | reserved: 5512.0 | max reserved: 5512.0
[2024-09-13 14:10:32] iteration 30/ 200 | consumed samples: 480 | elapsed time per iteration (ms): 438.3 | learning rate: 7.031250E-07 | global batch size: 16 | lm loss: 1.076019E+01 | loss scale: 262144.0 | grad norm: 18.259 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:10:36] iteration 40/ 200 | consumed samples: 640 | elapsed time per iteration (ms): 429.6 | learning rate: 1.171875E-06 | global batch size: 16 | lm loss: 1.007059E+01 | loss scale: 262144.0 | grad norm: 8.208 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:10:41] iteration 50/ 200 | consumed samples: 800 | elapsed time per iteration (ms): 435.3 | learning rate: 1.640625E-06 | global batch size: 16 | lm loss: 9.590485E+00 | loss scale: 262144.0 | grad norm: 4.108 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 50 to checkpoints/gpt2_345m_dist_mp in torch format
successfully saved checkpoint at iteration 50 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
save-checkpoint ................................: (2592.46, 2592.53)
[2024-09-13 14:10:47] iteration 60/ 200 | consumed samples: 960 | elapsed time per iteration (ms): 433.3 | learning rate: 2.109375E-06 | global batch size: 16 | lm loss: 9.360849E+00 | loss scale: 262144.0 | grad norm: 3.096 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:10:52] iteration 70/ 200 | consumed samples: 1120 | elapsed time per iteration (ms): 435.1 | learning rate: 2.578125E-06 | global batch size: 16 | lm loss: 9.242426E+00 | loss scale: 262144.0 | grad norm: 2.803 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:10:56] iteration 80/ 200 | consumed samples: 1280 | elapsed time per iteration (ms): 424.8 | learning rate: 3.046875E-06 | global batch size: 16 | lm loss: 9.108851E+00 | loss scale: 262144.0 | grad norm: 3.376 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:01] iteration 90/ 200 | consumed samples: 1440 | elapsed time per iteration (ms): 446.8 | learning rate: 3.515625E-06 | global batch size: 16 | lm loss: 8.897541E+00 | loss scale: 262144.0 | grad norm: 2.991 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:05] iteration 100/ 200 | consumed samples: 1600 | elapsed time per iteration (ms): 432.7 | learning rate: 3.984375E-06 | global batch size: 16 | lm loss: 8.763078E+00 | loss scale: 262144.0 | grad norm: 2.993 | number of skipped iterations: 0 | number of nan iterations: 0 |
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:111: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:178: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
(min, max) time across ranks (ms):
evaluate .......................................: (1718.39, 1718.52)
-----------------------------------------------------------------------------------------------
validation loss at iteration 100 | lm loss value: 8.691130E+00 | lm loss PPL: 5.949900E+03 |
-----------------------------------------------------------------------------------------------
saving checkpoint at iteration 100 to checkpoints/gpt2_345m_dist_mp in torch format
successfully saved checkpoint at iteration 100 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
save-checkpoint ................................: (2515.99, 2516.06)
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:162: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/scratch/qualis/test/Megatron-LM.bak/megatron/core/tensor_parallel/layers.py:431: FutureWarning: `torch.distributed._reduce_scatter_base` is a private function and will be deprecated. Please use `torch.distributed.reduce_scatter_tensor` instead.
handle = torch.distributed._reduce_scatter_base(
/scratch/qualis/miniconda3/envs/megatron/lib/python3.10/site-packages/transformer_engine/pytorch/jit.py:121: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
[2024-09-13 14:11:13] iteration 110/ 200 | consumed samples: 1760 | elapsed time per iteration (ms): 433.8 | learning rate: 4.453125E-06 | global batch size: 16 | lm loss: 8.633624E+00 | loss scale: 262144.0 | grad norm: 2.537 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:18] iteration 120/ 200 | consumed samples: 1920 | elapsed time per iteration (ms): 439.2 | learning rate: 4.921875E-06 | global batch size: 16 | lm loss: 8.542423E+00 | loss scale: 262144.0 | grad norm: 2.307 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:22] iteration 130/ 200 | consumed samples: 2080 | elapsed time per iteration (ms): 438.9 | learning rate: 5.390625E-06 | global batch size: 16 | lm loss: 8.467690E+00 | loss scale: 262144.0 | grad norm: 2.970 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:27] iteration 140/ 200 | consumed samples: 2240 | elapsed time per iteration (ms): 431.8 | learning rate: 5.859375E-06 | global batch size: 16 | lm loss: 8.388003E+00 | loss scale: 262144.0 | grad norm: 2.117 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:31] iteration 150/ 200 | consumed samples: 2400 | elapsed time per iteration (ms): 423.6 | learning rate: 6.328125E-06 | global batch size: 16 | lm loss: 8.318639E+00 | loss scale: 262144.0 | grad norm: 2.781 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 150 to checkpoints/gpt2_345m_dist_mp in torch format
successfully saved checkpoint at iteration 150 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
save-checkpoint ................................: (2453.37, 2453.39)
[2024-09-13 14:11:37] iteration 160/ 200 | consumed samples: 2560 | elapsed time per iteration (ms): 422.0 | learning rate: 6.796875E-06 | global batch size: 16 | lm loss: 8.229613E+00 | loss scale: 262144.0 | grad norm: 1.963 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:42] iteration 170/ 200 | consumed samples: 2720 | elapsed time per iteration (ms): 422.5 | learning rate: 7.265625E-06 | global batch size: 16 | lm loss: 8.162241E+00 | loss scale: 262144.0 | grad norm: 2.579 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:46] iteration 180/ 200 | consumed samples: 2880 | elapsed time per iteration (ms): 436.7 | learning rate: 7.734375E-06 | global batch size: 16 | lm loss: 8.066425E+00 | loss scale: 262144.0 | grad norm: 2.065 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:50] iteration 190/ 200 | consumed samples: 3040 | elapsed time per iteration (ms): 421.9 | learning rate: 8.203125E-06 | global batch size: 16 | lm loss: 8.001675E+00 | loss scale: 262144.0 | grad norm: 2.096 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-09-13 14:11:54] iteration 200/ 200 | consumed samples: 3200 | elapsed time per iteration (ms): 422.7 | learning rate: 8.671875E-06 | global batch size: 16 | lm loss: 7.900609E+00 | loss scale: 262144.0 | grad norm: 2.095 | number of skipped iterations: 0 | number of nan iterations: 0 |
(min, max) time across ranks (ms):
evaluate .......................................: (1599.28, 1599.36)
-----------------------------------------------------------------------------------------------
validation loss at iteration 200 | lm loss value: 7.888445E+00 | lm loss PPL: 2.666296E+03 |
-----------------------------------------------------------------------------------------------
saving checkpoint at iteration 200 to checkpoints/gpt2_345m_dist_mp in torch format
successfully saved checkpoint at iteration 200 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
save-checkpoint ................................: (2607.55, 2607.58)
[after training is done] datetime: 2024-09-13 14:11:59
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
evaluate .......................................: (1603.27, 1603.32)
-----------------------------------------------------------------------------------------------------------------
validation loss at iteration 200 on validation set | lm loss value: 7.889119E+00 | lm loss PPL: 2.668093E+03 |
-----------------------------------------------------------------------------------------------------------------
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
evaluate .......................................: (1639.27, 1639.28)
-----------------------------------------------------------------------------------------------------------
validation loss at iteration 200 on test set | lm loss value: 7.695062E+00 | lm loss PPL: 2.197469E+03 |
-----------------------------------------------------------------------------------------------------------
[rank1]:[W913 14:12:02.636544709 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank0]:[W913 14:12:02.670127115 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
My simple workaround was to check out an old branch and run the same command again. It worked! I don't know why it works, though. Any comment would be appreciated.
(megatron) $ git checkout core_r0.5.0
Branch core_r0.5.0 set up to track remote branch core_r0.5.0 from origin.
Switched to a new branch 'core_r0.5.0'
(megatron) $ git branch
* core_r0.5.0
main
(megatron) $ CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 2 pretrain_gpt.py --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 4 --global-batch-size 16 --lr 0.00015 --train-iters 200 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path my-gpt2_text_document --vocab-file vocab.json --merge-file merges.txt --split 949,50,1 --log-interval 10 --save-interval 50 --eval-interval 100 --eval-iters 10 --distributed-backend nccl --save checkpoints/gpt2_345m_dist_mp --load checkpoints/gpt2_345m_dist_mp --attention-softmax-in-fp32 --sequence-parallel
Same issue here, it failed with:
[rank21]: Traceback (most recent call last):
[rank21]: File "Megatron-LM/pretrain_gpt.py", line 264, in <module>
[rank21]: pretrain(
[rank21]: File "Megatron-LM/megatron/training/training.py", line 355, in pretrain
[rank21]: iteration, num_floating_point_operations_so_far = train(
[rank21]: ^^^^^^
[rank21]: File "Megatron-LM/megatron/training/training.py", line 1368, in train
[rank21]: save_checkpoint_and_time(iteration, model, optimizer,
[rank21]: File "Megatron-LM/megatron/training/training.py", line 1072, in save_checkpoint_and_time
[rank21]: save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
[rank21]: File "Megatron-LM/megatron/training/checkpointing.py", line 401, in save_checkpoint
[rank21]: state_dict = generate_state_dict(
[rank21]: ^^^^^^^^^^^^^^^^^^^^
[rank21]: File "Megatron-LM/megatron/training/checkpointing.py", line 613, in generate_state_dict
[rank21]: state_dict['optimizer'] = (optimizer.sharded_state_dict(state_dict, **(optim_sd_kwargs or {}))
[rank21]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank21]: File "Megatron-LM/megatron/core/optimizer/optimizer.py", line 654, in sharded_state_dict
[rank21]: optim_state_to_sharding_state(
[rank21]: File "Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 120, in optim_state_to_sharding_state
[rank21]: sharded_state[param_id][state_key] = make_sharded_optimizer_tensor(
[rank21]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank21]: File "Megatron-LM/megatron/core/dist_checkpointing/optimizer.py", line 83, in make_sharded_optimizer_tensor
[rank21]: tuple(optim_param.shape) == model_param.local_shape
[rank21]: ^^^^^^^^^^^^^^^^^
[rank21]: AttributeError: 'NoneType' object has no attribute 'shape'
Any help? I can try to fix it myself, but I would like some insight to get started. I don't want to downgrade, as I want to benchmark against Mamba.
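As a first debugging step, I would probably check which parameters end up with missing optimizer state, since the traceback shows make_sharded_optimizer_tensor receiving None where a state tensor is expected. A rough sketch of such a check (plain PyTorch run against the inner torch.optim optimizer, not the Megatron API; the helper name is mine):
import torch

def report_missing_optimizer_state(optimizer: torch.optim.Optimizer) -> None:
    # Walk every param group and flag parameters whose optimizer state
    # (e.g. exp_avg / exp_avg_sq for Adam) is absent or still None,
    # which is the situation the traceback above points at.
    for group_idx, group in enumerate(optimizer.param_groups):
        for param in group["params"]:
            state = optimizer.state.get(param, {})
            if not state:
                print(f"group {group_idx}: param {tuple(param.shape)} has no state yet")
                continue
            missing = [key for key, value in state.items() if value is None]
            if missing:
                print(f"group {group_idx}: param {tuple(param.shape)} is missing {missing}")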
I tried using --use-distributed-optimizer, but it also failed with an error:
[rank21]: File "Megatron-LM/megatron/core/optimizer/distrib_optimizer.py", line 1159, in sharded_param_state_fs_model_space
[rank21]: dtype=state_ten.dtype,
[rank21]: ^^^^^^^^^^^^^^^
[rank21]: AttributeError: 'NoneType' object has no attribute 'dtype'
Looks like the two errors are linked!
I changed the checkpoint format from torch_dist to torch, and that seems to do the trick. I haven't tried restarting training from a saved checkpoint yet, but no error is thrown during model saving.
Yes, it seemed to work for me as well by setting the command-line argument --ckpt-format to torch explicitly. BTW, the default checkpoint format is torch_dist.
(megatron) $ CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 2 --master_port 12345 pretrain_gpt.py --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 4 --global-batch-size 16 --lr 0.00015 --train-iters 200 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path my-gpt2_text_document --vocab-file vocab.json --merge-file merges.txt --split 949,50,1 --log-interval 10 --save-interval 50 --eval-interval 100 --eval-iters 10 --distributed-backend nccl --save checkpoints/gpt2_345m_dist_mp --load checkpoints/gpt2_345m_dist_mp --attention-softmax-in-fp32 --sequence-parallel --ckpt-format torch
. . .
[2024-09-23 08:51:58] iteration 50/ 200 | consumed samples: 800 | elapsed time per iteration (ms): 1133.0 | learning rate: 1.640625E-06 | global batch size: 16 | lm loss: 9.557187E+00 | loss scale: 262144.0 | grad norm: 3.954 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 50 to checkpoints/gpt2_345m_dist_mp in torch format
successfully saved checkpoint from iteration 50 to checkpoints/gpt2_345m_dist_mp
(min, max) time across ranks (ms):
save-checkpoint ................................: (3445.21, 3445.43)
. . .
I tried using --use-distributed-optimizer, but it also failed with an error:
[rank21]: File "Megatron-LM/megatron/core/optimizer/distrib_optimizer.py", line 1159, in sharded_param_state_fs_model_space
[rank21]: dtype=state_ten.dtype,
[rank21]: ^^^^^^^^^^^^^^^
[rank21]: AttributeError: 'NoneType' object has no attribute 'dtype'
Looks like the two errors are linked!
Install a newer TE (Transformer Engine).
Hi, I tried installing the latest version with:
pip install git+https://github.com/NVIDIA/TransformerEngine.git@main
It might work, but now the training just doesn't start and I get a new error:
TypeError: flash_attn_func() got an unexpected keyword argument 'block_table'
I will wait for a new stable release to try again.
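My guess is that the block_table keyword comes from a mismatch between TE main and the flash-attn version installed in my environment. A quick way to see which flash-attn is actually installed (assuming it is importable as flash_attn) is:
python -c "import flash_attn; print(flash_attn.__version__)"
and then compare that against whatever flash-attn range the TE commit you installed expects.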
Same issue. I think it originates from Transformer Engine (TE). I am able to reproduce the bug with Megatron-Core r0.9.0 and TE v1.10 when training a Mixtral model. I found a PR that resolves the issue, TE#1130, and it is included in TE v1.11. Perhaps you could try upgrading TE to v1.11.
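If it helps, pinning that version should be possible along these lines; the exact tag/branch name and the availability of a PyPI wheel are from memory, so please double-check against the TransformerEngine releases page:
pip install transformer_engine[pytorch]==1.11.0
pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v1.11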
Indeed, it works with Transformer Engine v1.11, thanks.
Hi, it seems that the same code works fine with the Megatron-LM that I git-cloned in April, but with the latest Megatron-LM I get the following error from the pretrain_gpt.py code. It seems that the Megatron core code has been upgraded since April. Note that, for testing purposes, the --save-interval argument is set to 50 below.
Any comment or suggestion would be appreciated.