NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] RoPE embeddings lead to error with distributed training #588

Closed. ia3leonidshad closed this issue 7 months ago.

ia3leonidshad commented 7 months ago

Describe the bug
Using RoPE embeddings leads to an NCCL error when training on 2 or more GPUs. The bug was introduced in this commit: https://github.com/NVIDIA/Megatron-LM/commit/0c2074e2bdfca3a2a1ad5957838e4209e141a93c#diff-a76c01a5dcf342ac5c484ff276e6cd91de4756f1fc17125131e7c8a2badb1fee
When rolling back to just before that commit (and fixing a few import and argument bugs), RoPE works.
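
For reference, rolling back to just before that commit can be done along these lines (a sketch; the import and argument fixes mentioned above still have to be applied manually on top of the older tree):

# check out the parent of the commit that introduced the RoPE regression
git checkout 0c2074e2bdfca3a2a1ad5957838e4209e141a93c^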

To Reproduce

GPT_ARGS="
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --position-embedding-type rope \
    --micro-batch-size 4 \
    --global-batch-size 64 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --data-cache-path /home/lekimov/experiments/index-folder \
"
DISTRIBUTED_ARGS="
    --nproc_per_node 2 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr localhost \
    --master_port 2020
"

Stack trace/logs

(identical traceback printed by both ranks; shown once)

Traceback (most recent call last):
  File "pretrain_gpt.py", line 230, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/home/lekimov/Megatron-LM-latest/megatron/training.py", line 160, in pretrain
    iteration = train(forward_step_func,
  File "/home/lekimov/Megatron-LM-latest/megatron/training.py", line 748, in train
    train_step(forward_step_func,
  File "/home/lekimov/Megatron-LM-latest/megatron/training.py", line 424, in train_step
    losses_reduced = forward_backward_func(
  File "/home/lekimov/Megatron-LM-latest/megatron/core/pipeline_parallel/schedules.py", line 362, in forward_backward_no_pipelining
    config.finalize_model_grads_func([model])
  File "/home/lekimov/Megatron-LM-latest/megatron/core/distributed/finalize_model_grads.py", line 129, in finalize_model_grads
    model_chunk.finish_grad_sync()
  File "/home/lekimov/Megatron-LM-latest/megatron/core/distributed/distributed_data_parallel.py", line 192, in finish_grad_sync
    grad_buffer.finish_grad_sync()
  File "/home/lekimov/Megatron-LM-latest/megatron/core/distributed/grad_buffer.py", line 383, in finish_grad_sync
    bucket.finish_grad_sync()
  File "/home/lekimov/Megatron-LM-latest/megatron/core/distributed/grad_buffer.py", line 123, in finish_grad_sync
    self.start_grad_sync()
  File "/home/lekimov/Megatron-LM-latest/megatron/core/distributed/grad_buffer.py", line 108, in start_grad_sync
    self.communication_handle = torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'

Environment (please complete the following information):

Proposed fix
Might be connected to: https://github.com/NVIDIA/Megatron-LM/issues/560

Additional context
Without rotary embeddings, everything works fine.

wdykas commented 7 months ago

Could you run with NCCL_DEBUG=INFO? We have been able to run RoPE without issues.
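
For reference, one way to capture that output for the same two-GPU launch (a sketch reusing the GPT_ARGS and DISTRIBUTED_ARGS from the report, data arguments omitted; the NCCL_DEBUG_SUBSYS filter is optional):

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py $GPT_ARGS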

ia3leonidshad commented 7 months ago

I cannot reproduce it anymore. I'll close the issue.

980202006 commented 4 months ago

@wdykas I hit the same error. I installed TransformerEngine and it did not work. Then I uninstalled it, and I got this error:

[2024-03-01 00:58:32,151] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 00:58:42,368] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-01 00:58:42,369] [INFO] [runner.py:570:main] cmd = /home/www/anaconda3/envs/cuda11/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMywgNCwgNSwgNiwgOCwgOV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /home/www/models/gpt/megatron_lm/trains/train_scaled_v55.py --tensor-model-parallel-size 2 --sequence-parallel --use-flash-attn --optimizer adam --recompute-activations --num-layers 64 --hidden-size 3072 --num-attention-heads 32 --seq-length 3400 --max-position-embeddings 4000 --micro-batch-size 1 --global-batch-size 120 --lr 0.0001 --train-iters 500000 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --bf16 --data-path $DATA_PATH --vocab-file $VOCAB_FILE --merge-file $MERGE_FILE --split 999,1,1 --log-interval 10 --save-interval 1000 --eval-interval 10000 --eval-iters 1 --tokenizer-type t5 --untie-embeddings-and-output-weights --use-rotary-position-embeddings --swiglu --save /data3/www/checkpoints/dones/10b_4_2 --dataloader-type cyclic --load /data3/www/checkpoints/dones/10b_4_1 --finetune --initial-loss-scale 8192 --tensorboard-queue-size 1 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --tensorboard-dir /data3/www/checkpoints/dones/10b_4_1 --use-distributed-optimizer --spec local
[2024-03-01 00:58:44,665] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 00:58:45,605] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 3, 4, 5, 6, 8, 9]}
[2024-03-01 00:58:45,605] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-03-01 00:58:45,606] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-03-01 00:58:45,606] [INFO] [launch.py:163:main] dist_world_size=8
[2024-03-01 00:58:45,606] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,8,9
no transformer engine
    (previous message repeated 16 times)
[2024-03-01 00:58:49,556] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 00:58:49,557] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 00:58:49,608] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 00:58:49,611] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
############# None
############# None
############# None
############# None
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[2024-03-01 00:58:49,929] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
> setting tensorboard ...
[2024-03-01 00:58:49,991] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[2024-03-01 00:58:50,038] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 00:58:50,116] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
############# None
############# None
############# None
############# None
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
using world size: 8, data-parallel size: 4, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1 
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 4
  data_path ....................................... ['$DATA_PATH']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. cyclic
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_one_logger ............................... False
  encoder_num_layers .............................. 64
  encoder_seq_length .............................. 3400
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 10000
  eval_iters ...................................... 1
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 8192
  finetune ........................................ True
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 120
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 3072
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 8192.0
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 96
  lazy_mpu_init ................................... None
  load ............................................ /data3/www/checkpoints/dones/10b_4_1
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... True
  log_interval .................................... 10
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... True
  log_validation_ppl_to_tensorboard ............... True
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 4000
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... $MERGE_FILE
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.0
  moe_grouped_gemm ................................ False
  moe_input_jitter_eps ............................ None
  moe_router_load_balancing_type .................. aux_loss
  moe_router_topk ................................. 2
  moe_token_dropping .............................. False
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_layers ...................................... 64
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  one_logger_entity ............................... hwinf_dcm
  one_logger_project .............................. e2e-tracking
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... selective
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_verify_neighbor_count ..................... True
  retro_workdir ................................... None
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ /data3/www/checkpoints/dones/10b_4_2
  save_interval ................................... 1000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 3400
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ ['local']
  split ........................................... 999,1,1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 2
  tensorboard_dir ................................. /data3/www/checkpoints/dones/10b_4_1
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. t5
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_cfg ............................. None
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 500000
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. True
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... True
  use_flash_attn .................................. True
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. True
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... $VOCAB_FILE
  vocab_size ...................................... None
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 8
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 30
> building t5 tokenizer ...
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 > padded vocab (size: 32596) with 172 dummy tokens (new size: 32768)
> initializing torch distributed ...
> done: initializing torch distributed ...
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/home/www/models/gpt/megatron_lm/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/www/models/gpt/megatron_lm/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.062 seconds
> compiling and loading fused kernels ...
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.5+cuda11.8
ps-SYS-420GP-TNR:2816651:2816651 [7] NCCL INFO cudaDriverVersion 12020
ps-SYS-420GP-TNR:2816647:2816647 [3] NCCL INFO cudaDriverVersion 12020
ps-SYS-420GP-TNR:2816649:2816649 [5] NCCL INFO cudaDriverVersion 12020
ps-SYS-420GP-TNR:2816645:2816645 [1] NCCL INFO cudaDriverVersion 12020
ps-SYS-420GP-TNR:2816648:2816648 [4] NCCL INFO cudaDriverVersion 12020
ps-SYS-420GP-TNR:2816646:2816646 [2] NCCL INFO cudaDriverVersion 12020
ps-SYS-420GP-TNR:2816650:2816650 [6] NCCL INFO cudaDriverVersion 12020
ps-SYS-420GP-TNR:2816648:2816648 [4] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816649:2816649 [5] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816651:2816651 [7] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816650:2816650 [6] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816645:2816645 [1] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816647:2816647 [3] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816646:2816646 [2] NCCL INFO Bootstrap : Using eno1:192.168.223.11<0>
ps-SYS-420GP-TNR:2816648:2816648 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816648:2816648 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816651:2816651 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816651:2816651 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816649:2816649 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816649:2816649 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816650:2816650 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816650:2816650 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816645:2816645 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816647:2816647 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816645:2816645 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816647:2816647 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816646:2816646 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ps-SYS-420GP-TNR:2816646:2816646 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO NET/IB : No device found.
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO NET/Socket : Using [0]eno1:192.168.223.11<0> [1]usb0:169.254.3.1<0>
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO comm 0xa49d350 rank 6 nranks 8 cudaDev 6 nvmlDev 8 busId d5000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO comm 0xaadd840 rank 5 nranks 8 cudaDev 5 nvmlDev 6 busId d1000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO comm 0xa5829e0 rank 4 nranks 8 cudaDev 4 nvmlDev 5 busId ce000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO comm 0xa019500 rank 3 nranks 8 cudaDev 3 nvmlDev 4 busId 57000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO comm 0xb59fb40 rank 2 nranks 8 cudaDev 2 nvmlDev 3 busId 56000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO comm 0xa0f80d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 52000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO comm 0x9aa9c50 rank 7 nranks 8 cudaDev 7 nvmlDev 9 busId d6000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO comm 0x9a01be0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xcb4c9dccde6a2dc9 - Init START
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Setting affinity for GPU 8 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Setting affinity for GPU 9 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Channel 00 : 7[9] -> 0[0] via SHM/direct/direct
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Channel 00 : 1[1] -> 2[3] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Channel 00 : 3[4] -> 4[5] via SHM/direct/direct
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Channel 01 : 7[9] -> 0[0] via SHM/direct/direct
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Channel 01 : 1[1] -> 2[3] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Channel 01 : 3[4] -> 4[5] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Channel 00 : 5[6] -> 6[8] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Channel 01 : 5[6] -> 6[8] via SHM/direct/direct
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Channel 00/0 : 4[5] -> 5[6] via P2P/IPC
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Channel 00/0 : 2[3] -> 3[4] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Channel 01/0 : 4[5] -> 5[6] via P2P/IPC
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Channel 00/0 : 6[8] -> 7[9] via P2P/IPC
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Channel 01/0 : 2[3] -> 3[4] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Channel 01/0 : 6[8] -> 7[9] via P2P/IPC
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Channel 00 : 4[5] -> 3[4] via SHM/direct/direct
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Channel 01 : 4[5] -> 3[4] via SHM/direct/direct
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Channel 00 : 6[8] -> 5[6] via SHM/direct/direct
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Channel 00/0 : 7[9] -> 6[8] via P2P/IPC
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Channel 01 : 6[8] -> 5[6] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Channel 01/0 : 7[9] -> 6[8] via P2P/IPC
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Channel 00 : 2[3] -> 1[1] via SHM/direct/direct
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Channel 01 : 2[3] -> 1[1] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Channel 00/0 : 5[6] -> 4[5] via P2P/IPC
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Channel 01/0 : 5[6] -> 4[5] via P2P/IPC
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Channel 00/0 : 3[4] -> 2[3] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Channel 01/0 : 3[4] -> 2[3] via P2P/IPC
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816646:2817100 [2] NCCL INFO comm 0xb59fb40 rank 2 nranks 8 cudaDev 2 nvmlDev 3 busId 56000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
ps-SYS-420GP-TNR:2816648:2817097 [4] NCCL INFO comm 0xa5829e0 rank 4 nranks 8 cudaDev 4 nvmlDev 5 busId ce000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
ps-SYS-420GP-TNR:2816644:2817094 [0] NCCL INFO comm 0x9a01be0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
ps-SYS-420GP-TNR:2816650:2817096 [6] NCCL INFO comm 0xa49d350 rank 6 nranks 8 cudaDev 6 nvmlDev 8 busId d5000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
ps-SYS-420GP-TNR:2816651:2817095 [7] NCCL INFO comm 0x9aa9c50 rank 7 nranks 8 cudaDev 7 nvmlDev 9 busId d6000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
ps-SYS-420GP-TNR:2816649:2817098 [5] NCCL INFO comm 0xaadd840 rank 5 nranks 8 cudaDev 5 nvmlDev 6 busId d1000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
ps-SYS-420GP-TNR:2816645:2817101 [1] NCCL INFO comm 0xa0f80d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 52000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
ps-SYS-420GP-TNR:2816647:2817099 [3] NCCL INFO comm 0xa019500 rank 3 nranks 8 cudaDev 3 nvmlDev 4 busId 57000 commId 0xcb4c9dccde6a2dc9 - Init COMPLETE
>>> done with compiling and loading fused kernels. Compilation time: 1.004 seconds
/home/www/models/gpt/megatron_lm/megatron/initialize.py:355: UserWarning: nvfuser integration in TorchScript is deprecated. (Triggered internally at ../torch/csrc/jit/codegen/cuda/interface.cpp:235.)
  output = bias_gelu(bias, input)
    (previous warning repeated 8 times)
time to initialize megatron (seconds): 4.213
[after megatron is initialized] datetime: 2024-03-01 00:58:53 
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3711930496
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3711930496
> learning rate decay style: cosine
 loading checkpoint from /data3/www/checkpoints/dones/10b_4_1 at iteration 1000
could not find arguments in the checkpoint ...
 checkpoint version 3.0
  successfully loaded checkpoint from /data3/www/checkpoints/dones/10b_4_1 at iteration 0
(min, max) time across ranks (ms):
    load-checkpoint ................................: (6060.69, 6060.84)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-03-01 00:59:00 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      60000000
    validation: 6120
    test:       120
[after dataloaders are built] datetime: 2024-03-01 01:00:54 
done with setup ...
training ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (6730.41, 6736.82)
    train/valid/test-data-iterators-setup ..........: (113755.24, 113757.17)
[before the start of training step] datetime: 2024-03-01 01:00:54 
no transformer engine
    (previous message repeated 8 times)
[2024-03-01 01:01:10,512] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 01:01:10,514] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 01:01:10,961] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 01:01:11,052] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
no transformer engine
no transformer engine
no transformer engine
no transformer engine
[2024-03-01 01:01:43,278] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 01:01:43,578] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
no transformer engine
no transformer engine
no transformer engine
no transformer engine
[2024-03-01 01:01:45,594] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-01 01:01:46,044] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
NCCL version 2.18.5+cuda11.8
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO comm 0x1cfc6050 rank 0 nranks 2 cudaDev 6 nvmlDev 8 busId d5000 commId 0x4c855c136d838df5 - Init START
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO comm 0x14a11d30 rank 1 nranks 2 cudaDev 7 nvmlDev 9 busId d6000 commId 0x4c855c136d838df5 - Init START
NCCL version 2.18.5+cuda11.8
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO comm 0x27fb74a0 rank 1 nranks 2 cudaDev 5 nvmlDev 6 busId d1000 commId 0x2850a70a7aa27e6 - Init START
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO comm 0x257c7f50 rank 0 nranks 2 cudaDev 4 nvmlDev 5 busId ce000 commId 0x2850a70a7aa27e6 - Init START
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Setting affinity for GPU 8 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Setting affinity for GPU 9 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 00/04 :    0   1
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 01/04 :    0   1
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 02/04 :    0   1
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 03/04 :    0   1
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Channel 00/0 : 1[9] -> 0[8] via P2P/IPC
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Channel 01/0 : 1[9] -> 0[8] via P2P/IPC
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Channel 02/0 : 1[9] -> 0[8] via P2P/IPC
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 00/04 :    0   1
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 01/04 :    0   1
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 02/04 :    0   1
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 03/04 :    0   1
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Channel 03/0 : 1[9] -> 0[8] via P2P/IPC
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 00/0 : 0[8] -> 1[9] via P2P/IPC
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 01/0 : 0[8] -> 1[9] via P2P/IPC
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 02/0 : 0[8] -> 1[9] via P2P/IPC
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Channel 03/0 : 0[8] -> 1[9] via P2P/IPC
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 00/0 : 0[5] -> 1[6] via P2P/IPC
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Channel 00/0 : 1[6] -> 0[5] via P2P/IPC
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 01/0 : 0[5] -> 1[6] via P2P/IPC
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Channel 01/0 : 1[6] -> 0[5] via P2P/IPC
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Channel 02/0 : 1[6] -> 0[5] via P2P/IPC
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 02/0 : 0[5] -> 1[6] via P2P/IPC
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Channel 03/0 : 1[6] -> 0[5] via P2P/IPC
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Channel 03/0 : 0[5] -> 1[6] via P2P/IPC
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816648:2820990 [4] NCCL INFO comm 0x257c7f50 rank 0 nranks 2 cudaDev 4 nvmlDev 5 busId ce000 commId 0x2850a70a7aa27e6 - Init COMPLETE
ps-SYS-420GP-TNR:2816649:2820993 [5] NCCL INFO comm 0x27fb74a0 rank 1 nranks 2 cudaDev 5 nvmlDev 6 busId d1000 commId 0x2850a70a7aa27e6 - Init COMPLETE
ps-SYS-420GP-TNR:2816650:2820987 [6] NCCL INFO comm 0x1cfc6050 rank 0 nranks 2 cudaDev 6 nvmlDev 8 busId d5000 commId 0x4c855c136d838df5 - Init COMPLETE
ps-SYS-420GP-TNR:2816651:2820988 [7] NCCL INFO comm 0x14a11d30 rank 1 nranks 2 cudaDev 7 nvmlDev 9 busId d6000 commId 0x4c855c136d838df5 - Init COMPLETE
NCCL version 2.18.5+cuda11.8
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO comm 0x275b17e0 rank 1 nranks 2 cudaDev 3 nvmlDev 4 busId 57000 commId 0x140e4b22c793aca5 - Init START
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO comm 0x1ff07320 rank 0 nranks 2 cudaDev 2 nvmlDev 3 busId 56000 commId 0x140e4b22c793aca5 - Init START
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 00/04 :    0   1
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 01/04 :    0   1
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 02/04 :    0   1
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 03/04 :    0   1
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Channel 00/0 : 1[4] -> 0[3] via P2P/IPC
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 00/0 : 0[3] -> 1[4] via P2P/IPC
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Channel 01/0 : 1[4] -> 0[3] via P2P/IPC
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 01/0 : 0[3] -> 1[4] via P2P/IPC
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Channel 02/0 : 1[4] -> 0[3] via P2P/IPC
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 02/0 : 0[3] -> 1[4] via P2P/IPC
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Channel 03/0 : 1[4] -> 0[3] via P2P/IPC
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Channel 03/0 : 0[3] -> 1[4] via P2P/IPC
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816646:2821271 [2] NCCL INFO comm 0x1ff07320 rank 0 nranks 2 cudaDev 2 nvmlDev 3 busId 56000 commId 0x140e4b22c793aca5 - Init COMPLETE
ps-SYS-420GP-TNR:2816647:2821272 [3] NCCL INFO comm 0x275b17e0 rank 1 nranks 2 cudaDev 3 nvmlDev 4 busId 57000 commId 0x140e4b22c793aca5 - Init COMPLETE
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO comm 0x219770f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x5fc1077496981601 - Init START
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO comm 0x247b5c60 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5fc1077496981601 - Init START
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 00/04 :    0   1
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 01/04 :    0   1
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 02/04 :    0   1
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 03/04 :    0   1
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO P2P Chunksize set to 524288
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
ps-SYS-420GP-TNR:2816644:2821493 [0] NCCL INFO comm 0x219770f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x5fc1077496981601 - Init COMPLETE
ps-SYS-420GP-TNR:2816645:2821494 [1] NCCL INFO comm 0x247b5c60 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5fc1077496981601 - Init COMPLETE
NCCL version 2.18.5+cuda11.8
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Using network Socket
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO comm 0x13ba4c30 rank 3 nranks 4 cudaDev 6 nvmlDev 8 busId d5000 commId 0xadc4186e0f686661 - Init START
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO comm 0x1e94b220 rank 2 nranks 4 cudaDev 4 nvmlDev 5 busId ce000 commId 0xadc4186e0f686661 - Init START
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO comm 0x1825ca00 rank 1 nranks 4 cudaDev 2 nvmlDev 3 busId 56000 commId 0xadc4186e0f686661 - Init START
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO comm 0x1aafabb0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4f000 commId 0xadc4186e0f686661 - Init START
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Setting affinity for GPU 8 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO comm 0x237d3820 rank 1 nranks 4 cudaDev 3 nvmlDev 4 busId 57000 commId 0x5d713e669f1b99f1 - Init START
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO comm 0x1ddf8f60 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5d713e669f1b99f1 - Init START
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO comm 0x2647dab0 rank 3 nranks 4 cudaDev 7 nvmlDev 9 busId d6000 commId 0x5d713e669f1b99f1 - Init START
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO comm 0x241e5240 rank 2 nranks 4 cudaDev 5 nvmlDev 6 busId d1000 commId 0x5d713e669f1b99f1 - Init START
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO Channel 00/02 :    0   1   2   3
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO Channel 01/02 :    0   1   2   3
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO Channel 00 : 0[0] -> 1[3] via SHM/direct/direct
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Channel 00 : 2[5] -> 3[8] via SHM/direct/direct
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO Channel 01 : 0[0] -> 1[3] via SHM/direct/direct
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Channel 01 : 2[5] -> 3[8] via SHM/direct/direct
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Channel 00 : 1[3] -> 2[5] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Channel 00 : 3[8] -> 0[0] via SHM/direct/direct
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Channel 01 : 1[3] -> 2[5] via SHM/direct/direct
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Channel 01 : 3[8] -> 0[0] via SHM/direct/direct

ps-SYS-420GP-TNR:2816644:2822074 [0] transport.cc:154 NCCL WARN Cuda failure 'invalid argument'
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO init.cc:1079 -> 1
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO init.cc:1358 -> 1
ps-SYS-420GP-TNR:2816644:2822074 [0] NCCL INFO group.cc:65 -> 1 [Async thread]
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO group.cc:406 -> 1
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO group.cc:96 -> 1
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Channel 00 : 3[8] -> 2[5] via SHM/direct/direct
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Setting affinity for GPU 9 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Channel 01 : 3[8] -> 2[5] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Channel 00 : 1[3] -> 0[0] via SHM/direct/direct
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Channel 00 : 2[5] -> 1[3] via SHM/direct/direct
ps-SYS-420GP-TNR:2816646:2822083 [2] NCCL INFO Channel 01 : 1[3] -> 0[0] via SHM/direct/direct
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Channel 01 : 2[5] -> 1[3] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Channel 00/02 :    0   1   2   3
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Channel 01/02 :    0   1   2   3
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO P2P Chunksize set to 131072
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816650:2822076 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816648:2822075 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Channel 00 : 1[4] -> 2[6] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Channel 01 : 1[4] -> 2[6] via SHM/direct/direct
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Channel 00 : 3[9] -> 0[1] via SHM/direct/direct
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Channel 00 : 0[1] -> 1[4] via SHM/direct/direct
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Channel 01 : 0[1] -> 1[4] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Channel 00 : 2[6] -> 3[9] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Channel 01 : 2[6] -> 3[9] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Channel 00 : 1[4] -> 0[1] via SHM/direct/direct
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Channel 01 : 1[4] -> 0[1] via SHM/direct/direct
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Channel 01 : 3[9] -> 0[1] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Connected all rings
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Connected all rings
Traceback (most recent call last):
  File "/home/www/models/gpt/megatron_lm/trains/train_scaled_v55.py", line 291, in <module>
    pretrain(get_dataset,
  File "/home/www/models/gpt/megatron_lm/megatron/training.py", line 258, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/www/models/gpt/megatron_lm/megatron/training.py", line 970, in train
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Channel 00 : 3[9] -> 2[6] via SHM/direct/direct
    train_step(forward_step_func,
  File "/home/www/models/gpt/megatron_lm/megatron/training.py", line 535, in train_step
    losses_reduced = forward_backward_func(
  File "/home/www/models/gpt/megatron_lm/megatron/core/pipeline_parallel/schedules.py", line 395, in forward_backward_no_pipelining
    config.finalize_model_grads_func([model])
  File "/home/www/models/gpt/megatron_lm/megatron/core/distributed/finalize_model_grads.py", line 129, in finalize_model_grads
    model_chunk.finish_grad_sync()
  File "/home/www/models/gpt/megatron_lm/megatron/core/distributed/distributed_data_parallel.py", line 196, in finish_grad_sync
    grad_buffer.finish_grad_sync()
  File "/home/www/models/gpt/megatron_lm/megatron/core/distributed/grad_buffer.py", line 417, in finish_grad_sync
    bucket.finish_grad_sync()
  File "/home/www/models/gpt/megatron_lm/megatron/core/distributed/grad_buffer.py", line 126, in finish_grad_sync
    self.start_grad_sync()
  File "/home/www/models/gpt/megatron_lm/megatron/core/distributed/grad_buffer.py", line 104, in start_grad_sync
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Channel 01 : 3[9] -> 2[6] via SHM/direct/direct
    self.communication_handle = torch.distributed._reduce_scatter_base(
  File "/home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3408, in _reduce_scatter_base
    return reduce_scatter_tensor(output, input, op, group, async_op)
  File "/home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3375, in reduce_scatter_tensor
    work = group._reduce_scatter_base(output, input, opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
Exception raised from getNCCLComm at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f330c472617 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1adf (0x7f330d9329bf in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::_reduce_scatter_base(at::Tensor&, at::Tensor&, c10d::ReduceScatterOptions const&) + 0x677 (0x7f330d9523d7 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x5671b3b (0x7f3364d11b3b in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x567d424 (0x7f3364d1d424 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x4ca79bb (0x7f33643479bb in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4ca599c (0x7f336434599c in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x19fd7e8 (0x7f336109d7e8 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x5683703 (0x7f3364d23703 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x568b1f7 (0x7f3364d2b1f7 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xc4b8e6 (0x7f33778818e6 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x3f7674 (0x7f337702d674 in /home/www/anaconda3/envs/cuda11/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: /home/www/anaconda3/envs/cuda11/bin/python() [0x4fc697]
frame #13: _PyObject_MakeTpCall + 0x25b (0x4f614b in /home/www/anaconda3/envs/cuda11/bin/python)
frame #14: /home/www/anaconda3/envs/cuda11/bin/python() [0x50819f]
frame #15: _PyEval_EvalFrameDefault + 0x4b26 (0x4f1ac6 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #16: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x2b79 (0x4efb19 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #18: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x31f (0x4ed2bf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #20: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x13b3 (0x4ee353 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #22: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x731 (0x4ed6d1 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #24: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x731 (0x4ed6d1 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #26: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x731 (0x4ed6d1 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #28: /home/www/anaconda3/envs/cuda11/bin/python() [0x507eae]
frame #29: _PyEval_EvalFrameDefault + 0x4b26 (0x4f1ac6 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #30: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x4b26 (0x4f1ac6 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #32: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x13b3 (0x4ee353 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #34: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x31f (0x4ed2bf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #36: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x31f (0x4ed2bf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #38: _PyFunction_Vectorcall + 0x6f (0x4fcadf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x31f (0x4ed2bf in /home/www/anaconda3/envs/cuda11/bin/python)
frame #40: /home/www/anaconda3/envs/cuda11/bin/python() [0x591d92]
frame #41: PyEval_EvalCode + 0x87 (0x591cd7 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #42: /home/www/anaconda3/envs/cuda11/bin/python() [0x5c2967]
frame #43: /home/www/anaconda3/envs/cuda11/bin/python() [0x5bdad0]
frame #44: /home/www/anaconda3/envs/cuda11/bin/python() [0x45956b]
frame #45: _PyRun_SimpleFileObject + 0x19f (0x5b805f in /home/www/anaconda3/envs/cuda11/bin/python)
frame #46: _PyRun_AnyFileObject + 0x43 (0x5b7dc3 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #47: Py_RunMain + 0x38d (0x5b4b7d in /home/www/anaconda3/envs/cuda11/bin/python)
frame #48: Py_BytesMain + 0x39 (0x584e49 in /home/www/anaconda3/envs/cuda11/bin/python)
frame #49: __libc_start_main + 0xf3 (0x7f33a8cc3083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #50: /home/www/anaconda3/envs/cuda11/bin/python() [0x584cfe]

ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Channel 00 : 2[6] -> 1[4] via SHM/direct/direct
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Channel 01 : 2[6] -> 1[4] via SHM/direct/direct
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO Connected all trees
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ps-SYS-420GP-TNR:2816649:2822080 [5] NCCL INFO comm 0x241e5240 rank 2 nranks 4 cudaDev 5 nvmlDev 6 busId d1000 commId 0x5d713e669f1b99f1 - Init COMPLETE
ps-SYS-420GP-TNR:2816651:2822079 [7] NCCL INFO comm 0x2647dab0 rank 3 nranks 4 cudaDev 7 nvmlDev 9 busId d6000 commId 0x5d713e669f1b99f1 - Init COMPLETE
ps-SYS-420GP-TNR:2816645:2822078 [1] NCCL INFO comm 0x1ddf8f60 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5d713e669f1b99f1 - Init COMPLETE
ps-SYS-420GP-TNR:2816647:2822081 [3] NCCL INFO comm 0x237d3820 rank 1 nranks 4 cudaDev 3 nvmlDev 4 busId 57000 commId 0x5d713e669f1b99f1 - Init COMPLETE
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO comm 0x1aafabb0 rank 0 nranks 4 cudaDev 0 busId 4f000 - Abort COMPLETE
ps-SYS-420GP-TNR:2816644:2821495 [0] NCCL INFO [Service thread] Connection closed by localRank 0
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO comm 0x219770f0 rank 0 nranks 2 cudaDev 0 busId 4f000 - Abort COMPLETE
ps-SYS-420GP-TNR:2816644:2817122 [0] NCCL INFO [Service thread] Connection closed by localRank 0
ps-SYS-420GP-TNR:2816644:2816644 [0] NCCL INFO comm 0x9a01be0 rank 0 nranks 8 cudaDev 0 busId 4f000 - Abort COMPLETE
/home/www/anaconda3/envs/cuda11/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[2024-03-01 01:02:57,902] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816644
[2024-03-01 01:02:57,904] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816645
[2024-03-01 01:02:58,225] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816646
/home/www/anaconda3/envs/cuda11/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[2024-03-01 01:02:59,101] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816647
[2024-03-01 01:02:59,379] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816648
/home/www/anaconda3/envs/cuda11/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[2024-03-01 01:03:00,320] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816649
[2024-03-01 01:03:00,640] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816650
/home/www/anaconda3/envs/cuda11/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[2024-03-01 01:03:01,528] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2816651
[2024-03-01 01:03:01,846] [ERROR] [launch.py:321:sigkill_handler] ['/home/www/anaconda3/envs/cuda11/bin/python', '-u', '/home/www/models/gpt/megatron_lm/trains/train_scaled_v55.py', '--local_rank=7', '--tensor-model-parallel-size', '2', '--sequence-parallel', '--use-flash-attn', '--optimizer', 'adam', '--recompute-activations', '--num-layers', '64', '--hidden-size', '3072', '--num-attention-heads', '32', '--seq-length', '3400', '--max-position-embeddings', '4000', '--micro-batch-size', '1', '--global-batch-size', '120', '--lr', '0.0001', '--train-iters', '500000', '--lr-decay-iters', '320000', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--bf16', '--data-path', '$DATA_PATH', '--vocab-file', '$VOCAB_FILE', '--merge-file', '$MERGE_FILE', '--split', '999,1,1', '--log-interval', '10', '--save-interval', '1000', '--eval-interval', '10000', '--eval-iters', '1', '--tokenizer-type', 't5', '--untie-embeddings-and-output-weights', '--use-rotary-position-embeddings', '--swiglu', '--save', '/data3/www/checkpoints/dones/10b_4_2', '--dataloader-type', 'cyclic', '--load', '/data3/www/checkpoints/dones/10b_4_1', '--finetune', '--initial-loss-scale', '8192', '--tensorboard-queue-size', '1', '--log-timers-to-tensorboard', '--log-batch-size-to-tensorboard', '--log-validation-ppl-to-tensorboard', '--tensorboard-dir', '/data3/www/checkpoints/dones/10b_4_1', '--use-distributed-optimizer', '--spec', 'local'] exits with return code = 1
wdykas commented 4 months ago

Can you run with the NCCL debug flag above? Also, are those `no_transformer_engine` prints coming from your code?
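
For anyone following along, enabling that flag usually just means exporting NCCL's standard debug environment variables before relaunching the same command. A minimal sketch (the subsystem list and log file path are placeholders, not part of the original report):

```bash
# Enable verbose NCCL logging before re-running the failing training command.
# NCCL_DEBUG, NCCL_DEBUG_SUBSYS and NCCL_DEBUG_FILE are standard NCCL env vars;
# the subsystem selection and log path below are only examples.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,P2P,SHM
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log   # %h = hostname, %p = PID, one file per process

# then re-launch the same torchrun / pretrain command that reproduced the error
```

Writing the debug output to per-process files keeps it from interleaving with the Python traceback, which makes the failing communicator easier to spot.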