NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] #516

Closed. awsankur closed this issue 1 year ago.

awsankur commented 1 year ago

Describe the bug
Single node (8 A100 GPUs) training with pretrain_gpt.py errors out.

To Reproduce
Steps to reproduce the behavior:

  1. Pull Docker Image: docker pull nvcr.io/nvidia/pytorch:23.05-py3
  2. Run Container: docker run --gpus all --name pytorch-container -d -i -t -v /home/ubuntu/data:/data nvcr.io/nvidia/pytorch:23.05-py3 /bin/bash
  3. Run Training: torchrun --standalone --nnodes=1 --nproc_per_node=8 /workspace/Megatron-LM/pretrain_gpt.py --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 8 --global-batch-size 64 --lr 0.00015 --train-iters 500000 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path /data/fsx/gpt2/my-gpt2_text_document --vocab-file /data/fsx/gpt2/gpt2-vocab.json --merge-file /data/fsx/gpt2/gpt2-merges.txt --split 949,50,1 --log-interval 1 --save-interval 10000 --eval-interval 1000 --eval-iters 40
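(These steps assume Megatron-LM has already been cloned into /workspace/Megatron-LM inside the container, and that the preprocessed GPT-2 dataset, vocab, and merges files referenced by --data-path, --vocab-file, and --merge-file already exist under /data/fsx/gpt2/.)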

Expected behavior
Training runs and logs per-iteration progress, something like:

1:  iteration        3/  286102 | consumed samples:         1536 | elapsed time per iteration (ms): 37143.0 | learning rate: 0.000E+00 | global batch size:   512 | loss scale: 1073741824.0 | number of skipped iterations:   1 | number of nan iterations:   0 |
1:  iteration        4/  286102 | consumed samples:         2048 | elapsed time per iteration (ms): 36836.0 | learning rate: 0.000E+00 | global batch size:   512 | loss scale: 536870912.0 | number of skipped iterations:   1 | number of nan iterations:   0 |
1:  iteration        5/  286102 | consumed samples:         2560 | elapsed time per iteration (ms): 36775.3 | learning rate: 0.000E+00 | global batch size:   512 | loss scale: 268435456.0 | number of skipped iterations:   1 | number of nan iterations:   0 |

Stack trace/logs

============================================================
root@173632d0f02d:/workspace# export CUDA_DEVICE_MAX_CONNECTIONS=1
root@173632d0f02d:/workspace# torchrun --standalone --nnodes=1 --nproc_per_node=8 /workspace/Megatron-LM/pretrain_gpt.py --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 8 --global-batch-size 64 --lr 0.00015 --train-iters 500000 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path /data/fsx/gpt2/my-gpt2_text_document --vocab-file /data/fsx/gpt2/gpt2-vocab.json --merge-file /data/fsx/gpt2/gpt2-merges.txt --split 949,50,1 --log-interval 1 --save-interval 10000 --eval-interval 1000 --eval-iters 40
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 using torch.float16 for parameters ... ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. False adam_beta1 ...................................... 0.9 adam_beta2 ...................................... 0.999 adam_eps ........................................ 1e-08 add_bias_linear ................................. True add_position_embedding .......................... True adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 apply_layernorm_1p .............................. False apply_query_key_layer_scaling ................... True apply_residual_connection_post_layernorm ........ False async_tensor_model_parallel_allreduce ........... True attention_dropout ............................... 0.1 attention_softmax_in_fp32 ....................... False barrier_with_L1_time ............................ True bert_binary_head ................................ True bert_embedder_type .............................. megatron bert_load ....................................... None bf16 ............................................ False bias_dropout_fusion ............................. True bias_gelu_fusion ................................ True biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None check_for_nan_in_loss_and_grad .................. True classes_fraction ................................ 1.0 clip_grad ....................................... 1.0 consumed_train_samples .......................... 0 consumed_valid_samples .......................... 0 data_cache_path ................................. None data_parallel_random_init ....................... False data_parallel_size .............................. 8 data_path ....................................... ['/data/fsx/gpt2/my-gpt2_text_document'] data_per_class_fraction ......................... 1.0 data_sharding ................................... True dataloader_type ................................. single decoder_num_layers .............................. None decoder_seq_length .............................. None dino_bottleneck_size ............................ 256 dino_freeze_last_layer .......................... 1 dino_head_hidden_size ........................... 2048 dino_local_crops_number ......................... 10 dino_local_img_size ............................. 96 dino_norm_last_layer ............................ False dino_teacher_temp ............................... 0.07 dino_warmup_teacher_temp ........................ 0.04 dino_warmup_teacher_temp_epochs ................. 30 distribute_saved_activations .................... False distributed_backend ............................. 
nccl distributed_timeout_minutes ..................... 10 embedding_path .................................. None embedding_weights_in_fp32 ....................... False empty_unused_memory_level ....................... 0 encoder_num_layers .............................. 24 encoder_seq_length .............................. 1024 end_weight_decay ................................ 0.01 eod_mask_loss ................................... False eval_interval ................................... 1000 eval_iters ...................................... 40 evidence_data_path .............................. None exit_duration_in_mins ........................... None exit_interval ................................... None exit_on_missing_checkpoint ...................... False exit_signal_handler ............................. False ffn_hidden_size ................................. 4096 finetune ........................................ False fp16 ............................................ True fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False fp8 ............................................. None fp8_amax_compute_algo ........................... most_recent fp8_amax_history_len ............................ 1 fp8_interval .................................... 1 fp8_margin ...................................... 0 fp8_wgrad ....................................... True global_batch_size ............................... 64 gradient_accumulation_fusion .................... True group_query_attention ........................... False head_lr_mult .................................... 1.0 hidden_dropout .................................. 0.1 hidden_size ..................................... 1024 hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_h ........................................... 224 img_w ........................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 inference_batch_times_seqlen_threshold .......... 512 init_method_std ................................. 0.02 init_method_xavier_uniform ...................... False initial_loss_scale .............................. 4294967296 iter_per_epoch .................................. 1250 kv_channels ..................................... 64 lazy_mpu_init ................................... None load ............................................ None local_rank ...................................... None log_batch_size_to_tensorboard ................... False log_interval .................................... 1 log_learning_rate_to_tensorboard ................ True log_loss_scale_to_tensorboard ................... True log_memory_to_tensorboard ....................... False log_num_zeros_in_grad ........................... False log_params_norm ................................. False log_timers_to_tensorboard ....................... False log_validation_ppl_to_tensorboard ............... False log_world_size_to_tensorboard ................... False loss_scale ...................................... None loss_scale_window ............................... 1000 lr .............................................. 0.00015 lr_decay_iters .................................. 320000 lr_decay_samples ................................ None lr_decay_style .................................. 
cosine lr_warmup_fraction .............................. 0.01 lr_warmup_init .................................. 0.0 lr_warmup_iters ................................. 0 lr_warmup_samples ............................... 0 make_vocab_size_divisible_by .................... 128 mask_factor ..................................... 1.0 mask_prob ....................................... 0.15 mask_type ....................................... random masked_softmax_fusion ........................... True max_position_embeddings ......................... 1024 max_tokens_to_oom ............................... 12000 merge_file ...................................... /data/fsx/gpt2/gpt2-merges.txt micro_batch_size ................................ 8 min_loss_scale .................................. 1.0 min_lr .......................................... 1e-05 mmap_warmup ..................................... False no_load_optim ................................... None no_load_rng ..................................... None no_persist_layer_norm ........................... False no_save_optim ................................... None no_save_rng ..................................... None norm_epsilon .................................... 1e-05 normalization ................................... LayerNorm num_attention_heads ............................. 16 num_channels .................................... 3 num_classes ..................................... 1000 num_experts ..................................... None num_layers ...................................... 24 num_layers_per_virtual_pipeline_stage ........... None num_query_groups ................................ 1 num_workers ..................................... 2 onnx_safe ....................................... None openai_gelu ..................................... False optimizer ....................................... adam output_bert_embeddings .......................... False overlap_grad_reduce ............................. False overlap_p2p_comm ................................ False override_opt_param_scheduler .................... False params_dtype .................................... torch.float16 patch_dim ....................................... 16 perform_initialization .......................... True pipeline_model_parallel_size .................... 1 pipeline_model_parallel_split_rank .............. None position_embedding_type ......................... learned_absolute profile ......................................... False profile_ranks ................................... [0] profile_step_end ................................ 12 profile_step_start .............................. 10 query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None rank ............................................ 0 recompute_granularity ........................... None recompute_method ................................ None recompute_num_layers ............................ None reset_attention_mask ............................ False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... False retriever_seq_length ............................ 256 retro_add_retriever ............................. False retro_cyclic_train_iters ........................ None retro_encoder_attention_dropout ................. 0.1 retro_encoder_hidden_dropout .................... 0.1 retro_encoder_layers ............................ 
2 retro_num_neighbors ............................. 2 retro_num_retrieved_chunks ...................... 2 retro_return_doc_ids ............................ False retro_workdir ................................... None rotary_percent .................................. 1.0 rotary_seq_len_interpolation_factor ............. None sample_rate ..................................... 1.0 save ............................................ None save_interval ................................... 10000 scatter_gather_tensors_in_pipeline .............. True seed ............................................ 1234 seq_length ...................................... 1024 sequence_parallel ............................... False sgd_momentum .................................... 0.9 short_seq_prob .................................. 0.1 skip_train ...................................... False split ........................................... 949,50,1 squared_relu .................................... False standalone_embedding_stage ...................... False start_weight_decay .............................. 0.01 swiglu .......................................... False swin_backbone_type .............................. tiny tensor_model_parallel_size ...................... 1 tensorboard_dir ................................. None tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 test_data_path .................................. None timing_log_level ................................ 0 timing_log_option ............................... minmax titles_data_path ................................ None tokenizer_model ................................. None tokenizer_type .................................. GPT2BPETokenizer train_data_path ................................. None train_iters ..................................... 500000 train_samples ................................... None transformer_impl ................................ local transformer_pipeline_model_parallel_size ........ 1 untie_embeddings_and_output_weights ............. False use_checkpoint_args ............................. False use_checkpoint_opt_param_scheduler .............. False use_cpu_initialization .......................... None use_distributed_optimizer ....................... False use_flash_attn .................................. False use_one_sent_docs ............................... False use_ring_exchange_p2p ........................... False use_rotary_position_embeddings .................. False valid_data_path ................................. None variable_seq_lengths ............................ False virtual_pipeline_model_parallel_size ............ None vision_backbone_type ............................ vit vision_pretraining .............................. False vision_pretraining_type ......................... classify vocab_extra_ids ................................. 0 vocab_file ...................................... /data/fsx/gpt2/gpt2-vocab.json vocab_size ...................................... None weight_decay .................................... 0.01 weight_decay_incr_style ......................... constant world_size ...................................... 8 -------------------- end of arguments --------------------- setting number of micro-batches to constant 1

building GPT2BPETokenizer tokenizer ...
 padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initialized tensor model parallel with size 1
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/workspace/Megatron-LM/megatron/data'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
make: Leaving directory '/workspace/Megatron-LM/megatron/data'

done with dataset index builder. Compilation time: 5.231 seconds compiling and loading fused kernels ... done with compiling and loading fused kernels. Compilation time: 11.328 seconds time to initialize megatron (seconds): 18.750 [after megatron is initialized] datetime: 2023-09-22 21:36:50 building GPT model ... number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296 buckets for gradient all-reduce: params for bucket 1 module.language_model.encoder.layers.21.post_attention_norm.bias module.language_model.encoder.layers.13.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.8.post_attention_norm.weight module.language_model.encoder.layers.3.self_attention.query_key_value.weight module.language_model.encoder.layers.20.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.14.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.9.self_attention.dense.weight module.language_model.encoder.layers.4.self_attention.dense.bias module.language_model.encoder.layers.22.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.19.input_norm.weight module.language_model.encoder.layers.15.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.10.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.17.self_attention.query_key_value.bias module.language_model.encoder.layers.11.post_attention_norm.bias module.language_model.encoder.layers.6.self_attention.dense.bias module.language_model.encoder.layers.1.input_norm.bias module.language_model.embedding.word_embeddings.weight module.language_model.encoder.layers.18.input_norm.weight module.language_model.encoder.layers.23.post_attention_norm.bias module.language_model.encoder.layers.21.self_attention.query_key_value.bias module.language_model.encoder.layers.12.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.7.post_attention_norm.weight module.language_model.encoder.layers.2.self_attention.query_key_value.weight module.language_model.encoder.layers.13.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.8.self_attention.dense.weight module.language_model.encoder.layers.3.self_attention.dense.bias module.language_model.encoder.layers.22.self_attention.query_key_value.weight module.language_model.encoder.layers.14.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.9.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.4.post_attention_norm.weight module.language_model.encoder.layers.16.self_attention.query_key_value.bias module.language_model.encoder.layers.10.post_attention_norm.bias module.language_model.encoder.layers.20.self_attention.dense.weight module.language_model.encoder.layers.17.input_norm.weight module.language_model.encoder.layers.11.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.6.post_attention_norm.weight module.language_model.encoder.layers.1.self_attention.query_key_value.weight module.language_model.encoder.layers.23.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.18.input_norm.bias module.language_model.encoder.layers.12.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.7.self_attention.dense.weight module.language_model.encoder.layers.2.self_attention.dense.bias module.language_model.encoder.layers.0.self_attention.query_key_value.bias module.language_model.encoder.layers.0.input_norm.weight module.language_model.encoder.layers.21.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.13.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.8.mlp.dense_h_to_4h.bias 
module.language_model.encoder.layers.3.post_attention_norm.weight module.language_model.encoder.layers.19.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.15.self_attention.query_key_value.bias module.language_model.encoder.layers.9.post_attention_norm.bias module.language_model.encoder.layers.4.self_attention.dense.weight module.language_model.encoder.layers.0.input_norm.bias module.language_model.encoder.layers.21.self_attention.query_key_value.weight module.language_model.encoder.layers.19.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.16.input_norm.weight module.language_model.encoder.layers.10.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.5.self_attention.dense.bias module.language_model.encoder.layers.21.self_attention.dense.weight module.language_model.encoder.layers.17.input_norm.bias module.language_model.encoder.layers.11.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.6.self_attention.dense.weight module.language_model.encoder.layers.1.self_attention.dense.bias module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.18.self_attention.query_key_value.weight module.language_model.encoder.layers.23.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.12.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.7.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.2.post_attention_norm.weight module.language_model.encoder.layers.14.self_attention.query_key_value.bias module.language_model.encoder.layers.8.post_attention_norm.bias module.language_model.encoder.layers.3.self_attention.dense.weight module.language_model.encoder.layers.22.self_attention.dense.bias module.language_model.encoder.layers.21.input_norm.bias module.language_model.encoder.layers.15.input_norm.weight module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.4.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.20.post_attention_norm.bias module.language_model.encoder.layers.16.input_norm.bias module.language_model.encoder.layers.10.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.5.self_attention.dense.weight module.language_model.encoder.layers.0.self_attention.dense.bias module.language_model.encoder.layers.21.input_norm.weight module.language_model.encoder.layers.17.self_attention.query_key_value.weight module.language_model.encoder.layers.11.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.6.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.1.post_attention_norm.weight module.language_model.encoder.layers.23.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.19.post_attention_norm.bias module.language_model.encoder.layers.18.self_attention.dense.bias module.language_model.encoder.layers.13.self_attention.query_key_value.bias module.language_model.encoder.layers.7.post_attention_norm.bias module.language_model.encoder.layers.2.self_attention.dense.weight module.language_model.encoder.layers.21.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.14.input_norm.weight module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.3.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.15.input_norm.bias module.language_model.encoder.layers.9.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.4.post_attention_norm.bias module.language_model.encoder.layers.16.self_attention.query_key_value.weight 
module.language_model.encoder.layers.10.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.5.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.0.post_attention_norm.weight module.language_model.encoder.layers.22.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.17.self_attention.dense.bias module.language_model.encoder.layers.12.self_attention.query_key_value.bias module.language_model.encoder.layers.6.post_attention_norm.bias module.language_model.encoder.layers.1.self_attention.dense.weight module.language_model.encoder.layers.19.post_attention_norm.weight module.language_model.encoder.layers.18.post_attention_norm.weight module.language_model.encoder.layers.13.input_norm.weight module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.2.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.14.input_norm.bias module.language_model.encoder.layers.8.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.3.post_attention_norm.bias module.language_model.encoder.layers.22.post_attention_norm.weight module.language_model.encoder.layers.15.self_attention.query_key_value.weight module.language_model.encoder.layers.9.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.0.self_attention.query_key_value.weight module.language_model.encoder.layers.5.post_attention_norm.bias module.language_model.encoder.layers.0.self_attention.dense.weight module.language_model.encoder.final_norm.bias module.language_model.encoder.layers.21.self_attention.dense.bias module.language_model.encoder.layers.16.self_attention.dense.bias module.language_model.encoder.layers.11.self_attention.query_key_value.bias module.language_model.embedding.position_embeddings.weight module.language_model.encoder.layers.23.self_attention.query_key_value.bias module.language_model.encoder.layers.17.post_attention_norm.weight module.language_model.encoder.layers.12.input_norm.weight module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.1.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.19.input_norm.bias module.language_model.encoder.layers.18.self_attention.dense.weight module.language_model.encoder.layers.13.input_norm.bias module.language_model.encoder.layers.7.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.2.post_attention_norm.bias module.language_model.encoder.layers.21.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.19.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.14.self_attention.query_key_value.weight module.language_model.encoder.layers.8.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.20.input_norm.bias module.language_model.encoder.layers.15.self_attention.dense.bias module.language_model.encoder.layers.10.self_attention.query_key_value.bias module.language_model.encoder.layers.4.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.16.post_attention_norm.weight module.language_model.encoder.layers.11.input_norm.weight module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.0.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.23.input_norm.weight module.language_model.encoder.layers.17.self_attention.dense.weight module.language_model.encoder.layers.12.input_norm.bias module.language_model.encoder.layers.6.mlp.dense_h_to_4h.weight 
module.language_model.encoder.layers.1.post_attention_norm.bias module.language_model.encoder.layers.20.input_norm.weight module.language_model.encoder.layers.18.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.13.self_attention.query_key_value.weight module.language_model.encoder.layers.7.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.20.self_attention.query_key_value.weight module.language_model.encoder.layers.14.self_attention.dense.bias module.language_model.encoder.layers.9.self_attention.query_key_value.bias module.language_model.encoder.layers.3.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.22.self_attention.dense.weight module.language_model.encoder.layers.15.post_attention_norm.weight module.language_model.encoder.layers.10.input_norm.weight module.language_model.encoder.layers.4.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.16.self_attention.dense.weight module.language_model.encoder.layers.11.input_norm.bias module.language_model.encoder.layers.5.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.0.post_attention_norm.bias module.language_model.encoder.layers.17.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.23.input_norm.bias module.language_model.encoder.layers.12.self_attention.query_key_value.weight module.language_model.encoder.layers.6.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.18.post_attention_norm.bias module.language_model.encoder.layers.13.self_attention.dense.bias module.language_model.encoder.layers.8.self_attention.query_key_value.bias module.language_model.encoder.layers.2.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.22.self_attention.query_key_value.bias module.language_model.encoder.layers.19.self_attention.query_key_value.weight module.language_model.encoder.layers.14.post_attention_norm.weight module.language_model.encoder.layers.9.input_norm.weight module.language_model.encoder.layers.3.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.15.self_attention.dense.weight module.language_model.encoder.layers.10.input_norm.bias module.language_model.encoder.layers.5.self_attention.query_key_value.bias module.language_model.encoder.layers.21.post_attention_norm.weight module.language_model.encoder.layers.16.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.11.self_attention.query_key_value.weight module.language_model.encoder.layers.5.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.0.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.23.self_attention.query_key_value.weight module.language_model.encoder.layers.17.post_attention_norm.bias module.language_model.encoder.layers.12.self_attention.dense.bias module.language_model.encoder.layers.7.self_attention.query_key_value.bias module.language_model.encoder.layers.1.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.18.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.19.self_attention.dense.weight module.language_model.encoder.layers.13.post_attention_norm.weight module.language_model.encoder.layers.8.input_norm.weight module.language_model.encoder.layers.2.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.14.self_attention.dense.weight module.language_model.encoder.layers.9.input_norm.bias module.language_model.encoder.layers.4.self_attention.query_key_value.bias 
module.language_model.encoder.layers.22.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.20.self_attention.query_key_value.bias module.language_model.encoder.layers.15.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.10.self_attention.query_key_value.weight module.language_model.encoder.layers.5.input_norm.weight module.language_model.encoder.layers.16.post_attention_norm.bias module.language_model.encoder.layers.11.self_attention.dense.bias module.language_model.encoder.layers.6.self_attention.query_key_value.bias module.language_model.encoder.layers.23.self_attention.dense.bias module.language_model.encoder.layers.20.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.17.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.12.post_attention_norm.weight module.language_model.encoder.layers.7.input_norm.weight module.language_model.encoder.layers.1.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.18.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.13.self_attention.dense.weight module.language_model.encoder.layers.8.input_norm.bias module.language_model.encoder.layers.3.self_attention.query_key_value.bias module.language_model.encoder.layers.22.input_norm.weight module.language_model.encoder.layers.14.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.9.self_attention.query_key_value.weight module.language_model.encoder.layers.4.input_norm.weight module.language_model.encoder.layers.15.post_attention_norm.bias module.language_model.encoder.layers.10.self_attention.dense.bias module.language_model.encoder.layers.5.input_norm.bias module.language_model.encoder.layers.16.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.11.post_attention_norm.weight module.language_model.encoder.layers.6.input_norm.weight module.language_model.encoder.layers.0.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.23.post_attention_norm.weight module.language_model.encoder.layers.20.post_attention_norm.weight module.language_model.encoder.layers.17.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.12.self_attention.dense.weight module.language_model.encoder.layers.7.input_norm.bias module.language_model.encoder.layers.2.self_attention.query_key_value.bias module.language_model.encoder.layers.21.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.18.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.13.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.8.self_attention.query_key_value.weight module.language_model.encoder.layers.3.input_norm.weight module.language_model.encoder.layers.20.self_attention.dense.bias module.language_model.encoder.layers.14.post_attention_norm.bias module.language_model.encoder.layers.9.self_attention.dense.bias module.language_model.encoder.layers.4.input_norm.bias module.language_model.encoder.layers.22.post_attention_norm.bias module.language_model.encoder.layers.15.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.10.post_attention_norm.weight module.language_model.encoder.layers.5.self_attention.query_key_value.weight module.language_model.encoder.layers.16.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.11.self_attention.dense.weight module.language_model.encoder.layers.6.input_norm.bias module.language_model.encoder.layers.1.self_attention.query_key_value.bias module.language_model.encoder.layers.23.self_attention.dense.weight module.language_model.encoder.layers.17.mlp.dense_4h_to_h.bias 
module.language_model.encoder.layers.12.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.7.self_attention.query_key_value.weight module.language_model.encoder.layers.2.input_norm.weight module.language_model.encoder.layers.3.input_norm.bias module.language_model.encoder.final_norm.weight module.language_model.encoder.layers.20.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.19.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.19.self_attention.query_key_value.bias module.language_model.encoder.layers.13.post_attention_norm.bias module.language_model.encoder.layers.8.self_attention.dense.bias module.language_model.encoder.layers.22.input_norm.bias module.language_model.encoder.layers.19.self_attention.dense.bias module.language_model.encoder.layers.14.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.9.post_attention_norm.weight module.language_model.encoder.layers.4.self_attention.query_key_value.weight module.language_model.encoder.layers.15.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.10.self_attention.dense.weight module.language_model.encoder.layers.5.post_attention_norm.weight module.language_model.encoder.layers.22.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.16.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.11.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.6.self_attention.query_key_value.weight module.language_model.encoder.layers.1.input_norm.weight module.language_model.encoder.layers.23.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.20.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.18.self_attention.query_key_value.bias module.language_model.encoder.layers.12.post_attention_norm.bias module.language_model.encoder.layers.7.self_attention.dense.bias module.language_model.encoder.layers.2.input_norm.bias
total number of elements: 354871296
learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2023-09-22 21:36:50
building train, validation, and test datasets ...
 datasets target sizes (minimum size):
    train:      32000000
    validation: 1282560
    test:       2560
building train, validation, and test datasets for GPT ...
 Single data path provided for train, valid & test
building dataset index ...
    reading sequence lengths...
    reading sequence pointers...
    reading document indices...
    creating np buffer of mmap...
    creating memory view of np buffer...
finished creating indexed dataset in 0.001909 seconds
number of documents: 79000
 dataset split:
    train: document indices in [0, 74971) total of 74971 documents
    validation: document indices in [74971, 78921) total of 3950 documents
    test: document indices in [78921, 79000) total of 79 documents
[173632d0f02d:986 :0:1411] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 1411) ====
 0 0x0000000000042520 sigaction()  ???:0
 1 0x00000000001afbba nss_database_lookup()  ???:0
 2 0x000000000008e1ca ncclGroupEnd()  ???:0
 3 0x000000000007cec7 ncclGroupEnd()  ???:0
 4 0x000000000007f232 ncclGroupEnd()  ???:0
 5 0x0000000000094b43 pthread_condattr_setpshared()  ???:0
 6 0x0000000000125bb4 clone()  ???:0

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 984 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 985 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 987 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 988 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 989 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 990 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 991 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 986) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
/workspace/Megatron-LM/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-22_21:37:03
  host      : 173632d0f02d
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 986)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 986
============================================================
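A note on the failure mode: exit code -7 means the worker was killed by signal 7 (SIGBUS), and the backtrace above lands inside ncclGroupEnd(). A common cause of SIGBUS from NCCL inside a container is an exhausted /dev/shm: Docker's default shared-memory size is 64 MB, and the docker run command in step 2 does not raise it via --shm-size or --ipc=host. A minimal way to check from inside the container (an illustrative sketch, not part of the original logs):

import shutil

# NCCL's intra-node shared-memory transport allocates under /dev/shm;
# report how large that mount actually is in this container.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")

If this reports only ~64 MiB, relaunching the container with --ipc=host or a larger --shm-size is worth trying before digging further.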


deepakn94 commented 1 year ago

Are you seeing these errors within a particular docker container?

awsankur commented 1 year ago

Yes. I am getting these errors in the container nvcr.io/nvidia/pytorch:23.05-py3.

deepakn94 commented 1 year ago

Can you check if you get the same error with nvcr.io/nvidia/pytorch:23.04-py3?

awsankur commented 1 year ago

Just tried it. I get the exact same error.

deepakn94 commented 1 year ago

Interesting, this works for us locally.

I think this is related to your NCCL setup. Are you able to run nccl_tests in the same setup? Or something simple that uses torch.distributed: https://pytorch.org/tutorials/intermediate/dist_tuto.html#setup.
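For example, a minimal all_reduce script along the lines of that tutorial (a sketch; the nccl backend and a torchrun launch are assumed):

import torch
import torch.distributed as dist

def main():
    # torchrun supplies RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # one GPU per rank on a single node
    t = torch.full((1,), float(rank), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    # With 8 ranks, every rank should print 28.0 (0 + 1 + ... + 7).
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with torchrun --standalone --nnodes=1 --nproc_per_node=8 on the same node, this exercises the same NCCL path with none of the Megatron code.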

awsankur commented 1 year ago

I am able to run NCCL tests on my node. Here is the result I get:

root@f218865125ae:/opt/nccl-tests/build# ./all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     66 on f218865125ae device  0 [0x10] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid     66 on f218865125ae device  1 [0x10] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid     66 on f218865125ae device  2 [0x20] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid     66 on f218865125ae device  3 [0x20] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid     66 on f218865125ae device  4 [0x90] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid     66 on f218865125ae device  5 [0x90] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid     66 on f218865125ae device  6 [0xa0] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid     66 on f218865125ae device  7 [0xa0] NVIDIA A100-SXM4-80GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    75.64    0.00    0.00      0    74.20    0.00    0.00      0
          16             4     float     sum      -1    73.47    0.00    0.00      0    73.32    0.00    0.00      0
          32             8     float     sum      -1    73.25    0.00    0.00      0    74.40    0.00    0.00      0
          64            16     float     sum      -1    74.39    0.00    0.00      0    73.35    0.00    0.00      0
         128            32     float     sum      -1    73.84    0.00    0.00      0    73.68    0.00    0.00      0
         256            64     float     sum      -1    74.51    0.00    0.01      0    74.40    0.00    0.01      0
         512           128     float     sum      -1    73.45    0.01    0.01      0    73.21    0.01    0.01      0
        1024           256     float     sum      -1    75.96    0.01    0.02      0    76.13    0.01    0.02      0
        2048           512     float     sum      -1    86.42    0.02    0.04      0    83.34    0.02    0.04      0
        4096          1024     float     sum      -1    93.53    0.04    0.08      0    91.42    0.04    0.08      0
        8192          2048     float     sum      -1    94.95    0.09    0.15      0    94.23    0.09    0.15      0
       16384          4096     float     sum      -1    97.16    0.17    0.30      0    97.85    0.17    0.29      0
       32768          8192     float     sum      -1    111.5    0.29    0.51      0    111.8    0.29    0.51      0
       65536         16384     float     sum      -1    117.1    0.56    0.98      0    116.0    0.56    0.99      0
      131072         32768     float     sum      -1    124.3    1.05    1.84      0    123.7    1.06    1.85      0
      262144         65536     float     sum      -1    126.0    2.08    3.64      0    127.2    2.06    3.61      0
      524288        131072     float     sum      -1    138.8    3.78    6.61      0    132.1    3.97    6.95      0
     1048576        262144     float     sum      -1    141.3    7.42   12.99      0    143.7    7.30   12.77      0
     2097152        524288     float     sum      -1    152.4   13.76   24.09      0    152.6   13.74   24.05      0
     4194304       1048576     float     sum      -1    169.6   24.73   43.27      0    167.8   25.00   43.75      0
     8388608       2097152     float     sum      -1    190.1   44.13   77.23      0    192.5   43.59   76.28      0
    16777216       4194304     float     sum      -1    221.4   75.79  132.63      0    217.2   77.23  135.15      0
    33554432       8388608     float     sum      -1    356.8   94.05  164.59      0    355.7   94.33  165.08      0
    67108864      16777216     float     sum      -1    574.8  116.76  204.32      0    574.5  116.80  204.41      0
   134217728      33554432     float     sum      -1   1158.2  115.89  202.81      0   1154.2  116.29  203.51      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 35.1127
#

I am also able to train another model with DDP using torchrun without any issues. The issue arises only when running Megatron-LM code. Since it works for you locally, how can I help you debug this?

deepakn94 commented 1 year ago

Can you run your Megatron command with NCCL_DEBUG=INFO and send the logfile here?
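For example (illustrative, with the same arguments as before): NCCL_DEBUG=INFO torchrun --standalone --nnodes=1 --nproc_per_node=8 /workspace/Megatron-LM/pretrain_gpt.py ...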

awsankur commented 1 year ago

Here it is:

root@09677202c889:/workspace# torchrun --standalone --nnodes=1 --nproc_per_node=8 /workspace/Megatron-LM/pretrain_gpt.py --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 8 --global-batch-size 64 --lr 0.00015 --train-iters 500000 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path /data/gpt2/my-gpt2_text_document --vocab-file /data/gpt2/gpt2-vocab.json --merge-file /data/gpt2/gpt2-merges.txt --split 949,50,1 --log-interval 1 --save-interval 10000 --eval-interval 1000 --eval-iters 40
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages Zarr-based strategies will not be registered because of missing packages using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 using torch.float16 for parameters ... ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. False adam_beta1 ...................................... 0.9 adam_beta2 ...................................... 0.999 adam_eps ........................................ 1e-08 add_bias_linear ................................. True add_position_embedding .......................... True adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 apply_layernorm_1p .............................. False apply_query_key_layer_scaling ................... True apply_residual_connection_post_layernorm ........ False async_tensor_model_parallel_allreduce ........... True attention_dropout ............................... 0.1 attention_softmax_in_fp32 ....................... False barrier_with_L1_time ............................ True bert_binary_head ................................ True bert_embedder_type .............................. megatron bert_load ....................................... None bf16 ............................................ False bias_dropout_fusion ............................. True bias_gelu_fusion ................................ True biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None check_for_nan_in_loss_and_grad .................. True classes_fraction ................................ 1.0 clip_grad ....................................... 1.0 consumed_train_samples .......................... 0 consumed_valid_samples .......................... 0 data_cache_path ................................. None data_parallel_random_init ....................... False data_parallel_size .............................. 8 data_path ....................................... ['/data/gpt2/my-gpt2_text_document'] data_per_class_fraction ......................... 1.0 data_sharding ................................... True dataloader_type ................................. single decoder_num_layers .............................. None decoder_seq_length .............................. None dino_bottleneck_size ............................ 256 dino_freeze_last_layer .......................... 1 dino_head_hidden_size ........................... 2048 dino_local_crops_number ......................... 10 dino_local_img_size ............................. 96 dino_norm_last_layer ............................ False dino_teacher_temp ............................... 0.07 dino_warmup_teacher_temp ........................ 0.04 dino_warmup_teacher_temp_epochs ................. 30 distribute_saved_activations .................... False distributed_backend ............................. 
nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
embedding_weights_in_fp32 ....................... False
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 24
encoder_seq_length .............................. 1024
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 40
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 4096
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 64
gradient_accumulation_fusion .................... True
group_query_attention ........................... False
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 64
lazy_mpu_init ................................... None
load ............................................ None
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. 0.01
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
max_tokens_to_oom ............................... 12000
merge_file ...................................... /data/gpt2/gpt2-merges.txt
micro_batch_size ................................ 8
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
norm_epsilon .................................... 1e-05
normalization ................................... LayerNorm
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 1
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... learned_absolute
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 1024
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
split ........................................... 949,50,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. GPT2BPETokenizer
train_data_path ................................. None
train_iters ..................................... 500000
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... /data/gpt2/gpt2-vocab.json
vocab_size ...................................... None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 8
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
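That constant follows directly from the flags above: with tensor and pipeline model parallel sizes of 1, all 8 GPUs are data-parallel ranks, so each step needs global_batch_size / (micro_batch_size * data_parallel_size) = 64 / (8 * 8) = 1 micro-batch. A minimal sketch of that arithmetic (illustrative only, not Megatron-LM's actual scheduler code):

```python
# Sketch: reproduce "setting number of micro-batches to constant 1"
# from the launch flags. Values are taken from the argument dump above.
def num_micro_batches(global_batch_size: int, micro_batch_size: int,
                      data_parallel_size: int) -> int:
    samples_per_step = micro_batch_size * data_parallel_size
    assert global_batch_size % samples_per_step == 0, \
        "global batch size must be divisible by micro_batch_size * DP size"
    return global_batch_size // samples_per_step

print(num_micro_batches(64, 8, 8))  # -> 1
```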

building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initialized tensor model parallel with size 1
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/workspace/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/Megatron-LM/megatron/data'
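As a sanity check, both the padded vocab reported here and the "number of parameters ... 354871296" reported later after model build can be reproduced from the arguments. The vocab is rounded up to the next multiple of make_vocab_size_divisible_by * tensor_model_parallel_size = 128 * 1, giving 50304 (47 dummy tokens). A back-of-the-envelope sketch for this specific config (not Megatron-LM code; it assumes tied input/output embeddings, learned absolute position embeddings, and biased linear layers, consistent with the flags above):

```python
import math

# Values from the argument dump above.
h, layers, seq = 1024, 24, 1024
vocab, divisor = 50257, 128 * 1        # make_vocab_size_divisible_by * TP size

padded_vocab = math.ceil(vocab / divisor) * divisor
print(padded_vocab, padded_vocab - vocab)  # 50304 47

# Per transformer layer: QKV projection, attention output projection,
# the two MLP projections (h->4h, 4h->h), their biases, and two LayerNorms.
qkv       = h * 3 * h + 3 * h
attn_out  = h * h + h
mlp_in    = h * 4 * h + 4 * h
mlp_out   = 4 * h * h + h
norms     = 4 * h                      # input_norm + post_attention_norm
per_layer = qkv + attn_out + mlp_in + mlp_out + norms

total = (padded_vocab * h              # word embeddings (tied with output layer)
         + seq * h                     # learned absolute position embeddings
         + layers * per_layer
         + 2 * h)                      # final_norm weight + bias
print(total)                           # 354871296
```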

done with dataset index builder. Compilation time: 0.075 seconds compiling and loading fused kernels ... 09677202c889:5779:5779 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5779:5779 [0] NCCL INFO cudaDriverVersion 12020 NCCL version 2.17.1+cuda12.1 09677202c889:5780:5780 [1] NCCL INFO cudaDriverVersion 12020 09677202c889:5780:5780 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5787:5787 [7] NCCL INFO cudaDriverVersion 12020 09677202c889:5787:5787 [7] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5783:5783 [4] NCCL INFO cudaDriverVersion 12020 09677202c889:5783:5783 [4] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5785:5785 [6] NCCL INFO cudaDriverVersion 12020 09677202c889:5782:5782 [3] NCCL INFO cudaDriverVersion 12020 09677202c889:5785:5785 [6] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5782:5782 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5781:5781 [2] NCCL INFO cudaDriverVersion 12020 09677202c889:5784:5784 [5] NCCL INFO cudaDriverVersion 12020 09677202c889:5781:5781 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5784:5784 [5] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0> 09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 
09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) 09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) 09677202c889:5779:6160 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5779:6160 [0] NCCL INFO P2P plugin IBext 09677202c889:5779:6160 [0] NCCL INFO NET/IB : No device found. 09677202c889:5779:6160 [0] NCCL INFO NET/IB : No device found. 09677202c889:5779:6160 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5779:6160 [0] NCCL INFO Using network Socket 09677202c889:5783:6165 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5783:6165 [4] NCCL INFO P2P plugin IBext 09677202c889:5783:6165 [4] NCCL INFO NET/IB : No device found. 09677202c889:5783:6165 [4] NCCL INFO NET/IB : No device found. 09677202c889:5783:6165 [4] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5783:6165 [4] NCCL INFO Using network Socket 09677202c889:5781:6168 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5781:6168 [2] NCCL INFO P2P plugin IBext 09677202c889:5781:6168 [2] NCCL INFO NET/IB : No device found. 09677202c889:5781:6168 [2] NCCL INFO NET/IB : No device found. 09677202c889:5781:6168 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5781:6168 [2] NCCL INFO Using network Socket 09677202c889:5787:6171 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5787:6171 [7] NCCL INFO P2P plugin IBext 09677202c889:5787:6171 [7] NCCL INFO NET/IB : No device found. 09677202c889:5787:6171 [7] NCCL INFO NET/IB : No device found. 09677202c889:5787:6171 [7] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5787:6171 [7] NCCL INFO Using network Socket 09677202c889:5782:6167 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5782:6167 [3] NCCL INFO P2P plugin IBext 09677202c889:5782:6167 [3] NCCL INFO NET/IB : No device found. 09677202c889:5782:6167 [3] NCCL INFO NET/IB : No device found. 09677202c889:5782:6167 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5782:6167 [3] NCCL INFO Using network Socket 09677202c889:5780:6172 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5780:6172 [1] NCCL INFO P2P plugin IBext 09677202c889:5780:6172 [1] NCCL INFO NET/IB : No device found. 09677202c889:5780:6172 [1] NCCL INFO NET/IB : No device found. 09677202c889:5780:6172 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5780:6172 [1] NCCL INFO Using network Socket 09677202c889:5785:6166 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5785:6166 [6] NCCL INFO P2P plugin IBext 09677202c889:5785:6166 [6] NCCL INFO NET/IB : No device found. 09677202c889:5785:6166 [6] NCCL INFO NET/IB : No device found. 
09677202c889:5785:6166 [6] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5785:6166 [6] NCCL INFO Using network Socket 09677202c889:5784:6174 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 09677202c889:5784:6174 [5] NCCL INFO P2P plugin IBext 09677202c889:5784:6174 [5] NCCL INFO NET/IB : No device found. 09677202c889:5784:6174 [5] NCCL INFO NET/IB : No device found. 09677202c889:5784:6174 [5] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> 09677202c889:5784:6174 [5] NCCL INFO Using network Socket 09677202c889:5787:6171 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000 09677202c889:5784:6174 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000 09677202c889:5783:6165 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000 09677202c889:5782:6167 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff 09677202c889:5780:6172 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff 09677202c889:5781:6168 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff 09677202c889:5779:6160 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff 09677202c889:5785:6166 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000 09677202c889:5779:6160 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 09677202c889:5787:6171 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 09677202c889:5785:6166 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 09677202c889:5779:6160 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 09677202c889:5784:6174 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 09677202c889:5787:6171 [7] NCCL INFO P2P Chunksize set to 524288 09677202c889:5785:6166 [6] NCCL INFO P2P Chunksize set to 524288 09677202c889:5779:6160 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 09677202c889:5780:6172 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 
[18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 09677202c889:5783:6165 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 09677202c889:5781:6168 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 09677202c889:5784:6174 [5] NCCL INFO P2P Chunksize set to 524288 09677202c889:5779:6160 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 09677202c889:5780:6172 [1] NCCL INFO P2P Chunksize set to 524288 09677202c889:5782:6167 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 09677202c889:5783:6165 [4] NCCL INFO P2P Chunksize set to 524288 09677202c889:5781:6168 [2] NCCL INFO P2P Chunksize set to 524288 09677202c889:5779:6160 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 09677202c889:5782:6167 [3] NCCL INFO P2P Chunksize set to 524288 09677202c889:5779:6160 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 09677202c889:5779:6160 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 
1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 09677202c889:5779:6160 [0] NCCL INFO P2P Chunksize set to 524288 09677202c889:6257:6638 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 00/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 01/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 01/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 01/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 01/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 01/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 01/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 02/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 02/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 02/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 02/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 02/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 02/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 02/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 03/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 03/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 03/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 03/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 03/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 03/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 03/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 04/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 04/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 04/0 : 3[201d0] -> 4[901c0] via 
P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 04/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 04/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 04/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 05/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 05/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 05/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 05/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 05/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 05/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 05/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 06/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 06/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 06/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 06/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 06/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 07/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 06/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 07/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 07/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 07/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 07/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 07/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 08/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 08/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 08/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 07/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 08/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 08/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 08/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 09/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 09/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 09/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 09/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 09/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL 
INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 09/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 08/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 10/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 10/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 10/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 10/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 10/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 09/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 10/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 11/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 11/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 11/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 11/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 11/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 10/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 12/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 12/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 12/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 12/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 12/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 11/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 12/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 13/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 13/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 13/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 13/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 13/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 13/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 12/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 14/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 14/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 14/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 14/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 14/0 : 2[201c0] -> 3[201d0] via 
P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 14/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 13/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 15/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 15/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 15/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 15/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 15/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 15/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 14/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 16/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 16/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 16/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 16/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 16/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 16/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 15/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 17/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 17/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 17/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 17/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 17/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 17/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 16/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 18/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 18/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 18/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 18/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 18/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 18/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 17/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 19/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 19/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 19/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 19/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL 
INFO Channel 19/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 19/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 18/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 20/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 20/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 20/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 20/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 19/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 21/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 21/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 21/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 21/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 22/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 20/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 22/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 22/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 23/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 21/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 23/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Connected all rings 09677202c889:6254:6636 [3] NCCL INFO Connected all rings 09677202c889:6251:6632 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6251:6632 [0] NCCL INFO Connected all 
rings 09677202c889:6256:6637 [5] NCCL INFO Connected all rings 09677202c889:6259:6640 [7] NCCL INFO Connected all rings 09677202c889:6259:6640 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Connected all rings 09677202c889:6255:6646 [4] NCCL INFO Connected all rings 09677202c889:6257:6638 [6] NCCL INFO Connected all rings 09677202c889:6259:6640 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 02/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 03/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 04/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 05/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 06/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 07/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 08/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 09/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 10/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 11/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 12/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 13/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 14/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 15/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 16/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 17/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 18/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 19/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 20/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 01/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 01/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 21/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 02/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 02/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 22/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 00/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 03/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 03/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6259:6640 [7] NCCL INFO Channel 23/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 01/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 
[4] NCCL INFO Channel 01/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 04/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 04/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 02/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 01/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 02/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 05/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 05/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 03/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 01/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 03/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 02/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 06/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 06/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 02/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 04/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 03/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 07/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 07/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 03/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 05/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 05/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 08/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 08/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 04/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 04/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 06/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 09/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 09/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 05/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 07/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 05/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 07/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 10/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 10/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 06/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 06/0 : 5[901d0] -> 
4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 08/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 08/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 11/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 11/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 07/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 09/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 07/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 09/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 12/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 12/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 08/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 10/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 08/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 10/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 13/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 13/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 09/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 09/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 11/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 14/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 14/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 10/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 12/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 10/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 12/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 15/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 15/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 11/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 13/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 11/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 13/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 16/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 16/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 12/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 12/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 14/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 14/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 17/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 
09677202c889:6254:6636 [3] NCCL INFO Channel 17/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 13/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 13/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 15/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 15/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 18/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 18/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 14/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 14/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 16/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 16/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 19/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 19/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 17/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 15/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 15/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 17/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 20/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 18/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 16/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 18/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 16/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 21/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 19/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 17/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 19/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 17/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 18/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 20/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 18/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6254:6636 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:6253:6644 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 19/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 
21/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 19/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 20/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 22/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 20/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 21/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6252:6642 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:6255:6646 [4] NCCL INFO Channel 23/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 21/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:6256:6637 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:6257:6638 [6] NCCL INFO Channel 22/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5275:5656 [0] NCCL INFO Connected all trees 09677202c889:5275:5656 [0] NCCL INFO NVLS multicast support is not available on dev 0 09677202c889:5275:5656 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5276:5662 [1] NCCL INFO Connected all trees 09677202c889:5276:5662 [1] NCCL INFO NVLS multicast support is not available on dev 1 09677202c889:5276:5662 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5278:5664 [3] NCCL INFO Connected all trees 09677202c889:5278:5664 [3] NCCL INFO NVLS multicast support is not available on dev 3 09677202c889:5278:5664 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5276:5662 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 09677202c889:5278:5664 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 09677202c889:5275:5656 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 09677202c889:5277:5663 [2] NCCL INFO Connected all trees 09677202c889:5277:5663 [2] NCCL INFO NVLS multicast support is not available on dev 2 09677202c889:5277:5663 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5277:5663 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 09677202c889:5281:5669 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5665 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5283:5670 [7] NCCL INFO Connected all trees 09677202c889:5283:5670 [7] NCCL INFO NVLS multicast support is not available on dev 7 09677202c889:5283:5670 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5283:5670 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 09677202c889:5280:5665 [5] NCCL INFO Connected all trees 09677202c889:5280:5665 [5] NCCL INFO NVLS multicast support is not available on dev 5 09677202c889:5280:5665 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5280:5665 [5] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 09677202c889:5281:5669 [6] NCCL INFO Connected all trees 09677202c889:5281:5669 [6] NCCL INFO NVLS multicast support is not available on dev 6 09677202c889:5281:5669 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5281:5669 [6] NCCL INFO 24 coll channels, 0 nvls channels, 
32 p2p channels, 32 p2p channels per peer 09677202c889:5279:5666 [4] NCCL INFO Connected all trees 09677202c889:5279:5666 [4] NCCL INFO NVLS multicast support is not available on dev 4 09677202c889:5279:5666 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5279:5666 [4] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 09677202c889:5281:5669 [6] NCCL INFO comm 0x83a93c0 rank 6 nranks 8 cudaDev 6 busId a01c0 commId 0x15f48e7ced62c065 - Init COMPLETE 09677202c889:5283:5670 [7] NCCL INFO comm 0x7b3cf40 rank 7 nranks 8 cudaDev 7 busId a01d0 commId 0x15f48e7ced62c065 - Init COMPLETE 09677202c889:5277:5663 [2] NCCL INFO comm 0x8a2f6e0 rank 2 nranks 8 cudaDev 2 busId 201c0 commId 0x15f48e7ced62c065 - Init COMPLETE 09677202c889:5278:5664 [3] NCCL INFO comm 0x9104200 rank 3 nranks 8 cudaDev 3 busId 201d0 commId 0x15f48e7ced62c065 - Init COMPLETE 09677202c889:5276:5662 [1] NCCL INFO comm 0x8a3b120 rank 1 nranks 8 cudaDev 1 busId 101d0 commId 0x15f48e7ced62c065 - Init COMPLETE 09677202c889:5280:5665 [5] NCCL INFO comm 0x853d0a0 rank 5 nranks 8 cudaDev 5 busId 901d0 commId 0x15f48e7ced62c065 - Init COMPLETE 09677202c889:5279:5666 [4] NCCL INFO comm 0x928b5a0 rank 4 nranks 8 cudaDev 4 busId 901c0 commId 0x15f48e7ced62c065 - Init COMPLETE 09677202c889:5275:5656 [0] NCCL INFO comm 0x8dee0a0 rank 0 nranks 8 cudaDev 0 busId 101c0 commId 0x15f48e7ced62c065 - Init COMPLETE done with compiling and loading fused kernels. Compilation time: 6.855 seconds time to initialize megatron (seconds): 10.024 [after megatron is initialized] datetime: 2023-09-24 17:31:28 building GPT model ... number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296 buckets for gradient all-reduce: params for bucket 1 module.language_model.encoder.layers.22.self_attention.query_key_value.bias module.language_model.encoder.layers.17.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.13.post_attention_norm.bias module.language_model.encoder.layers.9.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.5.self_attention.dense.bias module.language_model.encoder.layers.0.input_norm.weight module.language_model.encoder.layers.1.self_attention.query_key_value.weight module.language_model.encoder.layers.23.input_norm.weight module.language_model.encoder.layers.18.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.14.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.10.post_attention_norm.weight module.language_model.encoder.layers.6.self_attention.dense.weight module.language_model.encoder.layers.2.self_attention.query_key_value.weight module.language_model.encoder.layers.0.post_attention_norm.weight module.language_model.encoder.final_norm.bias module.language_model.encoder.layers.20.self_attention.query_key_value.bias module.language_model.encoder.layers.15.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.11.post_attention_norm.bias module.language_model.encoder.layers.7.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.3.self_attention.dense.bias module.language_model.encoder.layers.0.self_attention.dense.bias module.language_model.encoder.layers.21.self_attention.query_key_value.bias module.language_model.encoder.layers.16.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.12.post_attention_norm.bias module.language_model.encoder.layers.8.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.4.self_attention.dense.bias module.language_model.encoder.layers.0.mlp.dense_4h_to_h.bias 
module.language_model.encoder.layers.22.input_norm.weight module.language_model.encoder.layers.17.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.13.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.9.post_attention_norm.weight module.language_model.encoder.layers.5.self_attention.dense.weight module.language_model.encoder.layers.1.self_attention.dense.bias module.language_model.encoder.layers.23.input_norm.bias module.language_model.encoder.layers.19.self_attention.query_key_value.bias module.language_model.encoder.layers.14.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.10.post_attention_norm.bias module.language_model.encoder.layers.6.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.2.self_attention.dense.bias module.language_model.encoder.layers.0.self_attention.query_key_value.weight module.language_model.encoder.layers.20.input_norm.weight module.language_model.encoder.layers.15.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.11.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.7.post_attention_norm.weight module.language_model.encoder.layers.3.self_attention.dense.weight module.language_model.encoder.layers.21.input_norm.weight module.language_model.encoder.layers.16.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.12.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.8.post_attention_norm.weight module.language_model.encoder.layers.4.self_attention.dense.weight module.language_model.encoder.layers.22.input_norm.bias module.language_model.encoder.layers.18.self_attention.query_key_value.bias module.language_model.encoder.layers.13.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.9.post_attention_norm.bias module.language_model.encoder.layers.5.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.0.mlp.dense_h_to_4h.weight module.language_model.embedding.word_embeddings.weight module.language_model.encoder.layers.23.self_attention.query_key_value.weight module.language_model.encoder.layers.19.input_norm.weight module.language_model.encoder.layers.14.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.10.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.6.post_attention_norm.weight module.language_model.encoder.layers.2.self_attention.dense.weight module.language_model.encoder.layers.0.input_norm.bias module.language_model.encoder.layers.20.input_norm.bias module.language_model.encoder.layers.16.self_attention.query_key_value.bias module.language_model.encoder.layers.11.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.7.post_attention_norm.bias module.language_model.encoder.layers.3.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.21.input_norm.bias module.language_model.encoder.layers.17.self_attention.query_key_value.bias module.language_model.encoder.layers.12.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.8.post_attention_norm.bias module.language_model.encoder.layers.4.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.1.self_attention.dense.weight module.language_model.encoder.layers.22.self_attention.query_key_value.weight module.language_model.encoder.layers.18.input_norm.weight module.language_model.encoder.layers.13.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.9.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.5.post_attention_norm.weight module.language_model.encoder.layers.0.self_attention.query_key_value.bias 
module.language_model.encoder.layers.23.self_attention.dense.bias module.language_model.encoder.layers.19.input_norm.bias module.language_model.encoder.layers.15.self_attention.query_key_value.bias module.language_model.encoder.layers.10.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.6.post_attention_norm.bias module.language_model.encoder.layers.2.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.1.input_norm.weight module.language_model.encoder.layers.20.self_attention.query_key_value.weight module.language_model.encoder.layers.16.input_norm.weight module.language_model.encoder.layers.11.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.7.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.3.post_attention_norm.weight module.language_model.encoder.layers.21.self_attention.query_key_value.weight module.language_model.encoder.layers.17.input_norm.weight module.language_model.encoder.layers.12.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.8.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.4.post_attention_norm.weight module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.22.self_attention.dense.bias module.language_model.encoder.layers.18.input_norm.bias module.language_model.encoder.layers.14.self_attention.query_key_value.bias module.language_model.encoder.layers.9.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.5.post_attention_norm.bias module.language_model.encoder.layers.1.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.23.self_attention.dense.weight module.language_model.encoder.layers.19.self_attention.query_key_value.weight module.language_model.encoder.layers.15.input_norm.weight module.language_model.encoder.layers.10.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.6.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.2.post_attention_norm.weight module.language_model.encoder.layers.20.self_attention.dense.bias module.language_model.encoder.layers.16.input_norm.bias module.language_model.encoder.layers.12.self_attention.query_key_value.bias module.language_model.encoder.layers.7.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.3.post_attention_norm.bias module.language_model.encoder.layers.0.post_attention_norm.bias module.language_model.encoder.layers.21.self_attention.dense.bias module.language_model.encoder.layers.17.input_norm.bias module.language_model.encoder.layers.13.self_attention.query_key_value.bias module.language_model.encoder.layers.8.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.4.post_attention_norm.bias module.language_model.encoder.layers.1.post_attention_norm.weight module.language_model.encoder.layers.22.self_attention.dense.weight module.language_model.encoder.layers.18.self_attention.query_key_value.weight module.language_model.encoder.layers.14.input_norm.weight module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.5.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.23.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.19.self_attention.dense.bias module.language_model.encoder.layers.15.input_norm.bias module.language_model.encoder.layers.11.self_attention.query_key_value.bias module.language_model.encoder.layers.6.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.2.post_attention_norm.bias module.language_model.encoder.layers.0.self_attention.dense.weight 
module.language_model.encoder.layers.20.self_attention.dense.weight module.language_model.encoder.layers.16.self_attention.query_key_value.weight module.language_model.encoder.layers.12.input_norm.weight module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.3.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.21.self_attention.dense.weight module.language_model.encoder.layers.17.self_attention.query_key_value.weight module.language_model.encoder.layers.13.input_norm.weight module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.4.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.22.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.18.self_attention.dense.bias module.language_model.encoder.layers.14.input_norm.bias module.language_model.encoder.layers.10.self_attention.query_key_value.bias module.language_model.encoder.layers.5.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.1.post_attention_norm.bias module.language_model.encoder.layers.23.post_attention_norm.weight module.language_model.encoder.layers.19.self_attention.dense.weight module.language_model.encoder.layers.15.self_attention.query_key_value.weight module.language_model.encoder.layers.11.input_norm.weight module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.2.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.20.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.16.self_attention.dense.bias module.language_model.encoder.layers.12.input_norm.bias module.language_model.encoder.layers.8.self_attention.query_key_value.bias module.language_model.encoder.layers.3.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.21.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.17.self_attention.dense.bias module.language_model.encoder.layers.13.input_norm.bias module.language_model.encoder.layers.9.self_attention.query_key_value.bias module.language_model.encoder.layers.4.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.22.post_attention_norm.weight module.language_model.encoder.layers.18.self_attention.dense.weight module.language_model.encoder.layers.14.self_attention.query_key_value.weight module.language_model.encoder.layers.10.input_norm.weight module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.1.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.23.post_attention_norm.bias module.language_model.encoder.layers.19.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.15.self_attention.dense.bias module.language_model.encoder.layers.11.input_norm.bias module.language_model.encoder.layers.7.self_attention.query_key_value.bias module.language_model.encoder.layers.2.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.20.post_attention_norm.weight module.language_model.encoder.layers.16.self_attention.dense.weight module.language_model.encoder.layers.12.self_attention.query_key_value.weight module.language_model.encoder.layers.8.input_norm.weight module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.0.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.21.post_attention_norm.weight module.language_model.encoder.layers.17.self_attention.dense.weight module.language_model.encoder.layers.13.self_attention.query_key_value.weight module.language_model.encoder.layers.9.input_norm.weight 
module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.1.input_norm.bias module.language_model.encoder.layers.22.post_attention_norm.bias module.language_model.encoder.layers.18.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.14.self_attention.dense.bias module.language_model.encoder.layers.10.input_norm.bias module.language_model.encoder.layers.6.self_attention.query_key_value.bias module.language_model.encoder.layers.1.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.23.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.19.post_attention_norm.weight module.language_model.encoder.layers.15.self_attention.dense.weight module.language_model.encoder.layers.11.self_attention.query_key_value.weight module.language_model.encoder.layers.7.input_norm.weight module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.20.post_attention_norm.bias module.language_model.encoder.layers.16.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.12.self_attention.dense.bias module.language_model.encoder.layers.8.input_norm.bias module.language_model.encoder.layers.4.self_attention.query_key_value.bias module.language_model.encoder.layers.21.post_attention_norm.bias module.language_model.encoder.layers.17.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.13.self_attention.dense.bias module.language_model.encoder.layers.9.input_norm.bias module.language_model.encoder.layers.5.self_attention.query_key_value.bias module.language_model.encoder.layers.22.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.18.post_attention_norm.weight module.language_model.encoder.layers.14.self_attention.dense.weight module.language_model.encoder.layers.10.self_attention.query_key_value.weight module.language_model.encoder.layers.6.input_norm.weight module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.3.self_attention.query_key_value.bias module.language_model.encoder.layers.23.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.19.post_attention_norm.bias module.language_model.encoder.layers.15.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.11.self_attention.dense.bias module.language_model.encoder.layers.7.input_norm.bias module.language_model.encoder.layers.20.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.16.post_attention_norm.weight module.language_model.encoder.layers.12.self_attention.dense.weight module.language_model.encoder.layers.8.self_attention.query_key_value.weight module.language_model.encoder.layers.4.input_norm.weight module.language_model.encoder.layers.21.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.17.post_attention_norm.weight module.language_model.encoder.layers.13.self_attention.dense.weight module.language_model.encoder.layers.9.self_attention.query_key_value.weight module.language_model.encoder.layers.5.input_norm.weight module.language_model.encoder.layers.22.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.18.post_attention_norm.bias module.language_model.encoder.layers.14.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.10.self_attention.dense.bias module.language_model.encoder.layers.6.input_norm.bias module.language_model.encoder.layers.2.self_attention.query_key_value.bias module.language_model.encoder.layers.23.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.19.mlp.dense_h_to_4h.weight 
        module.language_model.encoder.layers.15.post_attention_norm.weight module.language_model.encoder.layers.11.self_attention.dense.weight module.language_model.encoder.layers.7.self_attention.query_key_value.weight module.language_model.encoder.layers.3.input_norm.weight module.language_model.encoder.layers.20.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.16.post_attention_norm.bias module.language_model.encoder.layers.12.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.8.self_attention.dense.bias module.language_model.encoder.layers.4.input_norm.bias module.language_model.encoder.layers.1.self_attention.query_key_value.bias module.language_model.encoder.layers.21.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.17.post_attention_norm.bias module.language_model.encoder.layers.13.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.9.self_attention.dense.bias module.language_model.encoder.layers.5.input_norm.bias module.language_model.encoder.layers.22.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.18.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.14.post_attention_norm.weight module.language_model.encoder.layers.10.self_attention.dense.weight module.language_model.encoder.layers.6.self_attention.query_key_value.weight module.language_model.encoder.layers.2.input_norm.weight module.language_model.encoder.layers.19.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.15.post_attention_norm.bias module.language_model.encoder.layers.11.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.7.self_attention.dense.bias module.language_model.encoder.layers.3.input_norm.bias module.language_model.encoder.layers.20.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.16.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.12.post_attention_norm.weight module.language_model.encoder.layers.8.self_attention.dense.weight module.language_model.encoder.layers.4.self_attention.query_key_value.weight module.language_model.embedding.position_embeddings.weight module.language_model.encoder.layers.21.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.17.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.13.post_attention_norm.weight module.language_model.encoder.layers.9.self_attention.dense.weight module.language_model.encoder.layers.5.self_attention.query_key_value.weight module.language_model.encoder.layers.23.self_attention.query_key_value.bias module.language_model.encoder.layers.18.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.14.post_attention_norm.bias module.language_model.encoder.layers.10.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.6.self_attention.dense.bias module.language_model.encoder.layers.2.input_norm.bias module.language_model.encoder.layers.3.self_attention.query_key_value.weight module.language_model.encoder.final_norm.weight module.language_model.encoder.layers.19.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.15.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.11.post_attention_norm.weight module.language_model.encoder.layers.7.self_attention.dense.weight
total number of elements: 354871296
learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2023-09-24 17:31:28
building train, validation, and test datasets ...
datasets target sizes (minimum size):
    train:      32000000
    validation: 1282560
    test:       2560
building train, validation, and test datasets for GPT ...
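Side note for anyone comparing runs: the parameter count and dataset target sizes above are exactly what the command-line flags imply. A rough sanity-check sketch, assuming the GPT-2 vocabulary of 50,257 is padded to 50,304 (Megatron's default `--make-vocab-size-divisible-by 128`; the padded size itself is not printed in this log):

```python
# Hypothetical sanity check, not part of the training code: recompute the
# parameter count printed above from the flags passed to pretrain_gpt.py.
h, n_layers, max_pos, padded_vocab = 1024, 24, 1024, 50304  # 50257 padded to a multiple of 128

per_layer = (
    (3 * h * h + 3 * h)    # fused QKV projection: weight + bias
    + (h * h + h)          # attention output projection: weight + bias
    + (4 * h * h + 4 * h)  # MLP dense_h_to_4h: weight + bias
    + (4 * h * h + h)      # MLP dense_4h_to_h: weight + bias
    + 4 * h                # input_norm + post_attention_norm (weight + bias each)
)
total = (
    n_layers * per_layer
    + 2 * h                # final_norm weight + bias
    + max_pos * h          # position embeddings
    + padded_vocab * h     # word embeddings (output layer is tied to these)
)
assert total == 354_871_296  # matches "number of parameters ... : 354871296"
```

The dataset target sizes follow the same way: train = 500,000 train iterations × global batch 64 = 32,000,000 samples; test = 40 eval iterations × 64 = 2,560; and validation = 40 × 64 × 501 = 1,282,560, assuming one evaluation run every 1,000 iterations plus a final one. Likewise, the 949,50,1 split of the 79,000 documents reported in the next chunk of the log works out to 74,971 / 3,950 / 79 documents.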
Single data path provided for train, valid & test
building dataset index ...
    reading sequence lengths...
    reading sequence pointers...
    reading document indices...
    creating np buffer of mmap...
    creating memory view of np buffer...
finished creating indexed dataset in 0.002253 seconds
number of documents: 79000
dataset split:
    train:      document indices in [0, 74971) total of 74971 documents
    validation: document indices in [74971, 78921) total of 3950 documents
    test:       document indices in [78921, 79000) total of 79 documents
09677202c889:5275:5695 [0] NCCL INFO Using network Socket
09677202c889:5276:5696 [1] NCCL INFO Using network Socket
09677202c889:5277:5697 [2] NCCL INFO Using network Socket
09677202c889:5281:5698 [6] NCCL INFO Using network Socket
09677202c889:5283:5699 [7] NCCL INFO Using network Socket
09677202c889:5280:5700 [5] NCCL INFO Using network Socket
09677202c889:5279:5701 [4] NCCL INFO Using network Socket
09677202c889:5278:5702 [3] NCCL INFO Using network Socket
09677202c889:5280:5700 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
09677202c889:5283:5699 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
09677202c889:5278:5702 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
09677202c889:5281:5698 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
09677202c889:5277:5697 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
09677202c889:5275:5695 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
09677202c889:5276:5696 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
09677202c889:5279:5701 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
09677202c889:5275:5695 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
09677202c889:5277:5697 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
09677202c889:5275:5695 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
09677202c889:5276:5696 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
09677202c889:5283:5699 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20]
-1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 09677202c889:5277:5697 [2] NCCL INFO P2P Chunksize set to 524288 09677202c889:5275:5695 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 09677202c889:5279:5701 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 09677202c889:5276:5696 [1] NCCL INFO P2P Chunksize set to 524288 09677202c889:5283:5699 [7] NCCL INFO P2P Chunksize set to 524288 09677202c889:5281:5698 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 09677202c889:5275:5695 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 09677202c889:5280:5700 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 09677202c889:5278:5702 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 09677202c889:5279:5701 [4] NCCL INFO P2P Chunksize set to 524288 09677202c889:5281:5698 [6] NCCL INFO P2P Chunksize set to 524288 09677202c889:5275:5695 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 09677202c889:5280:5700 [5] NCCL INFO P2P Chunksize set to 524288 09677202c889:5278:5702 [3] NCCL INFO P2P Chunksize set to 524288 09677202c889:5275:5695 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL 
INFO Channel 21/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 09677202c889:5275:5695 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 09677202c889:5275:5695 [0] NCCL INFO P2P Chunksize set to 524288 09677202c889:5278:5702 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 00/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 01/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 01/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 01/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 01/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 02/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 02/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 01/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 02/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 01/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 02/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 03/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 03/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 02/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 02/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 03/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 03/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 02/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 03/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 04/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 04/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 
09677202c889:5281:5698 [6] NCCL INFO Channel 03/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 04/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 03/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 05/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 05/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 04/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 04/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 05/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 05/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 04/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 06/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 06/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 05/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 06/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 05/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 05/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 07/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 07/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 06/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 07/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 06/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 06/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 07/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 08/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 08/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 07/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 08/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 07/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 07/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 08/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 09/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 09/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 
08/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 09/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 08/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 08/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 09/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 10/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 10/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 09/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 10/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 09/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 09/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 11/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 11/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 10/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 10/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 11/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 10/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 10/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 12/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 12/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 11/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 12/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 11/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 11/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 13/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 13/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 12/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 12/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 13/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 14/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 14/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 12/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 12/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 13/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 
09677202c889:5280:5700 [5] NCCL INFO Channel 13/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 14/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 15/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 15/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 13/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 13/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 14/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 15/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 14/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 16/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 16/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 14/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 14/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 15/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 16/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 15/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 17/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 17/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 15/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 15/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 16/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 17/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 16/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 18/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 18/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 16/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 16/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 17/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 18/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 17/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 19/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 19/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 17/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 17/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 
18/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 19/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 18/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 20/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 20/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 18/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 18/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 20/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 19/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 19/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 21/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 21/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 19/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 21/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 19/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 20/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 22/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 20/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 22/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 21/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 23/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 21/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 23/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 22/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 
09677202c889:5281:5698 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Connected all rings 09677202c889:5280:5700 [5] NCCL INFO Connected all rings 09677202c889:5281:5698 [6] NCCL INFO Connected all rings 09677202c889:5277:5697 [2] NCCL INFO Connected all rings 09677202c889:5278:5702 [3] NCCL INFO Connected all rings 09677202c889:5283:5699 [7] NCCL INFO Connected all rings 09677202c889:5283:5699 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Connected all rings 09677202c889:5276:5696 [1] NCCL INFO Connected all rings 09677202c889:5283:5699 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 02/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 03/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 04/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 05/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 06/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 07/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 08/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 09/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 10/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 11/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 12/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 13/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 14/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 15/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 16/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 17/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 18/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 19/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 20/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 21/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 22/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 23/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 01/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 01/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 02/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO 
Channel 00/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 02/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 01/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 01/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 03/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 01/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 03/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 01/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 02/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 02/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 04/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 04/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 02/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 03/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 02/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 03/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 05/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 05/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 03/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 04/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 03/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 04/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 06/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 06/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 05/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 04/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 05/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 07/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 07/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 06/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 05/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 05/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 06/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 08/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 08/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 07/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 07/0 : 3[201d0] -> 2[201c0] via 
P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 06/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 09/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 09/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 08/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 08/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 07/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 07/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 10/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 10/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 09/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 09/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 08/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 08/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 11/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 11/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 10/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 10/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 09/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 12/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 09/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 12/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 11/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 11/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 10/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 13/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 10/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 13/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 12/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 12/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 14/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 11/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 14/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 13/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 13/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 15/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 12/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 12/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 15/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL 
INFO Channel 14/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 14/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 16/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 13/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 16/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 13/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 15/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 15/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 17/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 14/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 17/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 14/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 16/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 16/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 18/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 15/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 18/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 15/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 17/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 17/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 19/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 19/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 16/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 16/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 18/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 20/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 18/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 20/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 17/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 17/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 19/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 21/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 19/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 21/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 18/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 18/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 20/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 22/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 20/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 22/0 : 6[a01c0] -> 5[901d0] via 
P2P/IPC/read
09677202c889:5276:5696 [1] NCCL INFO Channel 19/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 19/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 21/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5279:5701 [4] NCCL INFO Channel 23/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 21/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5281:5698 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5276:5696 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5283:5699 [7] NCCL INFO Connected all trees
09677202c889:5283:5699 [7] NCCL INFO NVLS multicast support is not available on dev 7
09677202c889:5283:5699 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5283:5699 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5283 :0:5704] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5276:5696 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
==== backtrace (tid: 5704) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000008170d ncclGroupEnd() ???:0
 2 0x00000000000742f0 ncclGroupEnd() ???:0
 3 0x0000000000008609 start_thread() ???:0
 4 0x000000000011f133 clone() ???:0

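The `Caught signal 7 (Bus error: nonexistent physical address)` line above is the first worker thread to die, and every backtrace in this log is identical: a thread whose frames resolve to `ncclGroupEnd()` hits a bus error while the second NCCL communicator (the `Using network Socket` block that begins right after the dataset split) is being set up. On Linux, signal 7 is SIGBUS, which is also how torchrun reports the failure at the very end: `exitcode: -7`, since a negative exit code means the child was killed by that signal number. A minimal check of that mapping:

```python
# Sanity check only: map signal number 7 to its name (Linux/x86-64).
import signal

assert signal.Signals(7) is signal.SIGBUS
print(signal.Signals(7).name)  # -> "SIGBUS"
```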
09677202c889:5276:5696 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5279:5701 [4] NCCL INFO Connected all trees
09677202c889:5279:5701 [4] NCCL INFO NVLS multicast support is not available on dev 4
09677202c889:5279:5701 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5279:5701 [4] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5279 :0:5710] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5280:5700 [5] NCCL INFO Connected all trees
09677202c889:5280:5700 [5] NCCL INFO NVLS multicast support is not available on dev 5
09677202c889:5280:5700 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5280:5700 [5] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5280 :0:5703] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5281:5698 [6] NCCL INFO Connected all trees
09677202c889:5281:5698 [6] NCCL INFO NVLS multicast support is not available on dev 6
09677202c889:5281:5698 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5281:5698 [6] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5281 :0:5706] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5276:5696 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
==== backtrace (tid: 5703) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000008170d ncclGroupEnd() ???:0
 2 0x00000000000742f0 ncclGroupEnd() ???:0
 3 0x0000000000008609 start_thread() ???:0
 4 0x000000000011f133 clone() ???:0

==== backtrace (tid: 5706) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0

==== backtrace (tid: 5710) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0

09677202c889:5275:5695 [0] NCCL INFO Connected all trees 09677202c889:5275:5695 [0] NCCL INFO NVLS multicast support is not available on dev 0 09677202c889:5275:5695 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5275:5695 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer [09677202c889:5275 :0:5708] Caught signal 7 (Bus error: nonexistent physical address) 09677202c889:5277:5697 [2] NCCL INFO Connected all trees 09677202c889:5277:5697 [2] NCCL INFO NVLS multicast support is not available on dev 2 09677202c889:5277:5697 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5277:5697 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer [09677202c889:5277 :0:5707] Caught signal 7 (Bus error: nonexistent physical address) 09677202c889:5276:5696 [1] NCCL INFO Connected all trees 09677202c889:5276:5696 [1] NCCL INFO NVLS multicast support is not available on dev 1 09677202c889:5276:5696 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5276:5696 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer [09677202c889:5276 :0:5709] Caught signal 7 (Bus error: nonexistent physical address) 09677202c889:5278:5702 [3] NCCL INFO Connected all trees 09677202c889:5278:5702 [3] NCCL INFO NVLS multicast support is not available on dev 3 09677202c889:5278:5702 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 09677202c889:5278:5702 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer [09677202c889:5278 :0:5705] Caught signal 7 (Bus error: nonexistent physical address) ==== backtrace (tid: 5707) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0

==== backtrace (tid: 5709) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0

==== backtrace (tid: 5705) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0

==== backtrace (tid: 5708) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 5275) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/workspace/Megatron-LM/pretrain_gpt.py FAILED

Failures:
[1]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 5276)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5276
[2]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 5277)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5277
[3]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 5278)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5278
[4]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 5279)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5279
[5]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 5 (local_rank: 5)
  exitcode  : -7 (pid: 5280)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5280
[6]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 6 (local_rank: 6)
  exitcode  : -7 (pid: 5281)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5281
[7]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 7 (local_rank: 7)
  exitcode  : -7 (pid: 5283)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5283

Root Cause (first observed failure):
[0]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 5275)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5275

root@09677202c889:/workspace#

deepakn94 commented 1 year ago

Can you try running this (one CPU process per GPU, instead of a single CPU process for all 8 GPUs on the node)?

mpirun -np 8 ./all_reduce_perf_mpi -b 8 -e 128M -f 2 -g 1
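
(If the nccl-tests binaries aren't already available in the container, a minimal build sketch follows. It assumes NVIDIA's nccl-tests repo and an MPI install at /usr/local/mpi; the MPI_HOME path and the built binary's name vary by image, so adjust both to match your environment.)

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/mpi   # build with MPI support so one rank drives one GPU
mpirun -np 8 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1   # -g 1: one GPU per process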
deepakn94 commented 1 year ago

Can you also add --shm-size=1g --ulimit memlock=-1 to your docker run command?
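
(A sketch of where those flags go; the image name and volume mount below are placeholders, not the exact original command:)

docker run --gpus all --shm-size=1g --ulimit memlock=-1 \
    -d -i -t -v /path/to/data:/data <image> /bin/bash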

awsankur commented 1 year ago

Running the container like below solves the issue. Training works!

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it -v /home/ubuntu/data:/data megatron-training:latest /bin/bash
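
(For anyone landing here with the same SIGBUS: the likely explanation is that Docker's default /dev/shm is only 64 MB, which is too small for NCCL's intra-node shared-memory transport, so the ranks die with "Bus error: nonexistent physical address". --ipc=host shares the host's shared memory with the container; a sufficiently large --shm-size works as well. A quick check from inside the container:)

df -h /dev/shm   # default IPC namespace typically shows 64M; with --ipc=host it shows the host's shm size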

awsankur commented 1 year ago

Thank you for your help

deepakn94 commented 1 year ago

Great to hear! Going to close this.

ZhangEnmao commented 9 months ago

Hi, I'm seeing the same output. Do you know the possible reason? (screenshot attached)

ZhangEnmao commented 9 months ago

> Hi, I'm seeing the same output. Do you know the possible reason? (screenshot attached)

I think it may cause some errors.