awsankur closed this issue 1 year ago.
Are you seeing these errors within a particular docker container?
Yes. I am getting these errors in the container nvcr.io/nvidia/pytorch:23.05-py3
Can you check if you get the same error with nvcr.io/nvidia/pytorch:23.04-py3?
Just tried it. I get the exact same error.
Interesting, this works for us locally.
I think this is related to your NCCL setup. Are you able to run nccl_tests in the same setup? Or something simple that uses torch.distributed: https://pytorch.org/tutorials/intermediate/dist_tuto.html#setup.
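For reference, a minimal all_reduce smoke test along the lines of that tutorial could look like the sketch below. The script name, tensor values, and helper function are illustrative assumptions, not from this thread; only the torch.distributed calls themselves are standard API.

```python
# check_dist.py (hypothetical name) -- minimal NCCL sanity check; launch with:
#   torchrun --standalone --nnodes=1 --nproc_per_node=8 check_dist.py

def expected_allreduce_sum(world_size: int) -> float:
    # Each rank contributes its rank id; SUM over ranks 0..world_size-1.
    return float(world_size * (world_size - 1) // 2)

def main() -> None:
    import os
    import torch
    import torch.distributed as dist

    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    t = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # hangs or errors here if NCCL is broken
    assert t.item() == expected_allreduce_sum(dist.get_world_size())
    print(f"rank {dist.get_rank()}: all_reduce OK ({t.item()})")
    dist.destroy_process_group()

# Under torchrun, the file would simply end with: main()
```

If this hangs or crashes the same way, the problem is in the NCCL/torch.distributed layer rather than in Megatron-LM itself.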
I am able to run NCCL tests on my node. Here is the result I get:
root@f218865125ae:/opt/nccl-tests/build# ./all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 66 on f218865125ae device 0 [0x10] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 66 on f218865125ae device 1 [0x10] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 66 on f218865125ae device 2 [0x20] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 66 on f218865125ae device 3 [0x20] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 66 on f218865125ae device 4 [0x90] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 66 on f218865125ae device 5 [0x90] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 66 on f218865125ae device 6 [0xa0] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 66 on f218865125ae device 7 [0xa0] NVIDIA A100-SXM4-80GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 75.64 0.00 0.00 0 74.20 0.00 0.00 0
16 4 float sum -1 73.47 0.00 0.00 0 73.32 0.00 0.00 0
32 8 float sum -1 73.25 0.00 0.00 0 74.40 0.00 0.00 0
64 16 float sum -1 74.39 0.00 0.00 0 73.35 0.00 0.00 0
128 32 float sum -1 73.84 0.00 0.00 0 73.68 0.00 0.00 0
256 64 float sum -1 74.51 0.00 0.01 0 74.40 0.00 0.01 0
512 128 float sum -1 73.45 0.01 0.01 0 73.21 0.01 0.01 0
1024 256 float sum -1 75.96 0.01 0.02 0 76.13 0.01 0.02 0
2048 512 float sum -1 86.42 0.02 0.04 0 83.34 0.02 0.04 0
4096 1024 float sum -1 93.53 0.04 0.08 0 91.42 0.04 0.08 0
8192 2048 float sum -1 94.95 0.09 0.15 0 94.23 0.09 0.15 0
16384 4096 float sum -1 97.16 0.17 0.30 0 97.85 0.17 0.29 0
32768 8192 float sum -1 111.5 0.29 0.51 0 111.8 0.29 0.51 0
65536 16384 float sum -1 117.1 0.56 0.98 0 116.0 0.56 0.99 0
131072 32768 float sum -1 124.3 1.05 1.84 0 123.7 1.06 1.85 0
262144 65536 float sum -1 126.0 2.08 3.64 0 127.2 2.06 3.61 0
524288 131072 float sum -1 138.8 3.78 6.61 0 132.1 3.97 6.95 0
1048576 262144 float sum -1 141.3 7.42 12.99 0 143.7 7.30 12.77 0
2097152 524288 float sum -1 152.4 13.76 24.09 0 152.6 13.74 24.05 0
4194304 1048576 float sum -1 169.6 24.73 43.27 0 167.8 25.00 43.75 0
8388608 2097152 float sum -1 190.1 44.13 77.23 0 192.5 43.59 76.28 0
16777216 4194304 float sum -1 221.4 75.79 132.63 0 217.2 77.23 135.15 0
33554432 8388608 float sum -1 356.8 94.05 164.59 0 355.7 94.33 165.08 0
67108864 16777216 float sum -1 574.8 116.76 204.32 0 574.5 116.80 204.41 0
134217728 33554432 float sum -1 1158.2 115.89 202.81 0 1154.2 116.29 203.51 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 35.1127
#
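As a sanity check on these numbers: nccl-tests computes algbw as bytes moved per unit time, and for all_reduce scales it by 2(n-1)/n to get busbw. Reproducing the last in-place row of the table above (the helper name is mine):

```python
def allreduce_bandwidths(size_bytes: int, time_us: float, n_ranks: int):
    """Return (algbw, busbw) in GB/s, as nccl-tests computes them for all_reduce."""
    algbw = size_bytes / (time_us * 1e3)          # bytes/us -> GB/s (1 GB = 1e9 B)
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks   # all_reduce bus-bandwidth factor
    return algbw, busbw

# Last row, in-place: 134217728 bytes in 1154.2 us across 8 GPUs
algbw, busbw = allreduce_bandwidths(134217728, 1154.2, 8)
print(f"algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")  # ~116.29 / ~203.50
```

The ~200 GB/s busbw at large sizes is consistent with NVLink-connected A100s, so the intra-node fabric itself looks fine.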
I am also able to train another model with DDP using torchrun without any issues. The issue arises only when running Megatron-LM code. Since it works for you locally, how can I help you debug this?
Can you run your Megatron command with NCCL_DEBUG=INFO and send the logfile here?
Here it is:
root@09677202c889:/workspace# torchrun --standalone --nnodes=1 --nproc_per_node=8 /workspace/Megatron-LM/pretrain_gpt.py --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 8 --global-batch-size 64 --lr 0.00015 --train-iters 500000 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path /data/gpt2/my-gpt2_text_document --vocab-file /data/gpt2/gpt2-vocab.json --merge-file /data/gpt2/gpt2-merges.txt --split 949,50,1 --log-interval 1 --save-interval 10000 --eval-interval 1000 --eval-iters 40
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Zarr-based strategies will not be registered because of missing packages
using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 8
  data_path ....................................... ['/data/gpt2/my-gpt2_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  embedding_weights_in_fp32 ....................... False
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 24
  encoder_seq_length .............................. 1024
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 40
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 4096
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 64
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1024
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 64
  lazy_mpu_init ................................... None
  load ............................................ None
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.00015
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. 0.01
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 1024
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... /data/gpt2/gpt2-merges.txt
  micro_batch_size ................................ 8
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_warmup ..................................... False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 16
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.float16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... learned_absolute
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... 10000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 1024
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  split ........................................... 949,50,1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. GPT2BPETokenizer
  train_data_path ................................. None
  train_iters ..................................... 500000
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /data/gpt2/gpt2-vocab.json
  vocab_size ...................................... None
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 8
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initialized tensor model parallel with size 1
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/workspace/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/Megatron-LM/megatron/data'
done with dataset index builder. Compilation time: 0.075 seconds
compiling and loading fused kernels ...
09677202c889:5779:5779 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5779:5779 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5779:5779 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.17.1+cuda12.1
09677202c889:5780:5780 [1] NCCL INFO cudaDriverVersion 12020
09677202c889:5780:5780 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5787:5787 [7] NCCL INFO cudaDriverVersion 12020
09677202c889:5787:5787 [7] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5783:5783 [4] NCCL INFO cudaDriverVersion 12020
09677202c889:5783:5783 [4] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5785:5785 [6] NCCL INFO cudaDriverVersion 12020
09677202c889:5782:5782 [3] NCCL INFO cudaDriverVersion 12020
09677202c889:5785:5785 [6] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5782:5782 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5781:5781 [2] NCCL INFO cudaDriverVersion 12020
09677202c889:5784:5784 [5] NCCL INFO cudaDriverVersion 12020
09677202c889:5781:5781 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5784:5784 [5] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5783:5783 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5785:5785 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5782:5782 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5781:5781 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5787:5787 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5780:5780 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
09677202c889:5784:5784 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
09677202c889:5779:6160 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5779:6160 [0] NCCL INFO P2P plugin IBext
09677202c889:5779:6160 [0] NCCL INFO NET/IB : No device found.
09677202c889:5779:6160 [0] NCCL INFO NET/IB : No device found.
09677202c889:5779:6160 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5779:6160 [0] NCCL INFO Using network Socket
09677202c889:5783:6165 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5783:6165 [4] NCCL INFO P2P plugin IBext
09677202c889:5783:6165 [4] NCCL INFO NET/IB : No device found.
09677202c889:5783:6165 [4] NCCL INFO NET/IB : No device found.
09677202c889:5783:6165 [4] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5783:6165 [4] NCCL INFO Using network Socket
09677202c889:5781:6168 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5781:6168 [2] NCCL INFO P2P plugin IBext
09677202c889:5781:6168 [2] NCCL INFO NET/IB : No device found.
09677202c889:5781:6168 [2] NCCL INFO NET/IB : No device found.
09677202c889:5781:6168 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5781:6168 [2] NCCL INFO Using network Socket
09677202c889:5787:6171 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5787:6171 [7] NCCL INFO P2P plugin IBext
09677202c889:5787:6171 [7] NCCL INFO NET/IB : No device found.
09677202c889:5787:6171 [7] NCCL INFO NET/IB : No device found.
09677202c889:5787:6171 [7] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5787:6171 [7] NCCL INFO Using network Socket
09677202c889:5782:6167 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5782:6167 [3] NCCL INFO P2P plugin IBext
09677202c889:5782:6167 [3] NCCL INFO NET/IB : No device found.
09677202c889:5782:6167 [3] NCCL INFO NET/IB : No device found.
09677202c889:5782:6167 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5782:6167 [3] NCCL INFO Using network Socket
09677202c889:5780:6172 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5780:6172 [1] NCCL INFO P2P plugin IBext
09677202c889:5780:6172 [1] NCCL INFO NET/IB : No device found.
09677202c889:5780:6172 [1] NCCL INFO NET/IB : No device found.
09677202c889:5780:6172 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5780:6172 [1] NCCL INFO Using network Socket
09677202c889:5785:6166 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5785:6166 [6] NCCL INFO P2P plugin IBext
09677202c889:5785:6166 [6] NCCL INFO NET/IB : No device found.
09677202c889:5785:6166 [6] NCCL INFO NET/IB : No device found.
09677202c889:5785:6166 [6] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5785:6166 [6] NCCL INFO Using network Socket
09677202c889:5784:6174 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
09677202c889:5784:6174 [5] NCCL INFO P2P plugin IBext
09677202c889:5784:6174 [5] NCCL INFO NET/IB : No device found.
09677202c889:5784:6174 [5] NCCL INFO NET/IB : No device found.
09677202c889:5784:6174 [5] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
09677202c889:5784:6174 [5] NCCL INFO Using network Socket
09677202c889:5787:6171 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
09677202c889:5784:6174 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
09677202c889:5783:6165 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
09677202c889:5782:6167 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
09677202c889:5780:6172 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
09677202c889:5781:6168 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
09677202c889:5779:6160 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
09677202c889:5785:6166 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
09677202c889:5779:6160 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
09677202c889:5787:6171 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
09677202c889:5785:6166 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
09677202c889:5779:6160 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
09677202c889:5784:6174 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
09677202c889:5787:6171 [7] NCCL INFO P2P Chunksize set to 524288
09677202c889:5785:6166 [6] NCCL INFO P2P Chunksize set to 524288
09677202c889:5779:6160 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
09677202c889:5780:6172 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
09677202c889:5783:6165 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
09677202c889:5781:6168 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
09677202c889:5784:6174 [5] NCCL INFO P2P Chunksize set to 524288
09677202c889:5779:6160 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
09677202c889:5780:6172 [1] NCCL INFO P2P Chunksize set to 524288
09677202c889:5782:6167 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
09677202c889:5783:6165 [4] NCCL INFO P2P Chunksize set to 524288
09677202c889:5781:6168 [2] NCCL INFO P2P Chunksize set to 524288
09677202c889:5779:6160 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
09677202c889:5782:6167 [3] NCCL INFO P2P Chunksize set to 524288
09677202c889:5779:6160 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
09677202c889:5779:6160 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
09677202c889:5779:6160 [0] NCCL INFO P2P Chunksize set to 524288
09677202c889:6257:6638 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 00/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 01/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 01/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 01/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 01/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 01/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 01/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 02/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 02/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 02/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 02/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 02/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 02/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 02/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 03/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 03/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 03/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 03/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 03/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 03/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 03/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 04/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 04/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 04/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 04/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 04/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 04/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 05/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 05/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 05/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 05/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 05/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 05/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 05/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 06/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 06/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 06/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 06/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 06/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 07/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 06/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 07/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 07/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 07/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 07/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 07/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 08/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 08/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 08/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 07/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 08/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 08/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 08/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 09/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 09/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 09/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 09/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 09/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 09/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 08/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 10/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 10/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 10/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 10/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 10/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 09/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 10/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 11/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 11/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 11/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 11/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 11/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 10/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 12/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 12/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 12/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 12/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 12/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 11/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 12/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 13/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 13/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 13/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 13/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 13/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 13/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 12/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 14/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 14/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 14/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 14/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 14/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 14/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 13/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 15/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 15/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 15/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 15/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 15/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 15/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 14/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 16/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 16/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 16/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 16/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 16/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 16/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 15/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 17/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 17/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 17/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 17/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 17/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 17/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 16/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 18/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 18/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 18/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 18/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 18/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 18/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 17/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 19/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 19/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 19/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 19/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 19/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 19/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 18/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 20/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 20/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 20/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 20/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 19/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 21/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 21/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 21/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 21/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 22/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 20/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 22/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 22/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 23/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 21/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 23/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Connected all rings
09677202c889:6254:6636 [3] NCCL INFO Connected all rings
09677202c889:6251:6632 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6251:6632 [0] NCCL INFO Connected all rings
09677202c889:6256:6637 [5] NCCL INFO Connected all rings
09677202c889:6259:6640 [7] NCCL INFO Connected all rings
09677202c889:6259:6640 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Connected all rings
09677202c889:6255:6646 [4] NCCL INFO Connected all rings
09677202c889:6257:6638 [6] NCCL INFO Connected all rings
09677202c889:6259:6640 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 02/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 03/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 04/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 05/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 06/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 07/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 08/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 09/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 10/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 11/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 12/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 13/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 14/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 15/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 16/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 17/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 18/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 19/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 20/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 01/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 01/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 21/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 02/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 02/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 22/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 00/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 03/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 03/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6259:6640 [7] NCCL INFO Channel 23/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 01/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 01/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 04/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 04/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 02/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 01/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 02/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 05/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 05/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 03/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 01/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 03/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 02/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 06/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 06/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 02/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 04/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 03/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 07/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 07/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 03/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 05/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 05/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 08/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 08/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 04/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 04/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 06/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 09/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 09/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 05/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 07/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 05/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 07/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 10/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 10/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 06/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 06/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 08/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 08/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 11/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 11/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 07/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 09/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 07/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 09/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 12/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 12/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 08/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 10/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 08/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 10/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 13/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 13/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 09/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 09/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 11/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 14/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 14/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 10/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 12/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 10/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 12/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 15/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 15/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 11/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 13/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 11/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 13/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 16/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 16/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 12/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 12/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 14/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 14/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 17/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 17/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 13/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 13/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 15/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 15/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 18/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 18/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 14/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 14/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 16/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 16/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 19/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 19/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 17/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 15/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 15/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 17/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 20/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 18/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 16/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 18/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 16/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 21/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 19/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 17/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 19/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 17/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 18/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 20/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 18/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6254:6636 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:6253:6644 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 19/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 21/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 19/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 20/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 22/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 20/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 21/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6252:6642 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:6255:6646 [4] NCCL INFO Channel 23/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 21/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:6256:6637 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:6257:6638 [6] NCCL INFO Channel 22/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5275:5656 [0] NCCL INFO Connected all trees
09677202c889:5275:5656 [0] NCCL INFO NVLS multicast support is not available on dev 0
09677202c889:5275:5656 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5276:5662 [1] NCCL INFO Connected all trees
09677202c889:5276:5662 [1] NCCL INFO NVLS multicast support is not available on dev 1
09677202c889:5276:5662 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5278:5664 [3] NCCL INFO Connected all trees
09677202c889:5278:5664 [3] NCCL INFO NVLS multicast support is not available on dev 3
09677202c889:5278:5664 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5276:5662 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5278:5664 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5275:5656 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5277:5663 [2] NCCL INFO Connected all trees
09677202c889:5277:5663 [2] NCCL INFO NVLS multicast support is not available on dev 2
09677202c889:5277:5663 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5277:5663 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5281:5669 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5280:5665 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5283:5670 [7] NCCL INFO Connected all trees
09677202c889:5283:5670 [7] NCCL INFO NVLS multicast support is not available on dev 7
09677202c889:5283:5670 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5283:5670 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5280:5665 [5] NCCL INFO Connected all trees
09677202c889:5280:5665 [5] NCCL INFO NVLS multicast support is not available on dev 5
09677202c889:5280:5665 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5280:5665 [5] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5281:5669 [6] NCCL INFO Connected all trees
09677202c889:5281:5669 [6] NCCL INFO NVLS multicast support is not available on dev 6
09677202c889:5281:5669 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5281:5669 [6] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5279:5666 [4] NCCL INFO Connected all trees
09677202c889:5279:5666 [4] NCCL INFO NVLS multicast support is not available on dev 4
09677202c889:5279:5666 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5279:5666 [4] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
09677202c889:5281:5669 [6] NCCL INFO comm 0x83a93c0 rank 6 nranks 8 cudaDev 6 busId a01c0 commId 0x15f48e7ced62c065 - Init COMPLETE
09677202c889:5283:5670 [7] NCCL INFO comm 0x7b3cf40 rank 7 nranks 8 cudaDev 7 busId a01d0 commId 0x15f48e7ced62c065 - Init COMPLETE
09677202c889:5277:5663 [2] NCCL INFO comm 0x8a2f6e0 rank 2 nranks 8 cudaDev 2 busId 201c0 commId 0x15f48e7ced62c065 - Init COMPLETE
09677202c889:5278:5664 [3] NCCL INFO comm 0x9104200 rank 3 nranks 8 cudaDev 3 busId 201d0 commId 0x15f48e7ced62c065 - Init COMPLETE
09677202c889:5276:5662 [1] NCCL INFO comm 0x8a3b120 rank 1 nranks 8 cudaDev 1 busId 101d0 commId 0x15f48e7ced62c065 - Init COMPLETE
09677202c889:5280:5665 [5] NCCL INFO comm 0x853d0a0 rank 5 nranks 8 cudaDev 5 busId 901d0 commId 0x15f48e7ced62c065 - Init COMPLETE
09677202c889:5279:5666 [4] NCCL INFO comm 0x928b5a0 rank 4 nranks 8 cudaDev 4 busId 901c0 commId 0x15f48e7ced62c065 - Init COMPLETE
09677202c889:5275:5656 [0] NCCL INFO comm 0x8dee0a0 rank 0 nranks 8 cudaDev 0 busId 101c0 commId 0x15f48e7ced62c065 - Init COMPLETE
done with compiling and loading fused kernels. Compilation time: 6.855 seconds
time to initialize megatron (seconds): 10.024
[after megatron is initialized] datetime: 2023-09-24 17:31:28
building GPT model ...
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296
buckets for gradient all-reduce:
params for bucket 1
module.language_model.encoder.layers.22.self_attention.query_key_value.bias module.language_model.encoder.layers.17.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.13.post_attention_norm.bias module.language_model.encoder.layers.9.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.5.self_attention.dense.bias module.language_model.encoder.layers.0.input_norm.weight module.language_model.encoder.layers.1.self_attention.query_key_value.weight module.language_model.encoder.layers.23.input_norm.weight module.language_model.encoder.layers.18.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.14.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.10.post_attention_norm.weight module.language_model.encoder.layers.6.self_attention.dense.weight module.language_model.encoder.layers.2.self_attention.query_key_value.weight module.language_model.encoder.layers.0.post_attention_norm.weight module.language_model.encoder.final_norm.bias module.language_model.encoder.layers.20.self_attention.query_key_value.bias module.language_model.encoder.layers.15.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.11.post_attention_norm.bias module.language_model.encoder.layers.7.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.3.self_attention.dense.bias module.language_model.encoder.layers.0.self_attention.dense.bias module.language_model.encoder.layers.21.self_attention.query_key_value.bias module.language_model.encoder.layers.16.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.12.post_attention_norm.bias module.language_model.encoder.layers.8.mlp.dense_h_to_4h.bias 
module.language_model.encoder.layers.4.self_attention.dense.bias module.language_model.encoder.layers.0.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.22.input_norm.weight module.language_model.encoder.layers.17.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.13.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.9.post_attention_norm.weight module.language_model.encoder.layers.5.self_attention.dense.weight module.language_model.encoder.layers.1.self_attention.dense.bias module.language_model.encoder.layers.23.input_norm.bias module.language_model.encoder.layers.19.self_attention.query_key_value.bias module.language_model.encoder.layers.14.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.10.post_attention_norm.bias module.language_model.encoder.layers.6.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.2.self_attention.dense.bias module.language_model.encoder.layers.0.self_attention.query_key_value.weight module.language_model.encoder.layers.20.input_norm.weight module.language_model.encoder.layers.15.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.11.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.7.post_attention_norm.weight module.language_model.encoder.layers.3.self_attention.dense.weight module.language_model.encoder.layers.21.input_norm.weight module.language_model.encoder.layers.16.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.12.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.8.post_attention_norm.weight module.language_model.encoder.layers.4.self_attention.dense.weight module.language_model.encoder.layers.22.input_norm.bias module.language_model.encoder.layers.18.self_attention.query_key_value.bias module.language_model.encoder.layers.13.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.9.post_attention_norm.bias module.language_model.encoder.layers.5.mlp.dense_h_to_4h.bias 
module.language_model.encoder.layers.0.mlp.dense_h_to_4h.weight module.language_model.embedding.word_embeddings.weight module.language_model.encoder.layers.23.self_attention.query_key_value.weight module.language_model.encoder.layers.19.input_norm.weight module.language_model.encoder.layers.14.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.10.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.6.post_attention_norm.weight module.language_model.encoder.layers.2.self_attention.dense.weight module.language_model.encoder.layers.0.input_norm.bias module.language_model.encoder.layers.20.input_norm.bias module.language_model.encoder.layers.16.self_attention.query_key_value.bias module.language_model.encoder.layers.11.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.7.post_attention_norm.bias module.language_model.encoder.layers.3.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.21.input_norm.bias module.language_model.encoder.layers.17.self_attention.query_key_value.bias module.language_model.encoder.layers.12.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.8.post_attention_norm.bias module.language_model.encoder.layers.4.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.1.self_attention.dense.weight module.language_model.encoder.layers.22.self_attention.query_key_value.weight module.language_model.encoder.layers.18.input_norm.weight module.language_model.encoder.layers.13.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.9.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.5.post_attention_norm.weight module.language_model.encoder.layers.0.self_attention.query_key_value.bias module.language_model.encoder.layers.23.self_attention.dense.bias module.language_model.encoder.layers.19.input_norm.bias module.language_model.encoder.layers.15.self_attention.query_key_value.bias module.language_model.encoder.layers.10.mlp.dense_4h_to_h.bias 
module.language_model.encoder.layers.6.post_attention_norm.bias module.language_model.encoder.layers.2.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.1.input_norm.weight module.language_model.encoder.layers.20.self_attention.query_key_value.weight module.language_model.encoder.layers.16.input_norm.weight module.language_model.encoder.layers.11.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.7.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.3.post_attention_norm.weight module.language_model.encoder.layers.21.self_attention.query_key_value.weight module.language_model.encoder.layers.17.input_norm.weight module.language_model.encoder.layers.12.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.8.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.4.post_attention_norm.weight module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.22.self_attention.dense.bias module.language_model.encoder.layers.18.input_norm.bias module.language_model.encoder.layers.14.self_attention.query_key_value.bias module.language_model.encoder.layers.9.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.5.post_attention_norm.bias module.language_model.encoder.layers.1.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.23.self_attention.dense.weight module.language_model.encoder.layers.19.self_attention.query_key_value.weight module.language_model.encoder.layers.15.input_norm.weight module.language_model.encoder.layers.10.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.6.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.2.post_attention_norm.weight module.language_model.encoder.layers.20.self_attention.dense.bias module.language_model.encoder.layers.16.input_norm.bias module.language_model.encoder.layers.12.self_attention.query_key_value.bias module.language_model.encoder.layers.7.mlp.dense_4h_to_h.bias 
module.language_model.encoder.layers.3.post_attention_norm.bias module.language_model.encoder.layers.0.post_attention_norm.bias module.language_model.encoder.layers.21.self_attention.dense.bias module.language_model.encoder.layers.17.input_norm.bias module.language_model.encoder.layers.13.self_attention.query_key_value.bias module.language_model.encoder.layers.8.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.4.post_attention_norm.bias module.language_model.encoder.layers.1.post_attention_norm.weight module.language_model.encoder.layers.22.self_attention.dense.weight module.language_model.encoder.layers.18.self_attention.query_key_value.weight module.language_model.encoder.layers.14.input_norm.weight module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.5.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.23.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.19.self_attention.dense.bias module.language_model.encoder.layers.15.input_norm.bias module.language_model.encoder.layers.11.self_attention.query_key_value.bias module.language_model.encoder.layers.6.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.2.post_attention_norm.bias module.language_model.encoder.layers.0.self_attention.dense.weight module.language_model.encoder.layers.20.self_attention.dense.weight module.language_model.encoder.layers.16.self_attention.query_key_value.weight module.language_model.encoder.layers.12.input_norm.weight module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.3.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.21.self_attention.dense.weight module.language_model.encoder.layers.17.self_attention.query_key_value.weight module.language_model.encoder.layers.13.input_norm.weight module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.4.mlp.dense_h_to_4h.weight 
module.language_model.encoder.layers.22.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.18.self_attention.dense.bias module.language_model.encoder.layers.14.input_norm.bias module.language_model.encoder.layers.10.self_attention.query_key_value.bias module.language_model.encoder.layers.5.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.1.post_attention_norm.bias module.language_model.encoder.layers.23.post_attention_norm.weight module.language_model.encoder.layers.19.self_attention.dense.weight module.language_model.encoder.layers.15.self_attention.query_key_value.weight module.language_model.encoder.layers.11.input_norm.weight module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.2.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.20.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.16.self_attention.dense.bias module.language_model.encoder.layers.12.input_norm.bias module.language_model.encoder.layers.8.self_attention.query_key_value.bias module.language_model.encoder.layers.3.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.21.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.17.self_attention.dense.bias module.language_model.encoder.layers.13.input_norm.bias module.language_model.encoder.layers.9.self_attention.query_key_value.bias module.language_model.encoder.layers.4.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.22.post_attention_norm.weight module.language_model.encoder.layers.18.self_attention.dense.weight module.language_model.encoder.layers.14.self_attention.query_key_value.weight module.language_model.encoder.layers.10.input_norm.weight module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.1.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.23.post_attention_norm.bias module.language_model.encoder.layers.19.mlp.dense_h_to_4h.bias 
module.language_model.encoder.layers.15.self_attention.dense.bias module.language_model.encoder.layers.11.input_norm.bias module.language_model.encoder.layers.7.self_attention.query_key_value.bias module.language_model.encoder.layers.2.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.20.post_attention_norm.weight module.language_model.encoder.layers.16.self_attention.dense.weight module.language_model.encoder.layers.12.self_attention.query_key_value.weight module.language_model.encoder.layers.8.input_norm.weight module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.0.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.21.post_attention_norm.weight module.language_model.encoder.layers.17.self_attention.dense.weight module.language_model.encoder.layers.13.self_attention.query_key_value.weight module.language_model.encoder.layers.9.input_norm.weight module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.1.input_norm.bias module.language_model.encoder.layers.22.post_attention_norm.bias module.language_model.encoder.layers.18.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.14.self_attention.dense.bias module.language_model.encoder.layers.10.input_norm.bias module.language_model.encoder.layers.6.self_attention.query_key_value.bias module.language_model.encoder.layers.1.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.23.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.19.post_attention_norm.weight module.language_model.encoder.layers.15.self_attention.dense.weight module.language_model.encoder.layers.11.self_attention.query_key_value.weight module.language_model.encoder.layers.7.input_norm.weight module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.20.post_attention_norm.bias module.language_model.encoder.layers.16.mlp.dense_h_to_4h.bias 
module.language_model.encoder.layers.12.self_attention.dense.bias module.language_model.encoder.layers.8.input_norm.bias module.language_model.encoder.layers.4.self_attention.query_key_value.bias module.language_model.encoder.layers.21.post_attention_norm.bias module.language_model.encoder.layers.17.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.13.self_attention.dense.bias module.language_model.encoder.layers.9.input_norm.bias module.language_model.encoder.layers.5.self_attention.query_key_value.bias module.language_model.encoder.layers.22.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.18.post_attention_norm.weight module.language_model.encoder.layers.14.self_attention.dense.weight module.language_model.encoder.layers.10.self_attention.query_key_value.weight module.language_model.encoder.layers.6.input_norm.weight module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.3.self_attention.query_key_value.bias module.language_model.encoder.layers.23.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.19.post_attention_norm.bias module.language_model.encoder.layers.15.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.11.self_attention.dense.bias module.language_model.encoder.layers.7.input_norm.bias module.language_model.encoder.layers.20.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.16.post_attention_norm.weight module.language_model.encoder.layers.12.self_attention.dense.weight module.language_model.encoder.layers.8.self_attention.query_key_value.weight module.language_model.encoder.layers.4.input_norm.weight module.language_model.encoder.layers.21.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.17.post_attention_norm.weight module.language_model.encoder.layers.13.self_attention.dense.weight module.language_model.encoder.layers.9.self_attention.query_key_value.weight module.language_model.encoder.layers.5.input_norm.weight 
module.language_model.encoder.layers.22.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.18.post_attention_norm.bias module.language_model.encoder.layers.14.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.10.self_attention.dense.bias module.language_model.encoder.layers.6.input_norm.bias module.language_model.encoder.layers.2.self_attention.query_key_value.bias module.language_model.encoder.layers.23.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.19.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.15.post_attention_norm.weight module.language_model.encoder.layers.11.self_attention.dense.weight module.language_model.encoder.layers.7.self_attention.query_key_value.weight module.language_model.encoder.layers.3.input_norm.weight module.language_model.encoder.layers.20.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.16.post_attention_norm.bias module.language_model.encoder.layers.12.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.8.self_attention.dense.bias module.language_model.encoder.layers.4.input_norm.bias module.language_model.encoder.layers.1.self_attention.query_key_value.bias module.language_model.encoder.layers.21.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.17.post_attention_norm.bias module.language_model.encoder.layers.13.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.9.self_attention.dense.bias module.language_model.encoder.layers.5.input_norm.bias module.language_model.encoder.layers.22.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.18.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.14.post_attention_norm.weight module.language_model.encoder.layers.10.self_attention.dense.weight module.language_model.encoder.layers.6.self_attention.query_key_value.weight module.language_model.encoder.layers.2.input_norm.weight module.language_model.encoder.layers.19.mlp.dense_4h_to_h.bias 
module.language_model.encoder.layers.15.post_attention_norm.bias module.language_model.encoder.layers.11.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.7.self_attention.dense.bias module.language_model.encoder.layers.3.input_norm.bias module.language_model.encoder.layers.20.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.16.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.12.post_attention_norm.weight module.language_model.encoder.layers.8.self_attention.dense.weight module.language_model.encoder.layers.4.self_attention.query_key_value.weight module.language_model.embedding.position_embeddings.weight module.language_model.encoder.layers.21.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.17.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.13.post_attention_norm.weight module.language_model.encoder.layers.9.self_attention.dense.weight module.language_model.encoder.layers.5.self_attention.query_key_value.weight module.language_model.encoder.layers.23.self_attention.query_key_value.bias module.language_model.encoder.layers.18.mlp.dense_4h_to_h.bias module.language_model.encoder.layers.14.post_attention_norm.bias module.language_model.encoder.layers.10.mlp.dense_h_to_4h.bias module.language_model.encoder.layers.6.self_attention.dense.bias module.language_model.encoder.layers.2.input_norm.bias module.language_model.encoder.layers.3.self_attention.query_key_value.weight module.language_model.encoder.final_norm.weight module.language_model.encoder.layers.19.mlp.dense_4h_to_h.weight module.language_model.encoder.layers.15.mlp.dense_h_to_4h.weight module.language_model.encoder.layers.11.post_attention_norm.weight module.language_model.encoder.layers.7.self_attention.dense.weight
total number of elements: 354871296
learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2023-09-24 17:31:28
building train, validation, and test datasets ...
datasets target sizes (minimum size): train: 32000000 validation: 1282560 test: 2560
building train, validation, and test datasets for GPT ...
Single data path provided for train, valid & test
building dataset index ...
reading sequence lengths...
reading sequence pointers...
reading document indices...
creating np buffer of mmap...
creating memory view of np buffer...
finished creating indexed dataset in 0.002253 seconds
number of documents: 79000
dataset split:
train: document indices in [0, 74971) total of 74971 documents
validation: document indices in [74971, 78921) total of 3950 documents
test: document indices in [78921, 79000) total of 79 documents
09677202c889:5275:5695 [0] NCCL INFO Using network Socket
09677202c889:5276:5696 [1] NCCL INFO Using network Socket
09677202c889:5277:5697 [2] NCCL INFO Using network Socket
09677202c889:5281:5698 [6] NCCL INFO Using network Socket
09677202c889:5283:5699 [7] NCCL INFO Using network Socket
09677202c889:5280:5700 [5] NCCL INFO Using network Socket
09677202c889:5279:5701 [4] NCCL INFO Using network Socket
09677202c889:5278:5702 [3] NCCL INFO Using network Socket
09677202c889:5280:5700 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
09677202c889:5283:5699 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
09677202c889:5278:5702 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
09677202c889:5281:5698 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
09677202c889:5277:5697 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
09677202c889:5275:5695 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
09677202c889:5276:5696 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
09677202c889:5279:5701 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
09677202c889:5275:5695 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
09677202c889:5277:5697 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
09677202c889:5275:5695 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
09677202c889:5276:5696 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
09677202c889:5283:5699 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
09677202c889:5277:5697 [2] NCCL INFO P2P Chunksize set to 524288
09677202c889:5275:5695 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
09677202c889:5279:5701 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
09677202c889:5276:5696 [1] NCCL INFO P2P Chunksize set to 524288
09677202c889:5283:5699 [7] NCCL INFO P2P Chunksize set to 524288
09677202c889:5281:5698 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
09677202c889:5275:5695 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
09677202c889:5280:5700 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
09677202c889:5278:5702 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
09677202c889:5279:5701 [4] NCCL INFO P2P Chunksize set to 524288
09677202c889:5281:5698 [6] NCCL INFO P2P Chunksize set to 524288
09677202c889:5275:5695 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
09677202c889:5280:5700 [5] NCCL INFO P2P Chunksize set to 524288
09677202c889:5278:5702 [3] NCCL INFO P2P Chunksize set to 524288
09677202c889:5275:5695 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
09677202c889:5275:5695 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
09677202c889:5275:5695 [0] NCCL INFO P2P Chunksize set to 524288
09677202c889:5278:5702 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5279:5701 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5276:5696 [1] NCCL INFO Channel 00/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5283:5699 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:5279:5701 [4] NCCL INFO Channel 01/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 01/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5281:5698 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:5276:5696 [1] NCCL INFO Channel 01/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5275:5695 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:5283:5699 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 01/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
09677202c889:5279:5701 [4] NCCL INFO Channel 02/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 02/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5281:5698 [6] NCCL INFO Channel 01/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
09677202c889:5275:5695 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5276:5696 [1] NCCL INFO Channel 02/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 01/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read
09677202c889:5283:5699 [7] NCCL INFO Channel 02/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5279:5701 [4] NCCL INFO Channel 03/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 03/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 02/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 02/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 03/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 03/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 02/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 03/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 04/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 04/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 03/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 04/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 03/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 05/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 05/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 04/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 04/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 05/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 05/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 04/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 
09677202c889:5279:5701 [4] NCCL INFO Channel 06/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 06/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 05/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 06/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 05/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 05/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 07/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 07/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 06/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 07/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 06/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 06/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 07/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 08/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 08/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 07/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 08/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 07/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 
09677202c889:5277:5697 [2] NCCL INFO Channel 07/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 08/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 09/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 09/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 08/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 09/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 08/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 08/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 09/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 10/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 10/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 09/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 10/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 09/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 09/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 11/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 11/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 10/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 10/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 11/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 
09677202c889:5277:5697 [2] NCCL INFO Channel 10/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 10/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 12/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 12/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 11/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 12/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 11/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 11/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 13/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 13/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 12/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 12/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 13/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 14/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 14/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 12/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 12/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 13/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 
09677202c889:5280:5700 [5] NCCL INFO Channel 13/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 14/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 15/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 15/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 13/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 13/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 14/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 15/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 14/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 16/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 16/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 14/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 14/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 15/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 16/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 15/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 17/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 17/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 15/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 
09677202c889:5277:5697 [2] NCCL INFO Channel 15/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 16/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 17/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 16/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 18/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 18/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 16/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 16/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 17/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 18/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 17/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 19/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 19/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 17/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 17/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 18/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 19/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 18/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 20/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 20/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 
09677202c889:5281:5698 [6] NCCL INFO Channel 18/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 18/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 20/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 19/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 19/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 21/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 21/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 19/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 21/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 19/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 20/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Channel 22/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 20/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 22/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 21/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 
09677202c889:5279:5701 [4] NCCL INFO Channel 23/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read 09677202c889:5278:5702 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 4[901c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 21/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5283:5699 [7] NCCL INFO Channel 23/0 : 7[a01d0] -> 0[101c0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 22/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5276:5696 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read 09677202c889:5280:5700 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5281:5698 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read 09677202c889:5277:5697 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 3[201d0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read 09677202c889:5279:5701 [4] NCCL INFO Connected all rings 09677202c889:5280:5700 [5] NCCL INFO Connected all rings 09677202c889:5281:5698 [6] NCCL INFO Connected all rings 09677202c889:5277:5697 [2] NCCL INFO Connected all rings 09677202c889:5278:5702 [3] NCCL INFO Connected all rings 09677202c889:5283:5699 [7] NCCL INFO Connected all rings 09677202c889:5283:5699 [7] NCCL INFO Channel 00/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read 09677202c889:5275:5695 [0] NCCL INFO Connected all rings 09677202c889:5276:5696 [1] NCCL INFO Connected all rings 
09677202c889:5283:5699 [7] NCCL INFO Channel 01/0 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
[… the same "7[a01d0] -> 6[a01c0] via P2P/IPC/read" line repeats for Channels 02/0 through 23/0 …]
09677202c889:5279:5701 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 3[201d0] via P2P/IPC/read
09677202c889:5281:5698 [6] NCCL INFO Channel 00/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 00/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 00/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5276:5696 [1] NCCL INFO Channel 00/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 00/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
[… analogous reverse-direction connection lines repeat for the remaining channels on ranks 1–6 …]
09677202c889:5281:5698 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
09677202c889:5276:5696 [1] NCCL INFO Channel 20/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 20/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 22/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 22/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
09677202c889:5283:5699 [7] NCCL INFO Connected all trees
09677202c889:5283:5699 [7] NCCL INFO NVLS multicast support is not available on dev 7
09677202c889:5283:5699 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5283:5699 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5283 :0:5704] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5276:5696 [1] NCCL INFO Channel 21/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5278:5702 [3] NCCL INFO Channel 23/0 : 3[201d0] -> 2[201c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 21/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5280:5700 [5] NCCL INFO Channel 23/0 : 5[901d0] -> 4[901c0] via P2P/IPC/read
==== backtrace (tid: 5704) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000008170d ncclGroupEnd() ???:0
 2 0x00000000000742f0 ncclGroupEnd() ???:0
 3 0x0000000000008609 start_thread() ???:0
 4 0x000000000011f133 clone() ???:0
09677202c889:5276:5696 [1] NCCL INFO Channel 22/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 22/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
09677202c889:5279:5701 [4] NCCL INFO Connected all trees
09677202c889:5279:5701 [4] NCCL INFO NVLS multicast support is not available on dev 4
09677202c889:5279:5701 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5279:5701 [4] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5279 :0:5710] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5280:5700 [5] NCCL INFO Connected all trees
09677202c889:5280:5700 [5] NCCL INFO NVLS multicast support is not available on dev 5
09677202c889:5280:5700 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5280:5700 [5] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5280 :0:5703] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5281:5698 [6] NCCL INFO Connected all trees
09677202c889:5281:5698 [6] NCCL INFO NVLS multicast support is not available on dev 6
09677202c889:5281:5698 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5281:5698 [6] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5281 :0:5706] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5276:5696 [1] NCCL INFO Channel 23/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read
09677202c889:5277:5697 [2] NCCL INFO Channel 23/0 : 2[201c0] -> 1[101d0] via P2P/IPC/read
==== backtrace (tid: 5703) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000008170d ncclGroupEnd() ???:0
 2 0x00000000000742f0 ncclGroupEnd() ???:0
 3 0x0000000000008609 start_thread() ???:0
 4 0x000000000011f133 clone() ???:0
==== backtrace (tid: 5706) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000008170d ncclGroupEnd() ???:0
 2 0x00000000000742f0 ncclGroupEnd() ???:0
 3 0x0000000000008609 start_thread() ???:0
 4 0x000000000011f133 clone() ???:0
==== backtrace (tid: 5710) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000008170d ncclGroupEnd() ???:0
 2 0x00000000000742f0 ncclGroupEnd() ???:0
 3 0x0000000000008609 start_thread() ???:0
 4 0x000000000011f133 clone() ???:0
09677202c889:5275:5695 [0] NCCL INFO Connected all trees
09677202c889:5275:5695 [0] NCCL INFO NVLS multicast support is not available on dev 0
09677202c889:5275:5695 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5275:5695 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5275 :0:5708] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5277:5697 [2] NCCL INFO Connected all trees
09677202c889:5277:5697 [2] NCCL INFO NVLS multicast support is not available on dev 2
09677202c889:5277:5697 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5277:5697 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5277 :0:5707] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5276:5696 [1] NCCL INFO Connected all trees
09677202c889:5276:5696 [1] NCCL INFO NVLS multicast support is not available on dev 1
09677202c889:5276:5696 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5276:5696 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5276 :0:5709] Caught signal 7 (Bus error: nonexistent physical address)
09677202c889:5278:5702 [3] NCCL INFO Connected all trees
09677202c889:5278:5702 [3] NCCL INFO NVLS multicast support is not available on dev 3
09677202c889:5278:5702 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
09677202c889:5278:5702 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[09677202c889:5278 :0:5705] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 5707) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000008170d ncclGroupEnd() ???:0
 2 0x00000000000742f0 ncclGroupEnd() ???:0
 3 0x0000000000008609 start_thread() ???:0
 4 0x000000000011f133 clone() ???:0
==== backtrace (tid: 5709) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0
==== backtrace (tid: 5705) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0
==== backtrace (tid: 5708) ==== 0 0x0000000000043090 killpg() ???:0 1 0x000000000008170d ncclGroupEnd() ???:0 2 0x00000000000742f0 ncclGroupEnd() ???:0 3 0x0000000000008609 start_thread() ???:0 4 0x000000000011f133 clone() ???:0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 5275) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/workspace/Megatron-LM/pretrain_gpt.py FAILED
Failures:
[1]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 5276)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5276
[2]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 5277)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5277
[3]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 5278)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5278
[4]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 5279)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5279
[5]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 5 (local_rank: 5)
  exitcode  : -7 (pid: 5280)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5280
[6]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 6 (local_rank: 6)
  exitcode  : -7 (pid: 5281)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5281
[7]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 7 (local_rank: 7)
  exitcode  : -7 (pid: 5283)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5283
Root Cause (first observed failure):
[0]:
  time      : 2023-09-24_17:31:35
  host      : 09677202c889
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 5275)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 5275
root@09677202c889:/workspace#
Can you try running this (one CPU process per GPU, instead of a single CPU process for all 8 GPUs on the node)?
mpirun -np 8 ./all_reduce_perf_mpi -b 8 -e 128M -f 2 -g 1
Can you also add --shm-size=1g --ulimit memlock=-1 to your docker run command?
Running the container as shown below solves the issue. Training works!
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it -v /home/ubuntu/data:/data megatron-training:latest /bin/bash
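For anyone hitting this later: a quick way to confirm from inside the container that the shared-memory mount is actually larger than Docker's 64 MB default (a plausible trigger for the SIGBUS above, since NCCL's intra-node P2P/IPC transport allocates buffers in /dev/shm) is to check the mount size directly. This is just a sanity-check sketch, assuming a Linux container:

```shell
# Inside the running container: show the size of the shared-memory mount.
# With --ipc=host (or a generous --shm-size), the "Size" column should be
# far larger than the 64M Docker gives containers by default.
df -h /dev/shm
```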
Thank you for your help!
Great to hear! Going to close this.
Hi, I'm seeing the same output. Do you know the possible reason?
I think it may be causing some errors.
Describe the bug
Single node (8 A100 GPUs) training with pretrain_gpt.py errors out.
To Reproduce
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.
Expected behavior
A clear and concise description of what you expected to happen. Something like:
Stack trace/logs
If applicable, add the stack trace or logs from the time of the error.
============================================================
root@173632d0f02d:/workspace# export CUDA_DEVICE_MAX_CONNECTIONS=1
root@173632d0f02d:/workspace# torchrun --standalone --nnodes=1 --nproc_per_node=8 /workspace/Megatron-LM/pretrain_gpt.py --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 8 --global-batch-size 64 --lr 0.00015 --train-iters 500000 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path /data/fsx/gpt2/my-gpt2_text_document --vocab-file /data/fsx/gpt2/gpt2-vocab.json --merge-file /data/fsx/gpt2/gpt2-merges.txt --split 949,50,1 --log-interval 1 --save-interval 10000 --eval-interval 1000 --eval-iters 40
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
check_for_nan_in_loss_and_grad .................. True
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 8
data_path ....................................... ['/data/fsx/gpt2/my-gpt2_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
embedding_weights_in_fp32 ....................... False
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 24
encoder_seq_length .............................. 1024
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 40
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 4096
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 64
gradient_accumulation_fusion .................... True
group_query_attention ........................... False
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 64
lazy_mpu_init ................................... None
load ............................................ None
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. 0.01
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
max_tokens_to_oom ............................... 12000
merge_file ...................................... /data/fsx/gpt2/gpt2-merges.txt
micro_batch_size ................................ 8
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
norm_epsilon .................................... 1e-05
normalization ................................... LayerNorm
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 1
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... learned_absolute
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 1024
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
split ........................................... 949,50,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. GPT2BPETokenizer
train_data_path ................................. None
train_iters ..................................... 500000
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... /data/fsx/gpt2/gpt2-vocab.json
vocab_size ...................................... None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 8
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
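The last log line follows from the batch-size arithmetic in the argument dump: with no pipeline parallelism, the number of gradient-accumulation micro-batches per step is global_batch_size divided by (micro_batch_size × data_parallel_size). A quick check with this run's values (plain arithmetic, not Megatron-LM code):

```python
# Micro-batches per training step, using the arguments from the dump above:
# 64 / (8 * 8) = 1, matching "setting number of micro-batches to constant 1".
global_batch_size = 64
micro_batch_size = 8
data_parallel_size = 8

num_micro_batches = global_batch_size // (micro_batch_size * data_parallel_size)
print(num_micro_batches)  # 1
```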
Environment (please complete the following information):
Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.
Additional context
Add any other context about the problem here.