Error when train galvatron with global mode CUDA error: uncorrectable ECC error encountered

I set the environment variables as follow in train_dist.sh in gpt_hf folder:

export NUM_NODES=1
export NUM_GPUS_PER_NODE=8
export MASTER_ADDR=localhost
export MASTER_PORT=2222
export NODE_RANK=0
export CUDA_DEVICE_MAX_CONNECTIONS=1

I used a cluster with 8 NVIDIA RTX A6000 and cuda version is release 11.8, V11.8.89 Then I encoutered error as follow:

and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
using world size: 8, data-parallel size: 8, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
setting global batch size to 8
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adam_weight_decay ............................... 0.01
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  allow_tf32 ...................................... 1
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  check_loss ...................................... 0
  chunks .......................................... 8
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 8
  data_path ....................................... None
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  default_dp_type ................................. zero2
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  dropout_prob .................................... 0.1
  embed_sdp ....................................... 0
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 24
  encoder_seq_length .............................. None
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  epochs .......................................... 10
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  evidence_data_path .............................. None
  exit_after_profiling ............................ 1
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 6400
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  galvatron_config_path ........................... None
  global_batch_size ............................... 8
  global_checkpoint ............................... 0
  global_tp_consec ................................ 1
  global_tp_deg ................................... 2
  global_train_batch_size ......................... 16
  gradient_accumulation_fusion .................... False
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1600
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  initialize_on_meta .............................. 0
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 50
  lazy_mpu_init ................................... None
  load ............................................ None
  load_params ..................................... 0
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 512
  max_predictions_per_seq ......................... 20
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mixed_precision ................................. bf16
  model_size ...................................... gpt-0.3b
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_hidden_layers ............................... 12
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  pipeline_type ................................... pipedream_flush
  position_embedding_type ......................... learned_absolute
  pp_deg .......................................... 2
  profile ......................................... 1
  profile_forward ................................. 0
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  profile_type .................................... allocated
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_verify_neighbor_count ..................... True
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  save_profiled_memory ............................ 0
  scatter_gather_tensors_in_pipeline .............. True
  sdp ............................................. 0
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... True
  set_layernum_manually ........................... 0
  set_model_config_manually ....................... 0
  sgd_momentum .................................... 0.9
  shape_order ..................................... SBH
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 969, 30, 1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. BertWordPieceLowerCase
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_cfg ............................. None
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... None
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... True
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. True
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... 50257
  vocab_tp ........................................ 4
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 8
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
> initializing torch distributed ...
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "head_dim": 64,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.41.2",
  "use_cache": false,
  "vocab_size": 50257
}

------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adam_weight_decay ............................... 0.01
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  allow_tf32 ...................................... 1
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  check_loss ...................................... 0
  chunks .......................................... 8
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 8
  data_path ....................................... None
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  default_dp_type ................................. zero2
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  dropout_prob .................................... 0.1
  embed_sdp ....................................... 0
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 24
  encoder_seq_length .............................. None
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  epochs .......................................... 10
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  evidence_data_path .............................. None
  exit_after_profiling ............................ 1
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 6400
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  galvatron_config_path ........................... None
  global_batch_size ............................... 8
  global_checkpoint ............................... 0
  global_tp_consec ................................ 1
  global_tp_deg ................................... 2
  global_train_batch_size ......................... 16
  gradient_accumulation_fusion .................... False
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1024
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  initialize_on_meta .............................. 0
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 64
  lazy_mpu_init ................................... None
  load ............................................ None
  load_params ..................................... 0
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 512
  max_predictions_per_seq ......................... 20
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mixed_precision ................................. bf16
  model_size ...................................... gpt-0.3b
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 16
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_hidden_layers ............................... 24
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 50304
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  pipeline_type ................................... pipedream_flush
  position_embedding_type ......................... learned_absolute
  pp_deg .......................................... 2
  profile ......................................... 1
  profile_forward ................................. 0
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  profile_type .................................... allocated
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_verify_neighbor_count ..................... True
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  save_profiled_memory ............................ 0
  scatter_gather_tensors_in_pipeline .............. True
  sdp ............................................. 0
  seed ............................................ 1234
  seq_length ...................................... 1024
  sequence_parallel ............................... True
  set_layernum_manually ........................... 0
  set_model_config_manually ....................... 0
  sgd_momentum .................................... 0.9
  shape_order ..................................... SBH
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 969, 30, 1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. BertWordPieceLowerCase
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_cfg ............................. None
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... None
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... True
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. True
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... 50257
  vocab_tp ........................................ 4
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 8
-------------------- end of arguments ---------------------
======================== Galvatron Parallel Config =============================
Galvatron parallel config mode: [GLOBAL config mode]
[GLOBAL config mode] Loaded global hybrid parallel strategy:
   global_batch_size: 16, chunks: 8
   pp_deg: 2, tp_deg: 2, dp_deg: 2, tp_consecutive_flag: 1, checkpoint_flag: 0
   pipeline_type: pipedream_flush, default_dp_type: zero2, dtype: bf16
   pp_division:                  [12, 12]
   pp_ranks:                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
================================================================================
Creating Model...
Model Layer Types:
['embed', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'norm', 'cls']
   tp_sizes_whole:               [4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4]
   tp_consec_whole:              [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
   dp_types_whole:               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   pp_ranks_whole:               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
   checkpoint_flags_whole:       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   dp_sizes_whole:               [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
================================================================================
====================== Galvatron Communication Group ===========================
TP groups for rank 5 (all layers):
[4, 5, 6, 7] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5, 6, 7] [4, 5, 6, 7] 
DP groups for rank 5 (all layers):
[5] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5] [5] 
Split groups for rank 5:
None [4, 5, 6, 7] None None None None None None None None None None None None None None None None None None None None None None None [4, 5] None 
AllGather groups for rank 5:
None [4, 5] None None None None None None None None None None None None None None None None None None None None None None None [4, 5, 6, 7] None 
Fused split groups for rank 5:
None [5, 7] None None None None None None None None None None None None None None None None None None None None None None None None None 
Fused allgather groups for rank 5:
None None None None None None None None None None None None None None None None None None None None None None None None None [5, 7] None 
================================================================================
Traceback (most recent call last):
  File "train_dist.py", line 109, in <module>
    train(args)
  File "train_dist.py", line 55, in train
    model = construct_hybrid_parallel_model(
  File "/home/wyr/Hetu-Galvatron/galvatron/models/gpt_hf/GPTModel_hybrid_parallel.py", line 12, in construct_hybrid_parallel_model
    hp_model = construct_hybrid_parallel_model_api(
  File "/home/wyr/Hetu-Galvatron/galvatron/core/hybrid_parallel_model.py", line 138, in construct_hybrid_parallel_model_api
    hp_model.wrap_pipeline_modules_data_parallel(
  File "/home/wyr/Hetu-Galvatron/galvatron/core/pipeline/pipeline.py", line 127, in wrap_pipeline_modules_data_parallel
    self.model_cur_stage = wrap_modules_data_parallel(
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 185, in wrap_modules_data_parallel
    module_list[i] = wrap_data_parallel(module_list[i], dp_types[i], dp_groups[i], module_type=module_types[i], pp_device = pp_device, mixed_precision=mixed_precision, pp_on=pp_on, wrap_block_name=wrap_block_name)
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 20, in wrap_data_parallel
    return wrap_module_fsdp_manually(module, pp_device, module_type, dp_group, fsdp_type=fsdp_type_dict[dp_type], mixed_precision=mixed_precision, pp_on=pp_on, wrap_block_name=wrap_block_name)
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 54, in wrap_module_fsdp_manually
    module = apply_fsdp(module, fsdp_args, wrap_block_name)
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 98, in apply_fsdp
    _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 415, in _init_param_handle_from_module
    _move_module_to_device(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 802, in _move_module_to_device
    module = module.to(device_from_device_id)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: uncorrectable ECC error encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
Creating Dataset...
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
After creating model [Allocated]
        Max memory: 298.26 MB   Current memory : 248.26 MB

Before Forward [Allocated]
        Max memory: 248.39 MB   Current memory : 248.39 MB
After creating model [Allocated]
        Max memory: 319.77 MB   Current memory : 252.25 MB
Start training...

Before Forward [Allocated]
        Max memory: 252.38 MB   Current memory : 252.38 MB
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943546 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943554 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943555 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943557 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943562 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 6 (pid: 2943559) of binary: /home/wyr/anaconda3/envs/galvatron/bin/python3
Traceback (most recent call last):
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Could you please help me resolve this issue? Or could you provide some possible solutions? Thank you for your help and support!

PKU-DAIR / Hetu-Galvatron

Error when train galvatron with global mode CUDA error: uncorrectable ECC error encountered #3