EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Assertion Error when Setting pipe_parallel_size or model_parallel_size in GPT-NeoX #1251

Open · lieh1203 opened 4 months ago

lieh1203 commented 4 months ago

Hello,

I am encountering an issue with the GPT-NeoX library on a single node with 4 GPUs. When I set either pipe_parallel_size or model_parallel_size to 2 in configs/2-7B.yml, I get an assertion error; the full launch log is below.
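For context, I start training the usual way (assuming here the standard deepy.py launcher, which matches the NeoXArgs.from_ymls() line in the log):

    # launch command (assuming the standard deepy.py entry point)
    python ./deepy.py train.py configs/2-7B.yml configs/local_setup.yml

The relevant settings are in the parallelism block at the top of configs/2-7B.yml (the full config is echoed in the log); setting model_parallel_size to 2 instead fails the same way:

    # configs/2-7B.yml (excerpt) -- parallelism settings
    "pipe_parallel_size": 2,
    "model_parallel_size": 1,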

[2024-07-11 06:28:59,215] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
NeoXArgs.from_ymls() ['configs/2-7B.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
  attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
  batch_size ...................... 4...........................updated
  checkpoint_activations .......... True........................updated
  checkpoint_factor ............... 10000.......................updated
  config_files .................... {'2-7B.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe_parallel_size": 2,\n   "model_parallel_size": 1,\n   #"gradient_accumulation_steps": 2,\n\n   # model settings\n   "num_layers": 32,\n   "hidden_size": 2560,\n   "num_attention_heads": 32,\n   "seq_length": 2048,\n   "max_position_embeddings": 2048,\n   "norm": "layernorm",\n   "pos_emb": "rotary",\n   "no_weight_tying": true,\n   "gpt_j_residual": false,\n   "output_layer_parallelism": "column",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled_upper_triang_masked_softmax_fusion": false,\n   "bias_gelu_fusion": false,\n   "rope_fusion": false,\n   "layernorm_fusion": false,\n\n   # init methods\n   "init_method": "small_init",\n   "output_layer_init_method": "wang_init",\n\n   # optimizer settings\n   "optimizer": {\n     "type": "Adam",\n     "params": {\n       "lr": 0.00016,\n       "betas": [0.9, 0.95],\n       "eps": 1.0e-8,\n     }\n   },\n   "min_lr": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   "zero_optimization": {\n    "stage": 3,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 4,\n   "data_impl": "mmap",\n\n   # activation checkpointing\n   "checkpoint_activations": true,\n   "checkpoint_num_layers": 1,\n   "partition_activations": true,\n   "synchronize_each_layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight_decay": 0.1,\n   "hidden_dropout": 0,\n   "attention_dropout": 0,\n\n   # precision settings\n   "fp16": {\n     "fp16": true,\n     "enabled": true,\n     "loss_scale": 0,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. 
training settings\n   "train_iters": 320000,\n   "lr_decay_iters": 320000,\n   "distributed_backend": "nccl",\n   "lr_decay_style": "cosine",\n   "warmup": 0.01,\n   "checkpoint_factor": 10000,\n   "eval_interval": 1000,\n   "eval_iters": 10,\n\n   # logging\n   "log_interval": 100,\n   "steps_per_print": 10,\n   "keep_last_n_checkpoints": 4,\n   "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  "data_path": "data/processed_data/mydataset_text_document",\n  "tokenizer_type":"HFTokenizer",\n  # or for weighted datasets:\n  # "train-data-paths": ["data/enwik8/enwik8_text_document", "data/enwik8/enwik8_text_document"],\n  # "test-data-paths": ["data/enwik8/enwik8_text_document", "data/enwik8/enwik8_text_document"],\n  # "valid-data-paths": ["data/enwik8/enwik8_text_document", "data/enwik8/enwik8_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group.\n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab_file": "ckpts/20B_tokenizer.json",\n  "merge_file": "data/gpt2-merges.txt",\n\n  "save": "checkpoints",\n  "load": "checkpoints",\n  "checkpoint_validation_with_forward_pass": False,\n\n  "tensorboard_dir": "tensorboard",\n  "log_dir": "logs",\n  "use_wandb": False,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}\n'}updated
  data_impl ....................... mmap........................updated
  data_path ....................... data/processed_data/mydataset_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'fp16': True, 'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  global_num_gpus ................. 4...........................updated
  hidden_size ..................... 2560........................updated
  init_method ..................... small_init..................updated
  is_pipe_parallel ................ True........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints.................updated
  log_dir ......................... logs........................updated
  lr .............................. 0.00016.....................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... data/gpt2-merges.txt........updated
  min_lr .......................... 1.6e-05.....................updated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 32..........................updated
  num_layers ...................... 32..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.00016, 'betas': [0.9, 0.95], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  output_layer_init_method ........ wang_init...................updated
  partition_activations ........... True........................updated
  pipe_parallel_size .............. 2...........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints.................updated
  save_iters ...................... [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000]updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard.................updated
  text_gen_type ................... unconditional...............updated
  tokenizer_type .................. HFTokenizer.................updated
  train_batch_size ................ 8...........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 4...........................updated
  use_wandb ....................... False.......................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... ckpts/20B_tokenizer.json....updated
  wall_clock_breakdown ............ True........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 3, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 3...........................updated
  account ......................... None........................default
  activation ...................... gelu........................default
  activation_checkpointing ........ None........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_dropout ............... 0...........................default
  attention_softmax_in_fp32 ....... False.......................default
  autotuning ...................... None........................default
  autotuning_run .................. None........................default
  base_shapes_file ................ None........................default
  bf16 ............................ None........................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint ...................... None........................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_scale ................ linear......................default
  checkpoint_validation_with_forward_pass  False................default
  clip_grad ....................... 1.0.........................default
  comment ......................... None........................default
  comms_logger .................... None........................default
  communication_data_type ......... None........................default
  compression_training ............ None........................default
  contiguous_checkpointing ........ False.......................default
  coord_check ..................... False.......................default
  create_moe_param_group .......... True........................default
  csv_monitor ..................... None........................default
  curriculum_learning ............. None........................default
  curriculum_seqlen ............... 0...........................default
  data_efficiency ................. None........................default
  data_types ...................... None........................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_extra_args ............ None........................default
  deepspeed_mpi ................... False.......................default
  deepspeed_slurm ................. False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  elasticity ...................... None........................default
  enable_expert_tensor_parallelism  False.......................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  expert_interval ................. 2...........................default
  extra_save_iters ................ None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  force_multi ..................... False.......................default
  fp16_lm_cross_entropy ........... False.......................default
  fp32_allreduce .................. False.......................default
  git_hash ........................ None........................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gpt_j_tied ...................... False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_clipping ............... 1.0.........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hidden_dropout .................. 0...........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method_std ................. 0.02........................default
  intermediate_size ............... None........................default
  iteration ....................... None........................default
  label_data_paths ................ None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  layernorm_fusion ................ False.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_grad_pct_zeros .............. False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_interval .................... 100.........................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  mamba_causal_conv_fusion ........ False.......................default
  mamba_inner_func_fusion ......... False.......................default
  mamba_selective_fp32_params ..... True........................default
  mamba_selective_scan_fusion ..... False.......................default
  mamba_use_bias_in_conv .......... True........................default
  mamba_use_bias_in_linears ....... False.......................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  memory_profiling ................ False.......................default
  memory_profiling_path ........... None........................default
  min_scale ....................... 1.0.........................default
  mlp_type ........................ regular.....................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  moe_eval_capacity_factor ........ 1.0.........................default
  moe_expert_parallel_size ........ 1...........................default
  moe_glu ......................... False.......................default
  moe_jitter_eps .................. None........................default
  moe_lbl_in_fp32 ................. False.......................default
  moe_loss_coeff .................. 0.1.........................default
  moe_min_capacity ................ 4...........................default
  moe_num_experts ................. 1...........................default
  moe_token_dropping .............. False.......................default
  moe_top_k ....................... 1...........................default
  moe_train_capacity_factor ....... 1.0.........................default
  moe_type ........................ megablocks..................default
  moe_use_residual ................ True........................default
  mup_attn_temp ................... 1.0.........................default
  mup_embedding_mult .............. 1.0.........................default
  mup_init_scale .................. 1.0.........................default
  mup_output_temp ................. 1.0.........................default
  mup_rp_embedding_mult ........... 1.0.........................default
  mup_width_scale ................. 2...........................default
  no_load_optim ................... False.......................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  no_ssh_check .................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_kv_heads .................... None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 1...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  opt_pos_emb_offset .............. 0...........................default
  output_layer_parallelism ........ column......................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile ......................... False.......................default
  profile_backward ................ False.......................default
  profile_step_start .............. 10..........................default
  profile_step_stop ............... 12..........................default
  prompt_end ...................... \n..........................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  return_logits ................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rope_fusion ..................... False.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rotary_save_freqs_buffer ........ False.......................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  s3_chunk_size ................... 104857600...................default
  s3_path ......................... None........................default
  sample_input_file ............... None........................default
  sample_output_file .............. samples.txt.................default
  save_base_shapes ................ False.......................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  sliding_window_width ............ None........................default
  soft_prompt_tuning .............. None........................default
  sparse_attention ................ None........................default
  sparse_gradients ................ False.......................default
  split ........................... 969, 30, 1..................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  tensorboard ..................... None........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bias_in_attn_linear ......... True........................default
  use_bias_in_norms ............... True........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  use_mup ......................... False.......................default
  use_qk_layernorm ................ False.......................default
  use_shared_fs ................... True........................default
  use_tutel ....................... False.......................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb ........................... None........................default
  wandb_group ..................... None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_init_all_ranks ............ False.......................default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weight_decay .................... 0.1.........................default
  weighted_sampler_alpha .......... 1.0.........................default
  world_size ...................... None........................default
---------------- end of arguments ----------------
NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1 
[2024-07-11 06:29:02,427] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-11 06:29:02,427] [INFO] [runner.py:568:main] cmd = /usr/local/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed_config eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDAxNiwgImJldGFzIjogWzAuOSwgMC45NV0sICJlcHMiOiAxZS0wOH19LCAiZnAxNiI6IHsiZnAxNiI6IHRydWUsICJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMywgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ== --megatron_config {"train_batch_size": 8, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.00016, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "zero_optimization": {"stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "init_method": "small_init", "output_layer_init_method": "wang_init", "lr_decay_style": "cosine", "lr_decay_iters": 320000, "min_lr": 1.6e-05, "optimizer_type": "Adam", "zero_stage": 3, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "tokenizer_type": "HFTokenizer", "data_path": "data/processed_data/mydataset_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"2-7B.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe_parallel_size\": 2,\n   \"model_parallel_size\": 1,\n   #\"gradient_accumulation_steps\": 2,\n\n   # model settings\n   \"num_layers\": 32,\n   \"hidden_size\": 2560,\n   \"num_attention_heads\": 32,\n   \"seq_length\": 2048,\n   \"max_position_embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos_emb\": \"rotary\",\n   \"no_weight_tying\": true,\n   \"gpt_j_residual\": false,\n   \"output_layer_parallelism\": \"column\",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled_upper_triang_masked_softmax_fusion\": false,\n   \"bias_gelu_fusion\": false,\n   \"rope_fusion\": false,\n   \"layernorm_fusion\": false,\n\n   # init methods\n   \"init_method\": \"small_init\",\n   \"output_layer_init_method\": \"wang_init\",\n\n   # 
optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.00016,\n       \"betas\": [0.9, 0.95],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"min_lr\": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   \"zero_optimization\": {\n    \"stage\": 3,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data_impl\": \"mmap\",\n\n   # activation checkpointing\n   \"checkpoint_activations\": true,\n   \"checkpoint_num_layers\": 1,\n   \"partition_activations\": true,\n   \"synchronize_each_layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight_decay\": 0.1,\n   \"hidden_dropout\": 0,\n   \"attention_dropout\": 0,\n\n   # precision settings\n   \"fp16\": {\n     \"fp16\": true,\n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train_iters\": 320000,\n   \"lr_decay_iters\": 320000,\n   \"distributed_backend\": \"nccl\",\n   \"lr_decay_style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"checkpoint_factor\": 10000,\n   \"eval_interval\": 1000,\n   \"eval_iters\": 10,\n\n   # logging\n   \"log_interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep_last_n_checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data_path\": \"data/processed_data/mydataset_text_document\",\n  \"tokenizer_type\":\"HFTokenizer\",\n  # or for weighted datasets:\n  # \"train-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"test-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"valid-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group.\n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab_file\": \"ckpts/20B_tokenizer.json\",\n  \"merge_file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n\n  \"tensorboard_dir\": \"tensorboard\",\n  \"log_dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "checkpoint_factor": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "vocab_file": "ckpts/20B_tokenizer.json", "merge_file": "data/gpt2-merges.txt", "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "world_size": 1, "is_pipe_parallel": true, "use_wandb": false, "log_dir": "logs", "tensorboard_dir": "tensorboard", "text_gen_type": "unconditional", "local_rank": 
0, "rank": 0, "user_script": "train.py", "save_iters": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000], "global_num_gpus": 4}
[2024-07-11 06:29:04,147] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NCCL_VERSION=2.16.2-1
[2024-07-11 06:29:06,944] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-07-11 06:29:06,944] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-07-11 06:29:06,944] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-07-11 06:29:06,944] [INFO] [launch.py:164:main] dist_world_size=4
[2024-07-11 06:29:06,944] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-07-11 06:29:06,954] [INFO] [launch.py:256:main] process 2159 spawned with command: ['/usr/local/miniconda3/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed_config', <same base64 payload as the runner cmd above>, '--megatron_config', <same JSON payload as the runner cmd above>]
[2024-07-11 06:29:06,961] [INFO] [launch.py:256:main] process 2160 spawned with command: [..., '--local_rank=1', ...] (otherwise identical)
[2024-07-11 06:29:06,967] [INFO] [launch.py:256:main] process 2161 spawned with command: [..., '--local_rank=2', ...] (otherwise identical)
[2024-07-11 06:29:06,973] [INFO] [launch.py:256:main] process 2162 spawned with command: [..., '--local_rank=3', ...] (otherwise identical)
50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000], "global_num_gpus": 4}']
[2024-07-11 06:29:09,055] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 06:29:09,059] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 06:29:09,076] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 06:29:09,084] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[... the same warnings are printed once per rank; the other three copies are omitted ...]
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Unable to import Mamba kernels. Install them from our requirements/requirements-mamba.txt, or directly from https://github.com/state-spaces/mamba
For s3 checkpointing, please install boto3 either using requirements/requirements-s3.txt or https://github.com/boto/boto3
For s3 checkpointing, please install hf_transfer either using requirements/requirements-s3.txt or https://github.com/huggingface/hf_transfer
[2024-07-11 06:29:12,526] [INFO] [comm.py:637:init_distributed] cdb=None
[... the same messages repeat for each of the 4 ranks; duplicates omitted ...]
[2024-07-11 06:29:12,664] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1 
> building HFTokenizer tokenizer ...
 > padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
> setting tensorboard ...
torch distributed is already initialized, skipping initialization ...
> initializing model parallel with size 1
MPU DP: [0, 1]
MPU DP: [2, 3]
MPU PP: [0, 2]
MPU PP: [1, 3]
MPU IO: [0, 1, 2, 3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2024-07-11 06:29:13,784] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/hy-tmp/gpt-neox-main/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/hy-tmp/gpt-neox-main/megatron/data'
Traceback (most recent call last):
  File "train.py", line 35, in <module>
    main()
  File "train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/hy-tmp/gpt-neox-main/megatron/training.py", line 195, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/hy-tmp/gpt-neox-main/megatron/training.py", line 743, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/hy-tmp/gpt-neox-main/megatron/training.py", line 486, in get_model
    with deepspeed.zero.Init(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 933, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 796, in __init__
    self._configure_train_batch_size()
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 979, in _configure_train_batch_size
    self._batch_assertion()
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 925, in _batch_assertion
    assert (grad_acc > 0), f"Gradient accumulation steps: {grad_acc} has to be greater than 0"
AssertionError: Gradient accumulation steps: 0 has to be greater than 0
building GPT2 model ...
[... the identical traceback and AssertionError are raised on each of the other three ranks; duplicates omitted ...]

I am trying to enable pipeline (or model) parallelism, but this assertion error is preventing me from proceeding.
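If I am reading the assertion right, the numbers work out as follows. This is a minimal sketch of the batch-size identity DeepSpeed enforces (train_batch_size = micro_batch_per_gpu × grad_acc × world_size), not its actual code, and the idea that deepspeed.zero.Init divides by all 4 ranks rather than the 2 data-parallel ranks is only my guess at the cause:

# Hypothetical reproduction of the failing arithmetic; the values are taken
# from the generated DeepSpeed config in the log above.
train_batch_size = 8       # "train_batch_size": 8
micro_batch_per_gpu = 4    # "train_micro_batch_size_per_gpu": 4
world_size = 4             # all 4 ranks (assumed), not just the 2 data-parallel ranks

grad_acc = train_batch_size // (micro_batch_per_gpu * world_size)
print(grad_acc)  # 0 -> "Gradient accumulation steps: 0 has to be greater than 0"

NeoX computed train_batch_size = 4 × 2 = 8 for the two data-parallel ranks, so dividing by all four ranks rounds gradient accumulation down to zero.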

Here are some details about my setup:

Below is the content of my 2-7B.yml configuration file:

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe_parallel_size": 2,
   "model_parallel_size": 1,

   # model settings
   "num_layers": 32,
   "hidden_size": 2560,
   "num_attention_heads": 32,
   "seq_length": 2048,
   "max_position_embeddings": 2048,
   "norm": "layernorm",
   "pos_emb": "rotary",
   "no_weight_tying": true,
   "gpt_j_residual": false,
   "output_layer_parallelism": "column",

   # these should provide some speedup but take a while to build; set to true if desired
   "scaled_upper_triang_masked_softmax_fusion": false,
   "bias_gelu_fusion": false,
   "rope_fusion": false,
   "layernorm_fusion": false,

   # init methods
   "init_method": "small_init",
   "output_layer_init_method": "wang_init",

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.00016,
       "betas": [0.9, 0.95],
       "eps": 1.0e-8,
     }
   },
   "min_lr": 0.000016,

   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
   "zero_optimization": {
    "stage": 3,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 4,
   "data_impl": "mmap",

   # activation checkpointing
   "checkpoint_activations": true,
   "checkpoint_num_layers": 1,
   "partition_activations": true,
   "synchronize_each_layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight_decay": 0.1,
   "hidden_dropout": 0,
   "attention_dropout": 0,

   # precision settings
   "fp16": {
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train_iters": 320000,
   "lr_decay_iters": 320000,
   "distributed_backend": "nccl",
   "lr_decay_style": "cosine",
   "warmup": 0.01,
   "checkpoint_factor": 10000,
   "eval_interval": 1000,
   "eval_iters": 10,

   # logging
   "log_interval": 100,
   "steps_per_print": 10,
   "keep_last_n_checkpoints": 4,
   "wall_clock_breakdown": true,
}

I am using the following command to start the training:

python ./deepy.py train.py -d configs 2-7B.yml local_setup.yml

I would appreciate any guidance or suggestions on how to resolve this issue.

Thank you!

StellaAthena commented 4 months ago

What happens if you add "gradient_accumulation_steps": 16 to your config file?
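For example, it would slot in next to the other batch settings in your 2-7B.yml (16 is just a suggested value; with it, DeepSpeed's identity gives train_batch_size = 4 × 16 × data-parallel size):

   # batch / data settings
   "train_micro_batch_size_per_gpu": 4,
   "gradient_accumulation_steps": 16,
   "data_impl": "mmap",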

lieh1203 commented 4 months ago

Hi @StellaAthena!

Could this error be caused by my enabling both pipeline parallelism and ZeRO Stage 3 at the same time?

StellaAthena commented 4 months ago

> Hi @StellaAthena!
>
> Could this error be caused by my enabling both pipeline parallelism and ZeRO Stage 3 at the same time?

ZeRO-3 and PP > 1 should error, but I'm surprised it would error like this. Does it go away if you use ZeRO-1?
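For example, keeping the rest of the zero_optimization block exactly as posted and only lowering the stage (I haven't checked which of the bucket/overlap options still apply at stage 1):

   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
  },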