EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

ZeRO 2 cpu_offload causes RuntimeError: expected input to be on cuda #478

Closed: pwstegman closed this issue 2 years ago

pwstegman commented 2 years ago

Describe the bug

When starting with small.yml, then changing ZeRO to 2 and cpu_offload to true, I get the following error:

RuntimeError: expected input to be on cuda

Similar to https://github.com/EleutherAI/gpt-neox/issues/171.
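
For clarity, the "zero_optimization" block in small.yml after those two changes (the complete modified config is reproduced under Environment below):

   "zero_optimization": {
    "stage": 2,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": True
  },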

Full Traceback
NeoXArgs.from_ymls() ['configs/small.yml', 'configs/local_setup_small.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
  attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
  attention_dropout ............... 0.0.........................updated
  batch_size ...................... 4...........................updated
  checkpoint_activations .......... True........................updated
  clip_grad ....................... 1.0.........................updated
  config_files .................... {'small.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe-parallel-size": 1,\n   "model-parallel-size": 1,\n\n   # model settings\n   "num-layers": 12,\n   "hidden-size": 768,\n   "num-attention-heads": 12,\n   "seq-length": 2048,\n   "max-position-embeddings": 2048,\n   "norm": "layernorm",\n   "pos-emb": "rotary",\n   "no-weight-tying": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled-upper-triang-masked-softmax-fusion": false,\n   "bias-gelu-fusion": false,\n\n\n   # optimizer settings\n   "optimizer": {\n     "type": "Adam",\n     "params": {\n       "lr": 0.0006,\n       "betas": [0.9, 0.999],\n       "eps": 1.0e-8,\n     }\n   },\n   "zero_optimization": {\n    "stage": 2,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n    "cpu_offload": True\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 4,\n   "data-impl": "mmap",\n   "split": "949,50,1",\n\n   # activation checkpointing\n   "checkpoint-activations": true,\n   "checkpoint-num-layers": 1,\n   "partition-activations": true,\n   "synchronize-each-layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight-decay": 0.0,\n   "hidden-dropout": 0.0,\n   "attention-dropout": 0.0,\n\n   # precision settings\n   "fp16": { \n     "enabled": true,\n     "loss_scale": 0,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. training settings\n   "train-iters": 320000,\n   "lr-decay-iters": 320000,\n   "distributed-backend": "nccl",\n   "lr-decay-style": "cosine",\n   "warmup": 0.01,\n   "save-interval": 10000,\n   "eval-interval": 1000,\n   "eval-iters": 10,\n\n   # logging\n   "log-interval": 100,\n   "steps_per_print": 10,\n   "keep-last-n-checkpoints": 4,\n   "wall_clock_breakdown": true,\n}\n', 'local_setup_small.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  # "train-data-paths": ["/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document"],\n  "data-path": "data/enron/enron_text_document",\n  # "data-path": "/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document",\n  # or for weighted datasets: \n  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. 
\n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab-file": "/mnt/4TBNVME/data/gpt2-vocab.json",\n  "merge-file": "/mnt/4TBNVME/data/gpt2-merges.txt",\n\n  "save": "checkpoints_small",\n  "load": "checkpoints_small",\n  "checkpoint_validation_with_forward_pass": False,\n  \n  "tensorboard-dir": "tensorboard/run4",\n  "log-dir": "logs4",\n  "use_wandb": False,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}\n'}updated
  data_impl ....................... mmap........................updated
  data_path ....................... data/enron/enron_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  gas ............................. 1...........................updated
  global_num_gpus ................. 4...........................updated
  gradient_clipping ............... 1.0.........................updated
  hidden_dropout .................. 0.0.........................updated
  hidden_size ..................... 768.........................updated
  is_pipe_parallel ................ True........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints_small...........updated
  log_dir ......................... logs4.......................updated
  log_interval .................... 100.........................updated
  lr .............................. 0.0006......................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... /mnt/4TBNVME/data/gpt2-merges.txtupdated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 12..........................updated
  num_layers ...................... 12..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  partition_activations ........... True........................updated
  pipe_parallel_size .............. 1...........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints_small...........updated
  save_interval ................... 10000.......................updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  split ........................... 949,50,1....................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard/run4............updated
  train_batch_size ................ 16..........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 4...........................updated
  use_wandb ....................... False.......................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... /mnt/4TBNVME/data/gpt2-vocab.jsonupdated
  wall_clock_breakdown ............ True........................updated
  wandb_group ..................... YRCNBdcn62NVP5XPQgvHzC_3ml8m47dupdated
  weight_decay .................... 0.0.........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 2, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': True}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 2...........................updated
  activation ...................... gelu........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_softmax_in_fp32 ....... False.......................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_validation_with_forward_pass  False................default
  contiguous_checkpointing ........ False.......................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_mpi ................... False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  fp16_lm_cross_entropy ........... False.......................default
  fp32_allreduce .................. False.......................default
  git_hash ........................ f3a982d.....................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method ..................... normal......................default
  init_method_std ................. 0.02........................default
  iteration ....................... None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  min_lr .......................... 0.0.........................default
  min_scale ....................... 1.0.........................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  no_load_optim ................... False.......................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 0...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  output_layer_init_method ........ scaled_normal...............default
  output_layer_parallelism ........ row.........................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile_backward ................ False.......................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  sample_input_file ............... None........................default
  sample_output_file .............. None........................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  soft_prompt_tuning .............. None........................default
  sparse_gradients ................ False.......................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  text_gen_type ................... None........................default
  tokenizer_type .................. GPT2BPETokenizer............default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weighted_sampler_alpha .......... 0.3.........................default
  world_size ...................... None........................default
  zero_allow_untested_optimizer ... False.......................default
---------------- end of arguments ----------------
[2021-12-10 19:56:34,044] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-12-10 19:56:34,044] [INFO] [runner.py:366:main] cmd = /root/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints_small", "config_files": {"small.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe-parallel-size\": 1,\n   \"model-parallel-size\": 1,\n\n   # model settings\n   \"num-layers\": 12,\n   \"hidden-size\": 768,\n   \"num-attention-heads\": 12,\n   \"seq-length\": 2048,\n   \"max-position-embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos-emb\": \"rotary\",\n   \"no-weight-tying\": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled-upper-triang-masked-softmax-fusion\": false,\n   \"bias-gelu-fusion\": false,\n\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.0006,\n       \"betas\": [0.9, 0.999],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"zero_optimization\": {\n    \"stage\": 2,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n    \"cpu_offload\": True\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data-impl\": \"mmap\",\n   \"split\": \"949,50,1\",\n\n   # activation checkpointing\n   \"checkpoint-activations\": true,\n   \"checkpoint-num-layers\": 1,\n   \"partition-activations\": true,\n   
\"synchronize-each-layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight-decay\": 0.0,\n   \"hidden-dropout\": 0.0,\n   \"attention-dropout\": 0.0,\n\n   # precision settings\n   \"fp16\": { \n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train-iters\": 320000,\n   \"lr-decay-iters\": 320000,\n   \"distributed-backend\": \"nccl\",\n   \"lr-decay-style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"save-interval\": 10000,\n   \"eval-interval\": 1000,\n   \"eval-iters\": 10,\n\n   # logging\n   \"log-interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep-last-n-checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup_small.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  # \"train-data-paths\": [\"/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document\"],\n  \"data-path\": \"data/enron/enron_text_document\",\n  # \"data-path\": \"/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document\",\n  # or for weighted datasets: \n  # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab-file\": \"/mnt/4TBNVME/data/gpt2-vocab.json\",\n  \"merge-file\": \"/mnt/4TBNVME/data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints_small\",\n  \"load\": \"checkpoints_small\",\n  \"checkpoint_validation_with_forward_pass\": False,\n  \n  \"tensorboard-dir\": \"tensorboard/run4\",\n  \"log-dir\": \"logs4\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints_small", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "/mnt/4TBNVME/data/gpt2-vocab.json", "merge_file": "/mnt/4TBNVME/data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "YRCNBdcn62NVP5XPQgvHzC_3ml8m47d", "log_dir": "logs4", "tensorboard_dir": "tensorboard/run4", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}
[2021-12-10 19:56:34,871] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE libnccl-dev=2.8.4-1+cuda11.1
[2021-12-10 19:56:34,871] [INFO] [launch.py:75:main] 0 NCCL_VERSION 2.8.4-1
[2021-12-10 19:56:34,871] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_VERSION 2.8.4-1
[2021-12-10 19:56:34,871] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE libnccl2=2.8.4-1+cuda11.1
[2021-12-10 19:56:34,871] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME libnccl-dev
[2021-12-10 19:56:34,871] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_NAME libnccl2
[2021-12-10 19:56:34,871] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION 2.8.4-1
[2021-12-10 19:56:34,871] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-12-10 19:56:34,871] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-12-10 19:56:34,871] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-12-10 19:56:34,872] [INFO] [launch.py:104:main] dist_world_size=4
[2021-12-10 19:56:34,872] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> setting tensorboard ...
> initializing torch distributed ...
[2021-12-10 19:56:37,852] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-10 19:56:37,873] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-10 19:56:37,905] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-10 19:56:38,017] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 1
MPU DP: [0, 1, 2, 3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2021-12-10 19:56:38,112] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/mnt/4TBNVME/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/mnt/4TBNVME/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3}
[2021-12-10 19:56:41,471] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
[2021-12-10 19:56:41,497] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
> learning rate decay style: cosine
DeepSpeed is enabled.
[2021-12-10 19:56:41,502] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-10 19:56:41,502] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
[2021-12-10 19:56:41,502] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-10 19:56:41,506] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
Using /root/.cache/torch_extensions as PyTorch extensions root...
[2021-12-10 19:56:42,718] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-10 19:56:42,718] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-10 19:56:42,718] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'apex.optimizers.fused_adam.FusedAdam'>
[2021-12-10 19:56:42,718] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-12-10 19:56:42,718] [INFO] [stage2.py:102:__init__] Reduce bucket size 500000000
[2021-12-10 19:56:42,718] [INFO] [stage2.py:103:__init__] Allgather bucket size 500000000
[2021-12-10 19:56:42,718] [INFO] [stage2.py:104:__init__] CPU Offload: True
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.7915315628051758 seconds
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.8036620616912842 seconds
Time to load utils op: 0.8034920692443848 seconds
Time to load utils op: 0.8034873008728027 seconds
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/root/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/root/venv/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 160, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/root/venv/lib/python3.8/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/root/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/root/venv/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 160, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/root/venv/lib/python3.8/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/root/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/root/venv/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 160, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/root/venv/lib/python3.8/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/mnt/4TBNVME/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/root/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/root/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/root/venv/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 160, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/root/venv/lib/python3.8/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Killing subprocess 3126
Killing subprocess 3127
Killing subprocess 3128
Killing subprocess 3130
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in 
    main()
  File "/root/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/root/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/venv/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true}', '--megatron_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints_small", "config_files": {"small.yml": "# GPT-2 pretraining setup\\n{\\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\\n   # across the node boundaries )\\n   \\"pipe-parallel-size\\": 1,\\n   \\"model-parallel-size\\": 1,\\n\\n   # model settings\\n   \\"num-layers\\": 12,\\n   \\"hidden-size\\": 768,\\n   \\"num-attention-heads\\": 12,\\n   \\"seq-length\\": 2048,\\n   \\"max-position-embeddings\\": 2048,\\n   \\"norm\\": \\"layernorm\\",\\n   \\"pos-emb\\": \\"rotary\\",\\n   \\"no-weight-tying\\": true,\\n\\n   # these should provide some speedup but takes a while to build, set to true if desired\\n   \\"scaled-upper-triang-masked-softmax-fusion\\": false,\\n   \\"bias-gelu-fusion\\": false,\\n\\n\\n   # optimizer settings\\n   \\"optimizer\\": {\\n     \\"type\\": \\"Adam\\",\\n     \\"params\\": {\\n       \\"lr\\": 0.0006,\\n       \\"betas\\": [0.9, 0.999],\\n       \\"eps\\": 1.0e-8,\\n     }\\n   },\\n   \\"zero_optimization\\": {\\n    \\"stage\\": 2,\\n    \\"allgather_partitions\\": True,\\n    \\"allgather_bucket_size\\": 500000000,\\n    \\"overlap_comm\\": True,\\n    \\"reduce_scatter\\": True,\\n    \\"reduce_bucket_size\\": 500000000,\\n    \\"contiguous_gradients\\": True,\\n    \\"cpu_offload\\": True\\n  },\\n\\n   # batch / data settings\\n   \\"train_micro_batch_size_per_gpu\\": 4,\\n   \\"data-impl\\": \\"mmap\\",\\n   \\"split\\": \\"949,50,1\\",\\n\\n   # activation checkpointing\\n   \\"checkpoint-activations\\": true,\\n   \\"checkpoint-num-layers\\": 1,\\n   
\\"partition-activations\\": true,\\n   \\"synchronize-each-layer\\": true,\\n\\n   # regularization\\n   \\"gradient_clipping\\": 1.0,\\n   \\"weight-decay\\": 0.0,\\n   \\"hidden-dropout\\": 0.0,\\n   \\"attention-dropout\\": 0.0,\\n\\n   # precision settings\\n   \\"fp16\\": { \\n     \\"enabled\\": true,\\n     \\"loss_scale\\": 0,\\n     \\"loss_scale_window\\": 1000,\\n     \\"hysteresis\\": 2,\\n     \\"min_loss_scale\\": 1\\n   },\\n\\n   # misc. training settings\\n   \\"train-iters\\": 320000,\\n   \\"lr-decay-iters\\": 320000,\\n   \\"distributed-backend\\": \\"nccl\\",\\n   \\"lr-decay-style\\": \\"cosine\\",\\n   \\"warmup\\": 0.01,\\n   \\"save-interval\\": 10000,\\n   \\"eval-interval\\": 1000,\\n   \\"eval-iters\\": 10,\\n\\n   # logging\\n   \\"log-interval\\": 100,\\n   \\"steps_per_print\\": 10,\\n   \\"keep-last-n-checkpoints\\": 4,\\n   \\"wall_clock_breakdown\\": true,\\n}\\n", "local_setup_small.yml": "# Suggested data paths when using GPT-NeoX locally\\n{\\n  # \\"train-data-paths\\": [\\"/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document\\"],\\n  \\"data-path\\": \\"data/enron/enron_text_document\\",\\n  # \\"data-path\\": \\"/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document\\",\\n  # or for weighted datasets: \\n  # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"train-data-weights\\": [1., 2.],\\n  # \\"test-data-weights\\": [2., 1.],\\n  # \\"valid-data-weights\\": [0.5, 0.4],\\n\\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \\n  # WARNING: setting this to True will override any user provided weights\\n  # \\"weight_by_num_documents\\": false,\\n  # \\"weighted_sampler_alpha\\": 0.3,\\n\\n  \\"vocab-file\\": \\"/mnt/4TBNVME/data/gpt2-vocab.json\\",\\n  \\"merge-file\\": \\"/mnt/4TBNVME/data/gpt2-merges.txt\\",\\n\\n  \\"save\\": \\"checkpoints_small\\",\\n  \\"load\\": \\"checkpoints_small\\",\\n  \\"checkpoint_validation_with_forward_pass\\": False,\\n  \\n  \\"tensorboard-dir\\": \\"tensorboard/run4\\",\\n  \\"log-dir\\": \\"logs4\\",\\n  \\"use_wandb\\": False,\\n  \\"wandb_host\\": \\"https://api.wandb.ai\\",\\n  \\"wandb_project\\": \\"neox\\"\\n}\\n"}, "load": "checkpoints_small", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "/mnt/4TBNVME/data/gpt2-vocab.json", "merge_file": "/mnt/4TBNVME/data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "YRCNBdcn62NVP5XPQgvHzC_3ml8m47d", "log_dir": "logs4", "tensorboard_dir": "tensorboard/run4", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}']' returned non-zero exit status 1.

To Reproduce

On a host with NVIDIA driver 460.91.03, I started a new Docker container from the nvidia/cuda:11.1.1-devel-ubuntu18.04 image. Inside it, I created a new Python 3.8.0 virtual env, installed torch 1.8.2+cu111 and the latest NVIDIA apex, and then installed the requirements from each file in the GPT-NeoX requirements directory.

I changed two lines in the "zero_optimization" section of the small.yml config:

"stage": 2
"cpu_offload": True

Then I ran:

python deepy.py train.py --conf_dir configs small.yml local_setup_small.yml

This caused the mentioned error.

Expected behavior

Training should start normally, with the optimizer state offloaded to the CPU.

Proposed solution

No proposed solutions yet. Are there any known working cpu_offload configs? I can start there to try and debug this.
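
One possible lead from the log above: DeepSpeed warns that "cpu_offload is deprecated. Please use offload_optimizer.", so the newer offload_optimizer form may be worth trying. A sketch of what that block might look like; the "device" and "pin_memory" sub-keys are my assumption based on upstream DeepSpeed's ZeRO-Offload config, and I have not verified that this DeeperSpeed version (0.3.15+eb7f5cf) accepts them for stage 2:

   "zero_optimization": {
    "stage": 2,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    # assumed replacement for "cpu_offload": True
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": True
    }
  },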

Screenshots

None

Environment (please complete the following information):

small.yml (modified)

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 1,

   # model settings
   "num-layers": 12,
   "hidden-size": 768,
   "num-attention-heads": 12,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": false,
   "bias-gelu-fusion": false,

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0006,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },
   "zero_optimization": {
    "stage": 2,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": True
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 4,
   "data-impl": "mmap",
   "split": "949,50,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.0,
   "hidden-dropout": 0.0,
   "attention-dropout": 0.0,

   # precision settings
   "fp16": {
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 320000,
   "lr-decay-iters": 320000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "save-interval": 10000,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 4,
   "wall_clock_breakdown": true,
}

local_setup_small.yml

# Suggested data paths when using GPT-NeoX locally
{
  "data-path": "data/enron/enron_text_document",

  "vocab-file": "/mnt/4TBNVME/data/gpt2-vocab.json",
  "merge-file": "/mnt/4TBNVME/data/gpt2-merges.txt",

  "save": "checkpoints_small",
  "load": "checkpoints_small",
  "checkpoint_validation_with_forward_pass": False,

  "tensorboard-dir": "tensorboard/run4",
  "log-dir": "logs4",
  "use_wandb": False,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox"
}

Additional context

None

EricHallahan commented 2 years ago

I created a new Python 3.8.0 virtual env, installed torch 1.8.2+cu111, installed the latest NVIDIA apex, then installed all the requirements in each file within the requirements directory of GPT-NeoX.

I notice that you do not explicitly mention DeepSpeed or DeeperSpeed. Since GPT-NeoX relies on DeeperSpeed for ZeRO and the rest of its distributed-training features, it is an important piece of the puzzle. DeeperSpeed should be installed from ./requirements/requirements.txt, so I would verify that the install completed correctly, and also test ZeRO 2 on some other code so we can be sure this is a GPT-NeoX problem rather than a DeeperSpeed problem.
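
For that standalone check, something along these lines is what I have in mind. This is a minimal sketch only: it assumes DeeperSpeed exposes upstream DeepSpeed's deepspeed.initialize()/config_params API and is launched with the deepspeed launcher; the file name, model, and config values are placeholders rather than anything taken from gpt-neox.

# zero2_offload_check.py -- tiny ZeRO 2 + cpu_offload smoke test outside gpt-neox.
# Launch with:  deepspeed zero2_offload_check.py
import argparse

import torch
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # filled in by the deepspeed launcher
args = parser.parse_args()

ds_config = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 0.0006}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2, "cpu_offload": True},
}

# Tiny stand-in model; any torch.nn.Module is enough for this check.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

# The reported error fires inside initialize() while ZeRO sets up optimizer states,
# so a healthy install should at least get past this call.
engine, _, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)

# A few dummy steps to exercise backward/step with the offloaded optimizer.
for _ in range(3):
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()

print("ZeRO 2 cpu_offload smoke test completed")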

Starting on a host with NVIDIA driver 460.91.03, I ran a new docker container using image nvidia/cuda:11.1.1-devel-ubuntu18.04.

I would strongly suggest using our prebuilt images or building your own image from the included Dockerfile—all of the critical dependencies will be preinstalled and ready to use. It is a lot easier to use our verified environment than to install the dependencies from scratch yourself.

pwstegman commented 2 years ago

Thanks for the reply! I installed directly from the requirements.txt files, so DeeperSpeed was installed. I've trained using ZeRO 1 without issue; the error only occurs once I've made the following two config changes to small.yml:

"stage": 2 "cpu_offload": True

I gave the Docker image referenced in the README a try and hit the same issue:

git clone https://github.com/EleutherAI/gpt-neox.git
cd gpt-neox
nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD,dst=/gpt-neox leogao2/gpt-neox:sha-63e10af
cd /gpt-neox
python prepare_data.py enron
# Confirm default config works
python deepy.py train.py -d configs/ small.yml local_setup.yml
# Got error saying to run the below command (sudo -s added by me to correct permission denied errors)
sudo -s
python /gpt-neox/megatron/fused_kernels/setup.py install
# Rerun the original command
python deepy.py train.py -d configs/ small.yml local_setup.yml
# Training ran without issue. CTRL+Ced out after 60 steps
# Enable ZeRO 2 cpu_offload
cp configs/small.yml configs/small_offload.yml
sed -i 's/"stage": 0/"stage": 2/g' configs/small_offload.yml
sed -i 's/"cpu_offload": False/"cpu_offload": True/g' configs/small_offload.yml
# Run with CPU offload
python deepy.py train.py -d configs/ small_offload.yml local_setup.yml
# The above command throws "RuntimeError: expected input to be on cuda"
Full Traceback
NeoXArgs.from_ymls() ['configs/small_offload.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
  attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
  attention_dropout ............... 0.0.........................updated
  batch_size ...................... 4...........................updated
  checkpoint_activations .......... True........................updated
  clip_grad ....................... 1.0.........................updated
  config_files .................... {'small_offload.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe-parallel-size": 1,\n   "model-parallel-size": 1,\n\n   # model settings\n   "num-layers": 12,\n   "hidden-size": 768,\n   "num-attention-heads": 12,\n   "seq-length": 2048,\n   "max-position-embeddings": 2048,\n   "norm": "layernorm",\n   "pos-emb": "rotary",\n   "no-weight-tying": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled-upper-triang-masked-softmax-fusion": false,\n   "bias-gelu-fusion": false,\n\n\n   # optimizer settings\n   "optimizer": {\n     "type": "Adam",\n     "params": {\n       "lr": 0.0006,\n       "betas": [0.9, 0.999],\n       "eps": 1.0e-8,\n     }\n   },\n   "zero_optimization": {\n    "stage": 2,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n    "cpu_offload": True\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 4,\n   "data-impl": "mmap",\n   "split": "949,50,1",\n\n   # activation checkpointing\n   "checkpoint-activations": true,\n   "checkpoint-num-layers": 1,\n   "partition-activations": true,\n   "synchronize-each-layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight-decay": 0.0,\n   "hidden-dropout": 0.0,\n   "attention-dropout": 0.0,\n\n   # precision settings\n   "fp16": { \n     "enabled": true,\n     "loss_scale": 0,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. training settings\n   "train-iters": 320000,\n   "lr-decay-iters": 320000,\n   "distributed-backend": "nccl",\n   "lr-decay-style": "cosine",\n   "warmup": 0.01,\n   "save-interval": 10000,\n   "eval-interval": 1000,\n   "eval-iters": 10,\n\n   # logging\n   "log-interval": 100,\n   "steps_per_print": 10,\n   "keep-last-n-checkpoints": 4,\n   "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  "data-path": "data/enron/enron_text_document",\n  \n  # or for weighted datasets: \n  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab-file": "data/gpt2-vocab.json",\n  "merge-file": "data/gpt2-merges.txt",\n\n  "save": "checkpoints",\n  "load": "checkpoints",\n  "checkpoint_validation_with_forward_pass": False,\n  \n  "tensorboard-dir": "tensorboard",\n  "log-dir": "logs",\n  "use_wandb": True,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}'}updated
  data_impl ....................... mmap........................updated
  data_path ....................... data/enron/enron_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  gas ............................. 1...........................updated
  global_num_gpus ................. 4...........................updated
  gradient_clipping ............... 1.0.........................updated
  hidden_dropout .................. 0.0.........................updated
  hidden_size ..................... 768.........................updated
  is_pipe_parallel ................ True........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints.................updated
  log_dir ......................... logs........................updated
  log_interval .................... 100.........................updated
  lr .............................. 0.0006......................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... data/gpt2-merges.txt........updated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 12..........................updated
  num_layers ...................... 12..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  partition_activations ........... True........................updated
  pipe_parallel_size .............. 1...........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints.................updated
  save_interval ................... 10000.......................updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  split ........................... 949,50,1....................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard.................updated
  train_batch_size ................ 16..........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 4...........................updated
  use_wandb ....................... True........................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... data/gpt2-vocab.json........updated
  wall_clock_breakdown ............ True........................updated
  wandb_group ..................... QZUNHRmPyM5gy9C2ReHsUj_1fcswh3eupdated
  weight_decay .................... 0.0.........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 2, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': True}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 2...........................updated
  activation ...................... gelu........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_softmax_in_fp32 ....... False.......................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_validation_with_forward_pass  False................default
  contiguous_checkpointing ........ False.......................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_mpi ................... False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  fp16_lm_cross_entropy ........... False.......................default
  fp32_allreduce .................. False.......................default
  git_hash ........................ 49e60fe.....................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method ..................... normal......................default
  init_method_std ................. 0.02........................default
  iteration ....................... None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  min_lr .......................... 0.0.........................default
  min_scale ....................... 1.0.........................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  no_load_optim ................... False.......................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 0...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  output_layer_init_method ........ scaled_normal...............default
  output_layer_parallelism ........ row.........................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile_backward ................ False.......................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  sample_input_file ............... None........................default
  sample_output_file .............. None........................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  soft_prompt_tuning .............. None........................default
  sparse_gradients ................ False.......................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  text_gen_type ................... None........................default
  tokenizer_type .................. GPT2BPETokenizer............default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weighted_sampler_alpha .......... 0.3.........................default
  world_size ...................... None........................default
  zero_allow_untested_optimizer ... False.......................default
---------------- end of arguments ----------------
[2021-12-11 00:42:41,060] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-12-11 00:42:41,060] [INFO] [runner.py:366:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe-parallel-size\": 1,\n   \"model-parallel-size\": 1,\n\n   # model settings\n   \"num-layers\": 12,\n   \"hidden-size\": 768,\n   \"num-attention-heads\": 12,\n   \"seq-length\": 2048,\n   \"max-position-embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos-emb\": \"rotary\",\n   \"no-weight-tying\": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled-upper-triang-masked-softmax-fusion\": false,\n   \"bias-gelu-fusion\": false,\n\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.0006,\n       \"betas\": [0.9, 0.999],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"zero_optimization\": {\n    \"stage\": 2,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n    \"cpu_offload\": True\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data-impl\": \"mmap\",\n   \"split\": \"949,50,1\",\n\n   # activation checkpointing\n   \"checkpoint-activations\": true,\n   \"checkpoint-num-layers\": 1,\n   \"partition-activations\": true,\n   
\"synchronize-each-layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight-decay\": 0.0,\n   \"hidden-dropout\": 0.0,\n   \"attention-dropout\": 0.0,\n\n   # precision settings\n   \"fp16\": { \n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train-iters\": 320000,\n   \"lr-decay-iters\": 320000,\n   \"distributed-backend\": \"nccl\",\n   \"lr-decay-style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"save-interval\": 10000,\n   \"eval-interval\": 1000,\n   \"eval-iters\": 10,\n\n   # logging\n   \"log-interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep-last-n-checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data-path\": \"data/enron/enron_text_document\",\n  \n  # or for weighted datasets: \n  # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab-file\": \"data/gpt2-vocab.json\",\n  \"merge-file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n  \n  \"tensorboard-dir\": \"tensorboard\",\n  \"log-dir\": \"logs\",\n  \"use_wandb\": True,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": true, "wandb_group": "QZUNHRmPyM5gy9C2ReHsUj_1fcswh3e", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}
[2021-12-11 00:42:42,055] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-12-11 00:42:42,055] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-12-11 00:42:42,055] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-12-11 00:42:42,055] [INFO] [launch.py:104:main] dist_world_size=4
[2021-12-11 00:42:42,055] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1
> building GPT2BPETokenizer tokenizer ...
[2021-12-11 00:42:45,229] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-11 00:42:45,233] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-11 00:42:45,240] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later and do you have tensorboard installed?), no TensorBoard logs will be written.
> initializing torch distributed ...
[2021-12-11 00:42:45,284] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 1
MPU DP: [0, 1, 2, 3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2021-12-11 00:42:46,317] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3}
[2021-12-11 00:42:50,771] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
[2021-12-11 00:42:50,808] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
> learning rate decay style: cosine
DeepSpeed is enabled.
[2021-12-11 00:42:50,810] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
[2021-12-11 00:42:50,811] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-11 00:42:50,813] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-11 00:42:50,814] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-11 00:42:52,185] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-11 00:42:52,186] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-11 00:42:52,186] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'apex.optimizers.fused_adam.FusedAdam'>
[2021-12-11 00:42:52,186] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-12-11 00:42:52,186] [INFO] [stage2.py:102:__init__] Reduce bucket size 500000000
[2021-12-11 00:42:52,186] [INFO] [stage2.py:103:__init__] Allgather bucket size 500000000
[2021-12-11 00:42:52,186] [INFO] [stage2.py:104:__init__] CPU Offload: True
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.35228919982910156 seconds
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.402698278427124 seconds
Time to load utils op: 0.40264415740966797 seconds
Time to load utils op: 0.40321850776672363 seconds
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
Killing subprocess 4252
Killing subprocess 4253
Killing subprocess 4254
Killing subprocess 4255
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in 
    main()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true}', '--megatron_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\\n{\\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\\n   # across the node boundaries )\\n   \\"pipe-parallel-size\\": 1,\\n   \\"model-parallel-size\\": 1,\\n\\n   # model settings\\n   \\"num-layers\\": 12,\\n   \\"hidden-size\\": 768,\\n   \\"num-attention-heads\\": 12,\\n   \\"seq-length\\": 2048,\\n   \\"max-position-embeddings\\": 2048,\\n   \\"norm\\": \\"layernorm\\",\\n   \\"pos-emb\\": \\"rotary\\",\\n   \\"no-weight-tying\\": true,\\n\\n   # these should provide some speedup but takes a while to build, set to true if desired\\n   \\"scaled-upper-triang-masked-softmax-fusion\\": false,\\n   \\"bias-gelu-fusion\\": false,\\n\\n\\n   # optimizer settings\\n   \\"optimizer\\": {\\n     \\"type\\": \\"Adam\\",\\n     \\"params\\": {\\n       \\"lr\\": 0.0006,\\n       \\"betas\\": [0.9, 0.999],\\n       \\"eps\\": 1.0e-8,\\n     }\\n   },\\n   \\"zero_optimization\\": {\\n    \\"stage\\": 2,\\n    \\"allgather_partitions\\": True,\\n    \\"allgather_bucket_size\\": 500000000,\\n    \\"overlap_comm\\": True,\\n    \\"reduce_scatter\\": True,\\n    \\"reduce_bucket_size\\": 500000000,\\n    \\"contiguous_gradients\\": True,\\n    \\"cpu_offload\\": True\\n  },\\n\\n   # batch / data settings\\n   \\"train_micro_batch_size_per_gpu\\": 4,\\n   \\"data-impl\\": \\"mmap\\",\\n   \\"split\\": \\"949,50,1\\",\\n\\n   # activation checkpointing\\n   \\"checkpoint-activations\\": true,\\n   \\"checkpoint-num-layers\\": 1,\\n   
\\"partition-activations\\": true,\\n   \\"synchronize-each-layer\\": true,\\n\\n   # regularization\\n   \\"gradient_clipping\\": 1.0,\\n   \\"weight-decay\\": 0.0,\\n   \\"hidden-dropout\\": 0.0,\\n   \\"attention-dropout\\": 0.0,\\n\\n   # precision settings\\n   \\"fp16\\": { \\n     \\"enabled\\": true,\\n     \\"loss_scale\\": 0,\\n     \\"loss_scale_window\\": 1000,\\n     \\"hysteresis\\": 2,\\n     \\"min_loss_scale\\": 1\\n   },\\n\\n   # misc. training settings\\n   \\"train-iters\\": 320000,\\n   \\"lr-decay-iters\\": 320000,\\n   \\"distributed-backend\\": \\"nccl\\",\\n   \\"lr-decay-style\\": \\"cosine\\",\\n   \\"warmup\\": 0.01,\\n   \\"save-interval\\": 10000,\\n   \\"eval-interval\\": 1000,\\n   \\"eval-iters\\": 10,\\n\\n   # logging\\n   \\"log-interval\\": 100,\\n   \\"steps_per_print\\": 10,\\n   \\"keep-last-n-checkpoints\\": 4,\\n   \\"wall_clock_breakdown\\": true,\\n}\\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\\n{\\n  \\"data-path\\": \\"data/enron/enron_text_document\\",\\n  \\n  # or for weighted datasets: \\n  # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"train-data-weights\\": [1., 2.],\\n  # \\"test-data-weights\\": [2., 1.],\\n  # \\"valid-data-weights\\": [0.5, 0.4],\\n\\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \\n  # WARNING: setting this to True will override any user provided weights\\n  # \\"weight_by_num_documents\\": false,\\n  # \\"weighted_sampler_alpha\\": 0.3,\\n\\n  \\"vocab-file\\": \\"data/gpt2-vocab.json\\",\\n  \\"merge-file\\": \\"data/gpt2-merges.txt\\",\\n\\n  \\"save\\": \\"checkpoints\\",\\n  \\"load\\": \\"checkpoints\\",\\n  \\"checkpoint_validation_with_forward_pass\\": False,\\n  \\n  \\"tensorboard-dir\\": \\"tensorboard\\",\\n  \\"log-dir\\": \\"logs\\",\\n  \\"use_wandb\\": True,\\n  \\"wandb_host\\": \\"https://api.wandb.ai\\",\\n  \\"wandb_project\\": \\"neox\\"\\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": true, "wandb_group": "QZUNHRmPyM5gy9C2ReHsUj_1fcswh3e", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}']' returned non-zero exit status 1.

I don't have time right now, but in the next couple of weeks I might try building a simple feed-forward network and wrapping it in DeeperSpeed ZeRO 2 with CPU offload, to isolate the issue to either GPT-NeoX or DeeperSpeed.
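
Roughly what I have in mind, as an untested sketch (the script name, layer sizes, and the exact deepspeed.initialize() keyword for passing the config dict are my own placeholders and may need adjusting for the DeeperSpeed version in use):

# repro.py -- hypothetical standalone test: a tiny feed-forward model wrapped
# directly in DeepSpeed ZeRO stage 2 with cpu_offload, no GPT-NeoX involved.
# Launch with e.g.: deepspeed --num_gpus=1 repro.py
import torch
import deepspeed

ds_config = {
    # Same ZeRO/fp16 settings as small_offload.yml, minus everything NeoX-specific.
    "train_batch_size": 4,
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 0.0006}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2, "cpu_offload": True},
}

# Tiny stand-in model; the sizes are arbitrary.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Older DeepSpeed releases take config_params=, newer ones take config=.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)

for _ in range(10):
    # fp16 is enabled, so feed half-precision inputs on the engine's device.
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()

If that standalone script hits the same "expected input to be on cuda" error, the bug would be on the DeeperSpeed side; if it trains, that would point back at GPT-NeoX (the log above shows NeoX handing DeepSpeed a client FusedAdam, which is exactly where the traceback fails).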

Has cpu_offload been confirmed to work in the past with GPT-NeoX, and do you know what the known working configs are? If I know there was a config that worked in the past, it'll serve as a good starting point for debugging.

StellaAthena commented 2 years ago

@pwstegman What happens if you set pipe-parallel-size=0? With PP = 1, the model is supposed to initialize as a pipeline module and then be converted to a non-pipeline model, but I vaguely recall a bug where that conversion didn't happen.

Pipeline parallelism and ZeRO 2 are incompatible, so if the model is still a one-stage pipeline module, that could be the cause of the problem.

The printout says it’s a pipeline module, but off the top of my head I’m not 100% sure that we put the print statement in the right place relative to the casting to non-pipeline :P
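
One way to check would be a rough diagnostic like the following (assuming it's dropped into setup_model_and_optimizer() in megatron/training.py, right before the deepspeed.initialize() call shown in the traceback):

# Hypothetical one-off debug print; not something that exists in the repo.
from deepspeed.runtime.pipe.module import PipelineModule
print(f"model type: {type(model).__name__}, "
      f"still a PipelineModule: {isinstance(model, PipelineModule)}")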

StellaAthena commented 2 years ago

Also, just to confirm: are the configs you're describing a minimal example or a test to make sure things work right? This is a very poor way to actually train a 125M model on your hardware. You should not need CPU offload unless you're trying to train a model with at least 40B parameters.

mallorbc commented 2 years ago

So pipeline parallelism and ZeRO 2/3 are not compatible? How would one train a large model, say 20B parameters, without pipeline parallelism while using CPU offload, even if just for a single step? Using ZeRO 2/3 I have still encountered OOM issues on a 3090 with smaller models (though larger ones than I could fit without those higher ZeRO stages), not on this repo per se, but in other areas such as finetuning GPT-J.

I have finetuned GPT-J with CPU offloading on a 3090, but trying to train a model with the 6B config here without offloading leads to OOM issues.

I have encountered the same error this issue highlights, "RuntimeError: expected input to be on cuda", when trying to enable CPU offloading.

With pipeline parallelism left at 1, my RAM usage increases as if offloading were working properly, right up until that error occurs. After changing it to 0, RAM does not increase as expected and I quickly get an OOM error.

pwstegman commented 2 years ago

@StellaAthena Just tried setting pipeline parallelism to 0 (model parallel was 1) and got the same error.

Traceback

NeoXArgs.from_ymls() ['configs/small_offload.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
  attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
  attention_dropout ............... 0.0.........................updated
  batch_size ...................... 4...........................updated
  checkpoint_activations .......... True........................updated
  clip_grad ....................... 1.0.........................updated
  config_files .................... {'small_offload.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe-parallel-size": 0,\n   "model-parallel-size": 1,\n\n   # model settings\n   "num-layers": 12,\n   "hidden-size": 768,\n   "num-attention-heads": 12,\n   "seq-length": 2048,\n   "max-position-embeddings": 2048,\n   "norm": "layernorm",\n   "pos-emb": "rotary",\n   "no-weight-tying": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled-upper-triang-masked-softmax-fusion": false,\n   "bias-gelu-fusion": false,\n\n\n   # optimizer settings\n   "optimizer": {\n     "type": "Adam",\n     "params": {\n       "lr": 0.0006,\n       "betas": [0.9, 0.999],\n       "eps": 1.0e-8,\n     }\n   },\n   "zero_optimization": {\n    "stage": 2,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n    "cpu_offload": True\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 4,\n   "data-impl": "mmap",\n   "split": "949,50,1",\n\n   # activation checkpointing\n   "checkpoint-activations": true,\n   "checkpoint-num-layers": 1,\n   "partition-activations": true,\n   "synchronize-each-layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight-decay": 0.0,\n   "hidden-dropout": 0.0,\n   "attention-dropout": 0.0,\n\n   # precision settings\n   "fp16": { \n     "enabled": true,\n     "loss_scale": 0,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. training settings\n   "train-iters": 320000,\n   "lr-decay-iters": 320000,\n   "distributed-backend": "nccl",\n   "lr-decay-style": "cosine",\n   "warmup": 0.01,\n   "save-interval": 10000,\n   "eval-interval": 1000,\n   "eval-iters": 10,\n\n   # logging\n   "log-interval": 100,\n   "steps_per_print": 10,\n   "keep-last-n-checkpoints": 4,\n   "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  "data-path": "data/enron/enron_text_document",\n  \n  # or for weighted datasets: \n  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab-file": "data/gpt2-vocab.json",\n  "merge-file": "data/gpt2-merges.txt",\n\n  "save": "checkpoints",\n  "load": "checkpoints",\n  "checkpoint_validation_with_forward_pass": False,\n  \n  "tensorboard-dir": "tensorboard",\n  "log-dir": "logs",\n  "use_wandb": True,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}'}updated
  data_impl ....................... mmap........................updated
  data_path ....................... data/enron/enron_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  gas ............................. 1...........................updated
  global_num_gpus ................. 4...........................updated
  gradient_clipping ............... 1.0.........................updated
  hidden_dropout .................. 0.0.........................updated
  hidden_size ..................... 768.........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints.................updated
  log_dir ......................... logs........................updated
  log_interval .................... 100.........................updated
  lr .............................. 0.0006......................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... data/gpt2-merges.txt........updated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 12..........................updated
  num_layers ...................... 12..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  partition_activations ........... True........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints.................updated
  save_interval ................... 10000.......................updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  split ........................... 949,50,1....................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard.................updated
  train_batch_size ................ 16..........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 4...........................updated
  use_wandb ....................... True........................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... data/gpt2-vocab.json........updated
  wall_clock_breakdown ............ True........................updated
  wandb_group ..................... 2JTahMMU4myukKqf7nUqJp_35tzwh2kupdated
  weight_decay .................... 0.0.........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 2, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': True}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 2...........................updated
  activation ...................... gelu........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_softmax_in_fp32 ....... False.......................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_validation_with_forward_pass  False................default
  contiguous_checkpointing ........ False.......................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_mpi ................... False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  fp16_lm_cross_entropy ........... False.......................default
  fp32_allreduce .................. False.......................default
  git_hash ........................ 49e60fe.....................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method ..................... normal......................default
  init_method_std ................. 0.02........................default
  is_pipe_parallel ................ False.......................default
  iteration ....................... None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  min_lr .......................... 0.0.........................default
  min_scale ....................... 1.0.........................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  no_load_optim ................... False.......................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 0...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  output_layer_init_method ........ scaled_normal...............default
  output_layer_parallelism ........ row.........................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_parallel_size .............. 0...........................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile_backward ................ False.......................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  sample_input_file ............... None........................default
  sample_output_file .............. None........................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  soft_prompt_tuning .............. None........................default
  sparse_gradients ................ False.......................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  text_gen_type ................... None........................default
  tokenizer_type .................. GPT2BPETokenizer............default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weighted_sampler_alpha .......... 0.3.........................default
  world_size ...................... None........................default
  zero_allow_untested_optimizer ... False.......................default
---------------- end of arguments ----------------
[2021-12-15 02:09:34,263] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-12-15 02:09:34,263] [INFO] [runner.py:366:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe-parallel-size\": 0,\n   \"model-parallel-size\": 1,\n\n   # model settings\n   \"num-layers\": 12,\n   \"hidden-size\": 768,\n   \"num-attention-heads\": 12,\n   \"seq-length\": 2048,\n   \"max-position-embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos-emb\": \"rotary\",\n   \"no-weight-tying\": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled-upper-triang-masked-softmax-fusion\": false,\n   \"bias-gelu-fusion\": false,\n\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.0006,\n       \"betas\": [0.9, 0.999],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"zero_optimization\": {\n    \"stage\": 2,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n    \"cpu_offload\": True\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data-impl\": \"mmap\",\n   \"split\": \"949,50,1\",\n\n   # activation checkpointing\n   \"checkpoint-activations\": true,\n   \"checkpoint-num-layers\": 1,\n   \"partition-activations\": true,\n   
\"synchronize-each-layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight-decay\": 0.0,\n   \"hidden-dropout\": 0.0,\n   \"attention-dropout\": 0.0,\n\n   # precision settings\n   \"fp16\": { \n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train-iters\": 320000,\n   \"lr-decay-iters\": 320000,\n   \"distributed-backend\": \"nccl\",\n   \"lr-decay-style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"save-interval\": 10000,\n   \"eval-interval\": 1000,\n   \"eval-iters\": 10,\n\n   # logging\n   \"log-interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep-last-n-checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data-path\": \"data/enron/enron_text_document\",\n  \n  # or for weighted datasets: \n  # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab-file\": \"data/gpt2-vocab.json\",\n  \"merge-file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n  \n  \"tensorboard-dir\": \"tensorboard\",\n  \"log-dir\": \"logs\",\n  \"use_wandb\": True,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "use_wandb": true, "wandb_group": "2JTahMMU4myukKqf7nUqJp_35tzwh2k", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}
[2021-12-15 02:09:35,128] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-12-15 02:09:35,128] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-12-15 02:09:35,128] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-12-15 02:09:35,128] [INFO] [launch.py:104:main] dist_world_size=4
[2021-12-15 02:09:35,128] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1
> building GPT2BPETokenizer tokenizer ...
[2021-12-15 02:09:38,268] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-15 02:09:38,271] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later and do you have tensorboard installed?), no TensorBoard logs will be written.
> initializing torch distributed ...
[2021-12-15 02:09:38,306] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-15 02:09:38,312] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 1
MPU DP: [0, 1, 2, 3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2021-12-15 02:09:39,355] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3}
[2021-12-15 02:09:43,829] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
[2021-12-15 02:09:43,880] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-15 02:09:43,881] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
> learning rate decay style: cosine
DeepSpeed is enabled.
[2021-12-15 02:09:43,883] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
[2021-12-15 02:09:43,883] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-15 02:09:43,883] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-15 02:09:45,231] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-15 02:09:45,231] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-15 02:09:45,231] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'apex.optimizers.fused_adam.FusedAdam'>
[2021-12-15 02:09:45,231] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-12-15 02:09:45,231] [INFO] [stage2.py:102:__init__] Reduce bucket size 500000000
[2021-12-15 02:09:45,231] [INFO] [stage2.py:103:__init__] Allgather bucket size 500000000
[2021-12-15 02:09:45,232] [INFO] [stage2.py:104:__init__] CPU Offload: True
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3600890636444092 seconds
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.40231990814208984 seconds
Time to load utils op: 0.4020683765411377 seconds
Time to load utils op: 0.40213465690612793 seconds
Traceback (most recent call last):
  File "train.py", line 27, in 
    pretrain(neox_args=neox_args)
  File "/gpt-neox/megatron/training.py", line 82, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 116, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
    self.optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: expected input to be on cuda
[... identical traceback repeated by the other three ranks ...]
Killing subprocess 3914
Killing subprocess 3915
Killing subprocess 3916
Killing subprocess 3917
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in 
    main()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true}', '--megatron_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\\n{\\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\\n   # across the node boundaries )\\n   \\"pipe-parallel-size\\": 0,\\n   \\"model-parallel-size\\": 1,\\n\\n   # model settings\\n   \\"num-layers\\": 12,\\n   \\"hidden-size\\": 768,\\n   \\"num-attention-heads\\": 12,\\n   \\"seq-length\\": 2048,\\n   \\"max-position-embeddings\\": 2048,\\n   \\"norm\\": \\"layernorm\\",\\n   \\"pos-emb\\": \\"rotary\\",\\n   \\"no-weight-tying\\": true,\\n\\n   # these should provide some speedup but takes a while to build, set to true if desired\\n   \\"scaled-upper-triang-masked-softmax-fusion\\": false,\\n   \\"bias-gelu-fusion\\": false,\\n\\n\\n   # optimizer settings\\n   \\"optimizer\\": {\\n     \\"type\\": \\"Adam\\",\\n     \\"params\\": {\\n       \\"lr\\": 0.0006,\\n       \\"betas\\": [0.9, 0.999],\\n       \\"eps\\": 1.0e-8,\\n     }\\n   },\\n   \\"zero_optimization\\": {\\n    \\"stage\\": 2,\\n    \\"allgather_partitions\\": True,\\n    \\"allgather_bucket_size\\": 500000000,\\n    \\"overlap_comm\\": True,\\n    \\"reduce_scatter\\": True,\\n    \\"reduce_bucket_size\\": 500000000,\\n    \\"contiguous_gradients\\": True,\\n    \\"cpu_offload\\": True\\n  },\\n\\n   # batch / data settings\\n   \\"train_micro_batch_size_per_gpu\\": 4,\\n   \\"data-impl\\": \\"mmap\\",\\n   \\"split\\": \\"949,50,1\\",\\n\\n   # activation checkpointing\\n   \\"checkpoint-activations\\": true,\\n   \\"checkpoint-num-layers\\": 1,\\n   
\\"partition-activations\\": true,\\n   \\"synchronize-each-layer\\": true,\\n\\n   # regularization\\n   \\"gradient_clipping\\": 1.0,\\n   \\"weight-decay\\": 0.0,\\n   \\"hidden-dropout\\": 0.0,\\n   \\"attention-dropout\\": 0.0,\\n\\n   # precision settings\\n   \\"fp16\\": { \\n     \\"enabled\\": true,\\n     \\"loss_scale\\": 0,\\n     \\"loss_scale_window\\": 1000,\\n     \\"hysteresis\\": 2,\\n     \\"min_loss_scale\\": 1\\n   },\\n\\n   # misc. training settings\\n   \\"train-iters\\": 320000,\\n   \\"lr-decay-iters\\": 320000,\\n   \\"distributed-backend\\": \\"nccl\\",\\n   \\"lr-decay-style\\": \\"cosine\\",\\n   \\"warmup\\": 0.01,\\n   \\"save-interval\\": 10000,\\n   \\"eval-interval\\": 1000,\\n   \\"eval-iters\\": 10,\\n\\n   # logging\\n   \\"log-interval\\": 100,\\n   \\"steps_per_print\\": 10,\\n   \\"keep-last-n-checkpoints\\": 4,\\n   \\"wall_clock_breakdown\\": true,\\n}\\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\\n{\\n  \\"data-path\\": \\"data/enron/enron_text_document\\",\\n  \\n  # or for weighted datasets: \\n  # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"train-data-weights\\": [1., 2.],\\n  # \\"test-data-weights\\": [2., 1.],\\n  # \\"valid-data-weights\\": [0.5, 0.4],\\n\\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \\n  # WARNING: setting this to True will override any user provided weights\\n  # \\"weight_by_num_documents\\": false,\\n  # \\"weighted_sampler_alpha\\": 0.3,\\n\\n  \\"vocab-file\\": \\"data/gpt2-vocab.json\\",\\n  \\"merge-file\\": \\"data/gpt2-merges.txt\\",\\n\\n  \\"save\\": \\"checkpoints\\",\\n  \\"load\\": \\"checkpoints\\",\\n  \\"checkpoint_validation_with_forward_pass\\": False,\\n  \\n  \\"tensorboard-dir\\": \\"tensorboard\\",\\n  \\"log-dir\\": \\"logs\\",\\n  \\"use_wandb\\": True,\\n  \\"wandb_host\\": \\"https://api.wandb.ai\\",\\n  \\"wandb_project\\": \\"neox\\"\\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "use_wandb": true, "wandb_group": "2JTahMMU4myukKqf7nUqJp_35tzwh2k", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}']' returned non-zero exit status 1.

Appreciate the heads up. Yeah, I'm just using the small config to debug CPU offload. I don't intend to use it to train a model that small.

CPU offload has worked for me in the past with model parallelism on the latest DeepSpeed, using https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-ZeRO3. I don't have steps to reproduce at hand right now.

@mallorbc You can use model parallelism, which does work with CPU offload (at least, it should once we sort this issue out). It also splits the model across multiple GPUs and is even more memory-efficient than pipeline parallelism, at the cost of extra computation time. (Source: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/, "Understand the tradeoffs of data, model, and pipeline parallelism" section.) A rough config sketch is below.
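
For a 4-GPU node like the one in the log above, a hypothetical way to switch the small config from pipeline to model (tensor) parallelism would be to change only the parallelism settings (illustrative values, not a tested recommendation):

   # parallelism settings
   "pipe-parallel-size": 1,    # single pipeline stage (or 0 to bypass the pipeline engine, as in the failing config)
   "model-parallel-size": 2,   # shard each layer's weights across 2 GPUs

With 4 GPUs this yields model-parallel degree 2 and data-parallel degree 2, so ZeRO stage 2 would partition the optimizer states across the two data-parallel ranks and, once this bug is fixed, offload those partitions to CPU.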