EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Deepspeed zero optimizer, error converting model checkpoints #594

Closed MatejUlcar closed 1 year ago

MatejUlcar commented 2 years ago

Describe the bug
Unable to convert the checkpoints of a custom gpt-neox model (trained with ZeRO stage 3) using the zero_to_fp32.py script.

To Reproduce
Train a model with ZeRO stage 3, pp=0, mp=1 (I haven't attempted other combinations). Save a checkpoint. Run the zero_to_fp32.py script that is saved alongside the checkpoint.
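(For context, a minimal sketch of what the script does, based on the function name and argument order visible in the tracebacks below; both paths here are placeholders.)

# Roughly what invoking zero_to_fp32.py amounts to; the two paths are placeholders.
from zero_to_fp32 import convert_zero_chkpt_to_fp32_consolid_state_dict

convert_zero_chkpt_to_fp32_consolid_state_dict(
    "path/to/checkpoint_dir",          # directory containing the saved ZeRO checkpoint
    "path/to/pytorch_model_fp32.bin",  # consolidated fp32 state dict to write
)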

Environment (please complete the following information):

Attempted so far
Running the script as-is:

Detected checkpoint of type zero stage 3, world_size: 4
Traceback (most recent call last):
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 121, in convert_zero_chkpt_to_fp32_consolid_state_dict
    state_dict[name] = torch.cat(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper___cat)

Limiting the environment to a single GPU (e.g. via CUDA_VISIBLE_DEVICES=0):

Traceback (most recent call last):
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 83, in convert_zero_chkpt_to_fp32_consolid_state_dict
    zero_stage, world_size, param_shapes, fp32_flat_groups = parse_optim_states(optim_files)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 39, in parse_optim_states
    state_dicts.append(torch.load(f))
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 857, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 846, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 142, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.

Adding a map_location argument to torch.load, as suggested by the error above:
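(A minimal sketch of such a change; mapping the storages to CPU is an assumption, any single existing device would also satisfy torch.load.)

# Hypothetical edit inside parse_optim_states() in zero_to_fp32.py, cf. line 39
# of the traceback above: map every saved storage to CPU so loading no longer
# requires the original four GPUs to be visible.
state_dicts.append(torch.load(f, map_location=torch.device("cpu")))

With the optimizer states loading successfully, the script gets further but now fails in the narrow() call: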

Detected checkpoint of type zero stage 3, world_size: 4
Traceback (most recent call last):
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 122, in convert_zero_chkpt_to_fp32_consolid_state_dict
    tuple(fp32_flat_groups[i].narrow(0,
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 122, in <genexpr>
    tuple(fp32_flat_groups[i].narrow(0,
RuntimeError: start (27432576) + length (6168576) exceeds dimension size (33570816).

I have also attempted the conversion with DeepSpeed's latest script, but it errored saying it's not a model state checkpoint, so I figure the differences between the current DeeperSpeed and the latest DeepSpeed are too great for it to be of any use. Please advise on how I could salvage the model, i.e. actually use it for inference/evaluation, rather than training again from scratch.

StellaAthena commented 2 years ago

Hmmm. We haven’t really been using ZeRO 3, as our testing indicates that it’s a big step up in complexity for a minimal increase in performance. It’s possible we broke something without realizing it.

I’ll have to train a model and play with it a bit. Can you post the config file? Alternatively, would you be able to share the trained model by any chance? Hopefully we can get this resolved.

MatejUlcar commented 2 years ago

I've figured it's less bother to train from scratch without ZeRO 3, seeing how the GPUs were under-utilized in the original run. Either way, I can share the trained model: https://drive.google.com/drive/folders/13Z4g4eGFd33yhI2HWj4EM4G6lGw2W24D?usp=sharing The config files are included in the archive; the model config is below:

{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 0,
   "model-parallel-size": 1,

   # model settings
   "num-layers": 12,
   "hidden-size": 768,
   "num-attention-heads": 12,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,
    # this should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": false,
   "train-iters": 80000,

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 1.0e-5,
       #"freeze_step": 5000,
       "betas": [0.9, 0.999],
       #"cuda_aware": false,
       #"comm_backend_name": "nccl"
     }
   },

   "zero_optimization": {
    "stage": 3,
    "allgather_partitions": True,
    "allgather_bucket_size": 100000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 100000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },
  "zero_allow_untested_optimizer": true,

   # batch / data settings
   "train_micro_batch_size_per_gpu": 8,
   "gradient_accumulation_steps": 4,
   "data-impl": "mmap",
   "split": "949,50,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.05,
   "hidden-dropout": 0.1,
   "attention-dropout": 0.1,

   # precision settings
   "fp16": { 
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 500,
     "hysteresis": 2,
     "min_loss_scale": 1,
   },

   # lr decay settings
   "lr-decay-iters": 80000,
   "lr-decay-style": "cosine",
   "warmup": 0.01,

   # misc. training settings
   "distributed-backend": "nccl",
   #"save-interval": 10000,
   #"eval-interval": 1000,
   "save-interval": 500,
   "eval-interval": 100,
   "eval-iters": 10,

   # logging
   #"log-interval": 100,
   "log-interval": 10,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 4,
   "wall_clock_breakdown": true,

  # sparse attention
  #"attention_config": [[["local", "global"], "all"]],
}

StellaAthena commented 1 year ago

We do not currently support ZeRO 3, which seems to be the core source of your issue. Closing for now.