EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

RuntimeError: stack expects each tensor to be equal size #929

Closed · cateto closed this issue 1 year ago

cateto commented 1 year ago

Describe the bug
When I train this model with the config below and train_micro_batch_size_per_gpu = 100, the runtime error below is raised. If I set train_micro_batch_size_per_gpu < 100 it works, but I want to use the full GPU memory! Please let me know what is going wrong.

ML-01: training ...
ML-01: Traceback (most recent call last):
ML-01:   File "train.py", line 27, in <module>
ML-01:     pretrain(neox_args=neox_args)
ML-01:   File "/home/research/gpt-neox/megatron/training.py", line 226, in pretrain
ML-01:     iteration = train(
ML-01:   File "/home/research/gpt-neox/megatron/training.py", line 778, in train
ML-01:     loss_dict, skipped_iter = train_step(
ML-01:   File "/home/research/gpt-neox/megatron/training.py", line 684, in train_step
ML-01:     reduced_loss = train_step_pipe(
ML-01:   File "/home/research/gpt-neox/megatron/training.py", line 734, in train_step_pipe
ML-01:     loss = model.train_batch(data_iter=data_iterator)
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
ML-01:     self._exec_schedule(sched)
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1374, in _exec_schedule
ML-01:     self._exec_instr(**cmd.kwargs)
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 790, in _exec_load_micro_batch
ML-01:     batch = self._next_batch()
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 622, in _next_batch
ML-01:     batch = next(self.data_iterator)
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
ML-01:     data = self._next_data()
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
ML-01:     return self._process_data(data)
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
ML-01:     data.reraise()
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
ML-01:     raise exception
ML-01: RuntimeError: Caught RuntimeError in DataLoader worker process 0.
ML-01: Original Traceback (most recent call last):
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
ML-01:     data = fetcher.fetch(index)
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
ML-01:     return self.collate_fn(data)
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 162, in default_collate
ML-01:     return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 162, in <dictcomp>
ML-01:     return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 151, in default_collate
ML-01:     return default_collate([torch.as_tensor(b) for b in batch])
ML-01:   File "/home/research/anaconda3/envs/gpt_neox_py38/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in default_collate
ML-01:     return torch.stack(batch, 0, out=out)
ML-01: RuntimeError: stack expects each tensor to be equal size, but got [2049] at entry 0 and [6805] at entry 30
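
For context, the final frame is PyTorch's default_collate, which batches the per-sample tensors with torch.stack and therefore requires every sample in a batch to have the same length; here entry 0 has length 2049 (presumably seq-length + 1) while entry 30 has length 6805. A minimal standalone sketch of the same failure, using the sizes from the traceback rather than the actual gpt-neox data pipeline:

import torch

# default_collate ultimately calls torch.stack, which requires identically
# shaped tensors; mixing a 2049-token sample with a 6805-token sample
# reproduces the RuntimeError reported above.
a = torch.zeros(2049, dtype=torch.long)
b = torch.zeros(6805, dtype=torch.long)
torch.stack([a, b], 0)  # RuntimeError: stack expects each tensor to be equal size ...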

StellaAthena commented 1 year ago

There are several strange things in your config file. Can you tell us a bit more about what you’re trying to do? Specifically:

  1. You probably shouldn’t be using a MBS of 100 ever.
  2. You’re using MP = 2 and PP = 2, but your model is small enough that it should be trainable without either. You can fit the entire model on a single GPU and then do pure data parallelism.
  3. Is it correct that you have two nodes each with two A100s, and not four A100s on a single node?

It seems plausible that we might have a data loader bug that only appears in excess of a MBS of 100 because we’ve probably never tested it that large. However our data loader is largely the same as Megatron-DeepSpeed’s… have you tried that code base to see if it works with a MBS of 100?

Quentin-Anthony commented 1 year ago

> There are several strange things in your config file. Can you tell us a bit more about what you’re trying to do? Specifically:
>
>   1. You probably shouldn’t be using a MBS of 100 ever.
>   2. You’re using MP = 2 and PP = 2, but your model is small enough that it should be trainable without either. You can fit the entire model on a single GPU and then do pure data parallelism.
>   3. Is it correct that you have two nodes each with two A100s, and not four A100s on a single node?
>
> It seems plausible that we might have a data loader bug that only appears in excess of a MBS of 100 because we’ve probably never tested it that large. However our data loader is largely the same as Megatron-DeepSpeed’s… have you tried that code base to see if it works with a MBS of 100?

This issue should have been fixed in https://github.com/EleutherAI/gpt-neox/pull/835

@cateto -- Following up on @StellaAthena's questions, please try setting model-parallel-size and pipe-parallel-size to 1, which will improve performance but force you to reduce the batch size. Also, please share your commit hash here so that we can investigate further if the above PR didn't fix this issue properly.
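
For reference, the relevant keys are the same ones that appear in the full config posted later in this thread; a minimal excerpt (a sketch only, following the suggestion above) would be:

   # run without pipeline or model (tensor) parallelism; each GPU then holds a
   # full replica of the model and all GPUs are used for data parallelism
   "pipe-parallel-size": 1,
   "model-parallel-size": 1,

With four GPUs and MP = PP = 1 the data-parallel degree becomes 4 and each GPU holds the whole model, which is why the per-GPU micro batch size has to come down.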

StellaAthena commented 1 year ago

@cateto hey, any updates on this?

cateto commented 1 year ago

Solved it! Thank you. Here is the working config, with 2 nodes (2 × A100 80GB):

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 1,
   "num_nodes": 2,

   # model settings
   "num-layers": 12,
   "hidden-size": 768,
   "num-attention-heads": 12,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,
   "gpt_j_residual": false,
   "output_layer_parallelism": "column",

   # these should provide some speedup but take a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": false,
   "bias-gelu-fusion": false,

   # init methods
   "init_method": "small_init",
   "output_layer_init_method": "wang_init",

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0006,
       "betas": [0.9, 0.95],
       "eps": 1.0e-8,
     }
   },
  "min_lr": 0.00006,

   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 55,
   "gradient_accumulation_steps": 2,
   "data-impl": "mmap",
   #"split": "949,50,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.1,
   "hidden-dropout": 0.0,
   "attention-dropout": 0.0,

   # precision settings
   "fp16": {
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 100000,
   "lr-decay-iters": 100000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "checkpoint-factor": 2500,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 450,
   "wall_clock_breakdown": true,

}
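
For reference, assuming two A100s per node (four GPUs in total) and model/pipeline parallelism of 1 as above, the data-parallel degree is 4, so DeepSpeed's batch-size relation gives the effective global batch for this config:

   train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * data_parallel_degree
                    = 55 * 2 * 4
                    = 440 sequences per optimizer step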