EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

AssertionError: zero stage 1 requires an optimizer #987

Open · yonglianglan opened this issue 1 year ago

yonglianglan commented 1 year ago

An error occurred when running the evaluation code. The command is:

    python ./deepy.py evaluate.py xxxx.yml --eval_tasks piqa

[screenshot of the traceback, ending in: AssertionError: zero stage 1 requires an optimizer]

Training was run in multi-machine mode, and evaluation in single-machine mode.

Has anyone had a similar issue? Thanks!

StellaAthena commented 1 year ago

This is a known issue that is awkward to handle. Our current recommendation is to set ZeRO stage 0 when calling the evaluation script. We are working on integrating DeepSpeed Inference which will solve this issue and substantially accelerate inference tasks as well.
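For example, in the yml config passed to the evaluation run, the relevant block would look like the following (a minimal sketch using the standard DeepSpeed zero_optimization schema; your surrounding keys may differ):

  "zero_optimization": {
    "stage": 0
  },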

vsabavat commented 10 months ago

Is this bug resolved? How do we pass or set the ZeRO stage? I also see the same error during inference.

    python ./deepy.py generate.py -d configs 125M local_setup text_generation


  File "generate.py", line 91, in <module>
    main()
  File "generate.py", line 33, in main
    model, neox_args = setup_for_inference_or_eval(use_cache=True)
  File "/localhome/local-vsabavat/ai/training/gpt-neox/megatron/utils.py", line 448, in setup_for_inference_or_eval
    model, _, _ = setup_model_and_optimizer(
  File "/localhome/local-vsabavat/ai/training/gpt-neox/megatron/training.py", line 647, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/localhome/local-vsabavat/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 186, in initialize
    engine = PipelineEngine(args=args,
  File "/localhome/local-vsabavat/.local/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 68, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/localhome/local-vsabavat/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
    self.optimizer = self._configure_zero_optimizer(optimizer=None)
  File "/localhome/local-vsabavat/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1468, in _configure_zero_optimizer
    assert not isinstance(optimizer, DummyOptim), "zero stage {} requires an optimizer".format(zero_stage)
AssertionError: zero stage 1 requires an optimizer

AIproj commented 10 months ago

@vsabavat In one of your yml config files you should have something that looks like this:

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 1260000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1260000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },

In my example, the stage is set to 1. Per the recommendation above, set it to 0 when running evaluation or inference so that the engine does not require an optimizer.
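For context on why the assertion fires at all: the traceback shows setup_for_inference_or_eval calling deepspeed.initialize without an optimizer, and ZeRO stages 1 and above shard optimizer state across ranks, so DeepSpeed refuses to build the engine. A simplified sketch of that guard (illustrative only, based on the names in the traceback; not the actual DeepSpeed source):

    # Simplified sketch of the guard that raises this error (illustrative
    # only; mirrors the traceback, not the real DeepSpeed implementation).
    class DummyOptim:
        """Stand-in used when no optimizer is passed to deepspeed.initialize."""
        pass

    def _configure_zero_optimizer(optimizer, zero_stage):
        # ZeRO stages >= 1 shard optimizer state across data-parallel ranks,
        # so they cannot work without a real optimizer. Inference-only setups
        # pass optimizer=None, which the engine wraps as a DummyOptim.
        assert not isinstance(optimizer, DummyOptim), \
            "zero stage {} requires an optimizer".format(zero_stage)
        # ... otherwise, build the ZeRO optimizer wrapper here ...

Setting the stage to 0 disables ZeRO's optimizer-state sharding entirely, which is why it sidesteps the assertion during inference.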