Lightning-Universe / lightning-Horovod

Lightning Training strategy for Horovod

accumulation_scheduler does not exist #15

Open zjost opened 1 year ago

zjost commented 1 year ago

🐛 Bug

Trainer.accumulation_scheduler no longer exists on recent versions of the Lightning Trainer, so HorovodStrategy.setup() fails with an AttributeError.

To Reproduce

Steps to reproduce the behavior:

from lightning.pytorch import Trainer
from lightning_horovod import HorovodStrategy

trainer = Trainer(..., accelerator='gpu', strategy=HorovodStrategy())
trainer.fit(...)
/tmp/ipykernel_1183/1481180612.py in <cell line: 1>()
----> 1 trainer.fit(model=autoencoder, train_dataloaders=train_loader)

/home/default_user/.conda/envs/user/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py in fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    518         model = _maybe_unwrap_optimized(model)
    519         self.strategy._lightning_module = model
--> 520         call._call_and_handle_interrupt(
    521             self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    522         )

/home/default_user/.conda/envs/user/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     42             return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     43         else:
---> 44             return trainer_fn(*args, **kwargs)
     45 
     46     except _TunerExitException:

/home/default_user/.conda/envs/user/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py in _fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    557             model_connected=self.lightning_module is not None,
    558         )
--> 559         self._run(model, ckpt_path=ckpt_path)
    560 
    561         assert self.state.stopped

/home/default_user/.conda/envs/user/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py in _run(self, model, ckpt_path)
    909 
    910         # strategy will configure model and move it to the device
--> 911         self.strategy.setup(self)
    912 
    913         # hook

/home/default_user/.conda/envs/user/lib/python3.10/site-packages/lightning_horovod/strategy.py in setup(self, trainer)
    148             hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    149 
--> 150         accumulation_scheduler = trainer.accumulation_scheduler
    151         if accumulation_scheduler.epochs != [0]:
    152             raise MisconfigurationException(

AttributeError: 'Trainer' object has no attribute 'accumulation_scheduler'
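
For reference, here is a minimal sketch of how the failing check could be made compatible with newer Lightning, assuming (as the traceback suggests) that the accumulation_scheduler attribute was removed from Trainer while the GradientAccumulationScheduler callback still exists. The callback lookup and the error message wording are assumptions, not the package's actual code:

# Hedged sketch of a possible fix inside HorovodStrategy.setup(self, trainer)
# in lightning_horovod/strategy.py (untested). Assumption: in Lightning 2.x
# Trainer.accumulation_scheduler is gone, but a user-registered
# GradientAccumulationScheduler callback carries the same schedule info.
from lightning.pytorch.callbacks import GradientAccumulationScheduler
from lightning.pytorch.utilities.exceptions import MisconfigurationException

schedulers = [cb for cb in trainer.callbacks if isinstance(cb, GradientAccumulationScheduler)]
if schedulers and schedulers[0].epochs != [0]:
    # epochs == [0] means a constant accumulation factor, which Horovod allows
    raise MisconfigurationException(
        "Horovod does not support per-epoch accumulate_grad_batches schedules."
    )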

Environment

Other info

On a side note, the various documentation sources do not really explain how to use Horovod + Lightning in a way that works. The Lightning documentation refers to this repo (which is not easy to find). This repo refers to the Horovod docs. The Horovod docs don't mention this repo, and instead say pl.Trainer(accelerator='horovod') or pl.Trainer(distributed_backend='horovod'), neither of which works. The README here says trainer = Trainer(strategy="horovod", accelerator="gpu", devices=1), but that doesn't work either. I ended up using the CPU example, strategy=HorovodStrategy(), but also specifying accelerator='gpu', as shown in the sketch below.
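
Concretely, this is the invocation that got past the strategy-selection errors (a sketch; autoencoder and train_loader are the model and dataloader from my setup):

# The combination that worked: an explicit HorovodStrategy instance plus
# accelerator='gpu'. Placeholders: autoencoder and train_loader stand in for
# whatever LightningModule and DataLoader you are training with.
from lightning.pytorch import Trainer
from lightning_horovod import HorovodStrategy

trainer = Trainer(strategy=HorovodStrategy(), accelerator='gpu')
trainer.fit(model=autoencoder, train_dataloaders=train_loader)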

github-actions[bot] commented 1 year ago

Hi! Thanks for your contribution, great first issue!

uday-rao-aera commented 1 year ago

I got the same error. What is the fix for this?