Closed by hatemhelal 1 year ago
@hatemhelal, check the new test that I added, `test_poptorch_deviceiterations_gradient_accumulation`, on the branch `grad_accum`. In this test, I created a very simple model to check whether gradient accumulation and device iterations work correctly.
Notice the line `opts.Jit.traceModel(True)`. If you remove it, you get the error below:
Traceback (most recent call last):
File "/home/dom/goli/tests/test_ipu_dataloader.py", line 191, in test_poptorch_deviceiterations_gradient_accumulation
trainer.fit(model=model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 171, in advance
batch = next(data_fetcher)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 269, in fetching_function
return self.move_to_device(batch)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 284, in move_to_device
batch = self.batch_to_device(batch)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in batch_to_device
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 291, in _apply_batch_transfer_handler
batch = hook(batch, device, dataloader_idx)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 713, in transfer_batch_to_device
return move_data_to_device(batch, device)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 354, in move_data_to_device
return apply_to_collection(batch, dtype=dtype, function=batch_to)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 121, in apply_to_collection
v = apply_to_collection(
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 347, in batch_to
data_output = data.to(device, **kwargs)
poptorch.poptorch_core.Error: In poptorch/source/dispatch_tracer/RegisterAtenOverloads.cpp:88: 'poptorch_cpp_error': There is no active dispatch
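For context on what the test exercises: with PopTorch, the number of samples consumed per training step is the product of the micro-batch size, the device iterations, the gradient accumulation factor, and the replication factor. The snippet below is only a minimal pure-Python sketch of that arithmetic; `samples_per_step` is a hypothetical helper written for illustration and is not part of goli, PopTorch, or pytorch-lightning.

```python
def samples_per_step(micro_batch_size, device_iterations,
                     gradient_accumulation, replication_factor=1):
    """Return how many samples one training step consumes.

    Hypothetical helper (not a PopTorch API): it only multiplies the
    four factors that together define the combined batch size.
    """
    factors = {
        "micro_batch_size": micro_batch_size,
        "device_iterations": device_iterations,
        "gradient_accumulation": gradient_accumulation,
        "replication_factor": replication_factor,
    }
    # Every factor must be a positive integer for the product to make sense.
    for name, value in factors.items():
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return (micro_batch_size * device_iterations
            * gradient_accumulation * replication_factor)


# Example values chosen for illustration only.
print(samples_per_step(micro_batch_size=2, device_iterations=3,
                       gradient_accumulation=4))
```

A test dataset needs at least this many samples for a single step to run, which is worth keeping in mind when the toy model in the test uses a very small dataset.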
Hi Dominique, I've now reproduced this error and will continue to investigate it.
Thanks @callumm-graphcore! While fixing this issue, you might need to move to a newer version of pytorch-lightning.
See issue #126, where we also discuss moving to lightning >1.7.0.
@callumm-graphcore and @hatemhelal, is this something you are actively working on? If so, what would be the ETA?
Or is it something you're planning for a few weeks from now?
Hi @DomInvivo, this is something I am actively working on, along with the MuTransfer work. I'm still working on scoping the amount of work that needs to be done here so I don't feel that I can give an accurate ETA yet.
The good news is that I've confirmed that the dispatcher supports dictionaries as outputs (and as inputs, as long as the keys of the dictionary are fixed), so hopefully by moving to the dispatcher we can deprecate/remove most of `IPUPluginGoli`, which is where some of the immediate problems from transitioning to the dispatcher arise.
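To illustrate the "fixed keys" condition for dictionary inputs, here is a minimal sketch of a sanity check that could be run over a few batches before handing them to the model. It assumes that "fixed" means every batch exposes exactly the same set of keys; `check_fixed_keys` is a hypothetical helper for illustration, not something that exists in goli or PopTorch.

```python
def check_fixed_keys(batches):
    """Verify that every dictionary batch has the same set of keys.

    Hypothetical helper: returns the shared key set, or raises
    ValueError naming the first batch whose keys differ.
    """
    expected = None
    for index, batch in enumerate(batches):
        keys = frozenset(batch)
        if expected is None:
            # The first batch defines the reference key set.
            expected = keys
        elif keys != expected:
            missing = sorted(expected - keys)
            extra = sorted(keys - expected)
            raise ValueError(
                f"batch {index} has different keys "
                f"(missing={missing}, extra={extra})"
            )
    return expected if expected is not None else frozenset()


# Illustrative batches; the key names are made up for this example.
batches = [
    {"features": [1.0, 2.0], "labels": [0]},
    {"labels": [1], "features": [3.0, 4.0]},
]
print(sorted(check_fixed_keys(batches)))
```

Key order does not matter in this check, only the set of keys; whether the dispatcher is also sensitive to ordering is not something this sketch addresses.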
Fixed in PR #179
I can work on figuring out how to support the default compiler mode so that we can remove this line
_Originally posted by @hatemhelal in https://github.com/valence-discovery/goli/pull/123#discussion_r976340055_