datamol-io / graphium

Graphium: Scaling molecular GNNs to infinity.
https://graphium-docs.datamol.io/
Apache License 2.0

Support poptorch compilation without `traceModel(True)` #134

hatemhelal closed 1 year ago

hatemhelal commented 1 year ago

I can work on figuring out how to support the default compiler mode so that we can remove this line.

_Originally posted by @hatemhelal in https://github.com/valence-discovery/goli/pull/123#discussion_r976340055_

DomInvivo commented 1 year ago

@hatemhelal, check the new test I added, `test_poptorch_deviceiterations_gradient_accumulation`, on the branch `grad_accum`. It uses a very simple model to test whether gradient accumulation and device iterations work correctly.

Notice the line `opts.Jit.traceModel(True)`. If you remove it, you get the error below:

Traceback (most recent call last):
  File "/home/dom/goli/tests/test_ipu_dataloader.py", line 191, in test_poptorch_deviceiterations_gradient_accumulation
    trainer.fit(model=model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 171, in advance
    batch = next(data_fetcher)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 269, in fetching_function
    return self.move_to_device(batch)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 284, in move_to_device
    batch = self.batch_to_device(batch)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 291, in _apply_batch_transfer_handler
    batch = hook(batch, device, dataloader_idx)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 713, in transfer_batch_to_device
    return move_data_to_device(batch, device)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 354, in move_data_to_device
    return apply_to_collection(batch, dtype=dtype, function=batch_to)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 121, in apply_to_collection
    v = apply_to_collection(
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/dom/.venv/goli_ipu/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 347, in batch_to
    data_output = data.to(device, **kwargs)
poptorch.poptorch_core.Error: In poptorch/source/dispatch_tracer/RegisterAtenOverloads.cpp:88: 'poptorch_cpp_error': There is no active dispatch

callumm-graphcore commented 1 year ago

Hi Dominique, I've now reproduced this error and will continue to investigate it.

DomInvivo commented 1 year ago

Thanks @callumm-graphcore! To fix this issue, you might need to move to a newer version of pytorch-lightning.

See issue #126, where we also discuss moving to lightning >1.7.0.

DomInvivo commented 1 year ago

@callumm-graphcore and @hatemhelal, is this something you are actively working on? If so, what would be the ETA?

Or is it something you're planning for a few weeks in the future?

callumm-graphcore commented 1 year ago

Hi @DomInvivo, this is something I am actively working on, along with the MuTransfer work. I'm still scoping the amount of work that needs to be done here, so I don't feel that I can give an accurate ETA yet.

The good news is that I've confirmed that the dispatcher supports dictionaries as outputs (and as inputs, as long as the dictionary's keys are fixed). So by moving to the dispatcher we can hopefully deprecate or remove most of `IPUPluginGoli`, which is where some of the immediate problems with transitioning to the dispatcher arise.
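
The fixed-keys constraint on dictionary inputs could be checked before any batch reaches the compiled model. This is only a minimal sketch, assuming batches are plain Python dicts; `check_fixed_keys` and the key names are hypothetical and not part of poptorch or this repository.

```python
from typing import Any, Dict, Iterable, Tuple


def check_fixed_keys(batches: Iterable[Dict[str, Any]]) -> Tuple[str, ...]:
    """Return the sorted key tuple shared by all batches, or raise if any differs."""
    expected = None
    for index, batch in enumerate(batches):
        keys = tuple(sorted(batch))
        if expected is None:
            # The first batch defines the key set every later batch must match.
            expected = keys
        elif keys != expected:
            raise ValueError(f"batch {index} has keys {keys}, expected {expected}")
    if expected is None:
        raise ValueError("no batches were provided")
    return expected


# Hypothetical batches; the key names are illustrative only.
batches = [
    {"features": [1.0, 2.0], "labels": [0]},
    {"features": [3.0, 4.0], "labels": [1]},
]
fixed_keys = check_fixed_keys(batches)
```

Running a check like this once over the dataloader output would surface a varying key set as a plain `ValueError` rather than as a compilation failure later on.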

DomInvivo commented 1 year ago

Fixed in PR #179