@jacanchaplais thanks for providing feedback. I was trying to reproduce your code, but I believe that to inspect this in depth we need more parts of the code that you didn't provide, like the `gnn.loader` module, and at least a minimal data sample.
In the meantime I created this small repo to play around with this issue: https://github.com/edgarriba/pl_issue_7671
Please let us know your thoughts so that we can help solve this issue.
Thanks @edgarriba. I use the data loader class provided by PyTorch Geometric, and my full code can be seen here https://github.com/jacanchaplais/cluster_gnn.
The data loader is defined here https://github.com/jacanchaplais/cluster_gnn/blob/main/src/cluster_gnn/data/loader.py.
As I'm prototyping stuff, I wrote the data processing separately in a Jupyter notebook found here https://github.com/jacanchaplais/cluster_gnn/blob/main/notebooks/convert_data.ipynb.
Here is a small sample data set of 100 graphs: small_data.hdf5.zip
That said, have you tried reproducing this error on a simpler case, like the standard MNIST classifier? If not, it might be better to see whether the problem appears there before trying to reproduce my rather specific case, as my code hasn't yet been polished for easy portability.
EDIT: if you do want to install my codebase, you can do `conda env create -f environment.yml`, followed by `bash pyg-pip.sh ptg`, and finally `pip install -e .`.
I should note that I have reproduced this myself with MNIST; see https://github.com/optuna/optuna-examples/blob/c8df375e5bd9d741538491f87d607244ae6e9746/pytorch/pytorch_lightning_simple.py.
Optuna have since changed the number of GPUs they use in this example to 1, rather than a variable number, as I informed them of the issue that I'm reporting in the current thread.
@jacanchaplais thanks for the insights. I'll investigate this in detail.
After some investigation, it seems that after this call to `mp.spawn`
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/plugins/training_type/ddp_spawn.py#L157
the `self.lightning_module.trainer` is touched and the `callback_metrics` variable is cleared. In addition, I have noticed that `self.lightning_module.trainer` is a weak-reference object, which makes me suspect that it is somehow de-referenced, but I'm not sure about this one. /cc @awaelchli @justusschock do you have any intuition about this?
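For context, here is a minimal plain-Python sketch (no Lightning involved, names are illustrative) of the de-referencing behaviour I suspect:

```python
import weakref

class Owner:
    pass

owner = Owner()
proxy = weakref.proxy(owner)  # analogous to the trainer reference held by the module

del owner  # last strong reference dropped -> referent is garbage collected
try:
    proxy.callback_metrics
except ReferenceError as err:
    print(err)  # "weakly-referenced object no longer exists"
```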
~~@edgarriba My feeling is you might be right: when passing a weakref proxy, the original object might not exist anymore in the new process. Maybe we can use the passed trainer instead of the one in `mp_kwargs`? That one shouldn't be a proxy, IIRC.~~
EDIT: nvm my reply, @awaelchli is right, I didn't think of this.
Hey guys! Don't get misled by this. From what I can see, the OP is trying to access the `callback_metrics` outside, in the main process:
```python
trainer.fit(model, data_module)
print('callback metrics are:\n {}'.format(trainer.callback_metrics))
```
Please note that in DDP spawn the main process never trains, therefore there are no callback metrics! The only thing it does is wait for the worker processes to enter `join()` when finished. This is just how it is in ddp spawn. Lightning will make sure to add the weights to the queue so they get back to the main process, but that's pretty much all.
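To make that concrete, here is a minimal sketch (illustrative only, not the actual Lightning internals) showing that state mutated in a spawned worker never reaches the parent process's copy of the object:

```python
import torch.multiprocessing as mp

class Holder:
    def __init__(self):
        self.callback_metrics = {}

def worker(rank, holder):
    # Runs in a child process and mutates the child's copy of `holder`.
    holder.callback_metrics["val_loss"] = 0.123

if __name__ == "__main__":
    holder = Holder()
    mp.spawn(worker, args=(holder,), nprocs=2)
    print(holder.callback_metrics)  # {} -- the parent process never trained
```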
The recommendation is always to avoid DDP spawn whenever possible, so my recommendation is `accelerator="ddp"`.
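Concretely, something like this (a sketch using the Lightning 1.x-era flags discussed in this thread, with `model` and `data_module` as in your script):

```python
import pytorch_lightning as pl

# With accelerator="ddp" the process that calls .fit() is itself rank 0,
# so it trains and its callback_metrics are populated afterwards.
trainer = pl.Trainer(gpus=2, accelerator="ddp")
trainer.fit(model, data_module)
print(trainer.callback_metrics)
```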
@awaelchli you might be right, but the process gets blocked when I try with `ddp`
@edgarriba `ddp` doesn't work with notebooks; that's the only reason we still have `ddp_spawn` around. So the `callback_metrics` in spawn are populated, just not in the main process, since that one just waits and does nothing.
@justusschock I'm running on an AWS instance
I thought no form of DDP works in notebooks? Also, I can confirm that in `ddp_spawn` the `callback_metrics` work internally.
Yes, no form of plain DDP (since we usually call the script multiple times, which is not possible in Jupyter), but with spawn we spawn processes that are tied not to the script but to one specific function we pass (and thus they work); a rough sketch of the difference is below.
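A simplified sketch (not the exact Lightning launch code):

```python
# Plain ddp: the whole training script is re-executed once per GPU,
# e.g. the launcher runs `python train.py` N times with different ranks.
# A Jupyter kernel has no script to re-execute, so this cannot work there.

# ddp_spawn: one interpreter spawns workers around a target function,
# so there is no script to re-run and it works from a notebook.
import torch.multiprocessing as mp

def train_fn(rank):
    ...  # per-process training loop goes here

if __name__ == "__main__":
    mp.spawn(train_fn, nprocs=2)
```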
Dear @jacanchaplais,
Any progress on this issue?
Best, T.C
I will test specifying `accelerator="ddp"` soon and get back to you, thanks for the updates.
🐛 Bug
When `pl.Trainer(gpus>1, ...)` is used, the `callback_metrics` dictionary appears not to be populated. I've tried to integrate both Optuna and Ray Tune, and have failed with both as a result. To ensure this was the issue, I printed the `trainer.callback_metrics` attribute.

It does, however, work with 1 GPU. Unfortunately for me, my datasets are graphs, and they are so large that I can only fit one into memory at a time, so the number of GPUs = the batch size, and tuning with a batch size of 1 might not be very indicative. Any help much appreciated!
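A rough repro sketch of what I'm seeing (a hypothetical minimal setup; `model` and `data_module` stand in for my GNN code below):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(gpus=2, max_epochs=1)   # defaults to ddp_spawn on multi-GPU
trainer.fit(model, data_module)
print(trainer.callback_metrics)  # {} with gpus=2, populated with gpus=1
```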
Environment
Hardware
Software
Code
I attach the LightningModule below, which uses TorchMetrics and the `self.log` features, as per the docs. I did (in desperation) try setting the `callback_metrics` dictionary myself in `validation_epoch_end()`, but that didn't work. Neither did setting `sync_dist=True` in the `self.log()` calls.

Here I attach the tuning script using `ray[tune]`, as per their docs: https://docs.ray.io/en/master/tune/tutorials/tune-pytorch-lightning.html
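The logging pattern in the attached module is roughly as follows (a minimal illustrative sketch, not the actual attachment; the layer sizes and metric names are placeholders):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
import torchmetrics

class LitModel(pl.LightningModule):
    """Illustrative stand-in for the attached LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.val_acc = torchmetrics.Accuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        loss = F.cross_entropy(logits, y)
        self.val_acc(logits.softmax(dim=-1), y)
        # sync_dist=True reduces the value across GPUs, but under ddp_spawn it
        # still only lands in the workers' callback_metrics, never in the main
        # process's Trainer (see discussion above).
        self.log("val_loss", loss, sync_dist=True)
        self.log("val_acc", self.val_acc)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

And the Ray Tune side hooks in through their callback, which (as far as I can tell) reads from `trainer.callback_metrics`, hence it sees nothing when gpus > 1:

```python
from ray.tune.integration.pytorch_lightning import TuneReportCallback

# Maps Tune metric names to the keys logged via self.log above.
callbacks = [TuneReportCallback({"loss": "val_loss", "acc": "val_acc"},
                                on="validation_end")]
```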