manavkulshrestha opened this issue 2 months ago
@manavkulshrestha could you share how you log the metrics during training and validation?
Sure! I'm doing it in the `Model(pl.LightningModule)._step(self, batch, batch_idx, *, split, **kwargs)` method. Note that both `training_step` and `validation_step` call this method with a different `split` string argument (`'train'` and `'val'`, respectively).
```python
# ...
# Log the loss plus every metric object for this split; the key for a
# metric named 'train_acc' becomes 'acc/train'.
log_dict = {'step': self.current_epoch, f'loss/{split}': loss}
for name, metric in self.split_metrics[f'{split}_metrics'].items():
    # Some metrics consume probabilities, others hard labels.
    metric.update(y_pred_prob if needs_probability(name) else y_pred_label, y)
    log_dict['/'.join(name.split('_')[::-1])] = metric
self.log_dict(log_dict, on_step=False, on_epoch=True, prog_bar=False,
              logger=True, sync_dist=True)
# ...
```
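(`needs_probability` isn't defined anywhere in the thread; a minimal sketch of what such a helper might look like, assuming metric names carry the split as a prefix, e.g. `'train_auroc'`, and that the probability-based metric names are known up front:)

```python
# Hypothetical helper (the poster's actual implementation was not shared):
# metrics such as AUROC consume class probabilities, while label-based
# metrics such as accuracy consume hard predictions.
PROBABILITY_METRICS = {'auroc', 'ap'}

def needs_probability(name: str) -> bool:
    # e.g. 'train_auroc' -> 'auroc'
    return name.split('_')[-1] in PROBABILITY_METRICS
```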
That looks fine to me, nothing that should make the process hang. Could you also share your implementation of `self._on_epoch_end(*args, **kwargs, split='train')`, since that is where the code is hanging?
Hi, this is the code for that:
```python
def _on_epoch_end(self, *, split: str):
    # Reset every metric for this split so state doesn't leak across epochs.
    for metric in self.split_metrics[f'{split}_metrics'].values():
        metric.reset()

# ...

def on_train_epoch_end(self, *args, **kwargs):
    self._on_epoch_end(*args, **kwargs, split='train')

def on_validation_epoch_end(self, *args, **kwargs):
    self._on_epoch_end(*args, **kwargs, split='val')

def on_test_epoch_end(self, *args, **kwargs):
    self._on_epoch_end(*args, **kwargs, split='test')
```
Are you still facing this issue @manavkulshrestha?
Found this thread because I encountered this too. In my case, I was only doing the evaluation on the master process. Following this answer in a similar issue, it seems that is what causes the problem; I changed to calling it on all processes and the issue is now gone.
Hope this helps!
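To make the failure mode concrete, here is a minimal sketch (the metric name `self.train_acc` and the surrounding hook are illustrative, not the poster's code). Under DDP, `Metric.compute()` performs a collective sync across processes, so a rank-zero-only call leaves every other rank waiting forever:

```python
# Inside a LightningModule hook, e.g. on_train_epoch_end.

# Hangs under DDP: only rank 0 enters compute(), so the collective
# all-gather inside it never completes on the other ranks.
if self.trainer.is_global_zero:
    acc = self.train_acc.compute()

# Works: every rank calls compute(); restrict only the side effects
# (printing, writing files) to rank 0.
acc = self.train_acc.compute()
if self.trainer.is_global_zero:
    print(f'acc/train: {acc:.4f}')
```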
Bug description
I'm using the default `Accuracy` metric (though it appears to be true for any metric), and calling `metric.compute()` hangs after the first epoch and never resolves (I ran it overnight and it never progressed). Per some `print()` statements, the issue only occurs with metric computation after the training epoch ends, not after the validation epoch ends. The issue does not happen when using only 1 GPU or the CPU. It is also agnostic of dataset size; I tried with a dataset containing only the first 2 batches and got the same result. I see there's another relevant issue (#5930) from 3 years ago, but it has no solution (it just says to update the version and open a new issue).
What version are you seeing the problem on?
v2.4
How to reproduce the bug
In `Model(pl.LightningModule).__init__(self, splits, ...)` (sketched after the note below):
(Note: I considered using a `MetricCollection`, but some of my metrics need different inputs and I couldn't figure out how to account for that.)
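The original snippet didn't survive in this copy; a minimal sketch of what the setup plausibly looks like, inferred from the accesses shown in the comments above (`self.split_metrics[f'{split}_metrics']` with metric names like `'train_acc'`). The specific metrics and `num_classes` are assumptions:

```python
import torch.nn as nn
import torchmetrics as tm

# Inside Model.__init__. Hypothetical reconstruction: one dict of metrics
# per split, registered as ModuleDicts so Lightning moves them to the
# correct device alongside the model.
self.split_metrics = nn.ModuleDict({
    f'{split}_metrics': nn.ModuleDict({
        f'{split}_acc': tm.Accuracy(task='multiclass', num_classes=num_classes),
        f'{split}_auroc': tm.AUROC(task='multiclass', num_classes=num_classes),
    })
    for split in splits  # e.g. ('train', 'val', 'test')
})
```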
In `Model(pl.LightningModule)._step(self, batch, batch_idx, *, split, **kwargs)` (the body is the snippet quoted in the first comment above):
Relevant overloads in `Model(pl.LightningModule)` (the `_on_epoch_end` and `on_*_epoch_end` methods quoted in the comments above):
Error messages and logs
I put in some print statements, which show that it hangs on `train_acc`. This is with 2 GPUs.
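(The captured output didn't make it into this report; a hypothetical reconstruction of the kind of rank-tagged instrumentation that makes the hang visible, where `self.train_acc` is illustrative:)

```python
import torch.distributed as dist

# Hypothetical debug instrumentation: tagging each compute() call with the
# process rank shows which rank never reaches, or never leaves, the call.
rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
print(f'[rank {rank}] computing train_acc ...', flush=True)
acc = self.train_acc.compute()  # hangs here if another rank skips the call
print(f'[rank {rank}] train_acc = {acc}', flush=True)
```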
Environment
Current environment
```
#- PyTorch Lightning Version (e.g., 2.4.0): 2.4.0
#- PyTorch Version (e.g., 2.4): 2.4.0
#- TorchMetrics Version: 1.4.1
#- Python version (e.g., 3.12): 3.11.9
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version: 12.1
#- GPU models and configuration: 8 x Tesla V100-SXM2-16GB
#- How you installed Lightning (`conda`, `pip`, source): pip
```

More info
No response