Lightning-AI / pytorch-lightning


Formatting for grouped metrics. #10097

Open TezRomacH opened 3 years ago

TezRomacH commented 3 years ago

🚀 Feature

Motivation

With Lightning, I use TensorBoard and ClearML, both of which let you group metrics under one tab, using "/" as the group separator. The screenshots from TensorBoard and ClearML below show how the metrics are grouped now.

[Screenshots: current metric grouping in TensorBoard and ClearML]

Pitch

  1. The first is that when you set on_step=True and on_epoch=True, a postfix is appended to the metric name, so in ClearML the step and epoch metrics end up grouped together. I would like to split this group into two: loss_step/<train or valid> and loss_epoch/<train or valid>. Could the name parameter in self.log support formatting, something like name="loss_{interval}/valid"? What do you think? (See the sketch after this list.)

  2. The second problem also involves formatting. When you pass a grouped metric name to filename in ModelCheckpoint, Lightning interprets the "/" as a directory separator and creates a folder :slightly_smiling_face: (see screenshot). Would it be possible to escape the slash and replace it with an underscore?

    [Screenshots: ModelCheckpoint creating nested folders from the "/" in the metric name]
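
For illustration, a minimal sketch of the current naming behaviour versus the proposed one; the `{interval}` placeholder is hypothetical and not an existing self.log feature:

```py
def validation_step(self, batch, batch_idx):
    loss = self(batch).sum()
    # current: with both flags on, Lightning appends "_step"/"_epoch", so the logger
    # receives "loss/valid_step" and "loss/valid_epoch" and ClearML groups both under "loss"
    self.log("loss/valid", loss, on_step=True, on_epoch=True)
    # proposed (hypothetical): "{interval}" would expand to "step"/"epoch", producing
    # "loss_step/valid" and "loss_epoch/valid", i.e. two separate groups
    # self.log("loss_{interval}/valid", loss, on_step=True, on_epoch=True)
    return loss
```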

Alternatives

Additional context

Similar problem, but with W&B in Lightning-Flash https://github.com/PyTorchLightning/lightning-flash/issues/818



Programmer-RD-AI commented 3 years ago

Hi, I would like to try this issue.

rohitgr7 commented 3 years ago

just some workarounds:

  1. Log them separately, and Lightning won't alter the keys:

     def training_step(self, batch, batch_idx):
         loss = self(batch).sum()
         self.log("loss_step/train", loss, on_step=True, on_epoch=False)
         self.log("loss_epoch/train", loss, on_step=False, on_epoch=True)
         return loss

     def validation_step(self, batch, batch_idx):
         loss = self(batch).sum()
         self.log("loss_step/valid", loss, on_step=True, on_epoch=False)
         self.log("loss_epoch/valid", loss, on_step=False, on_epoch=True)


  2. Configure log_metrics inside a custom logger:

```py
class CustomTensorboardLogger(pl.loggers.TensorBoardLogger):
    def log_metrics(self, metrics, step):
        metrics = ...  # update/rename the metric keys here
        super().log_metrics(metrics, step)

trainer = Trainer(..., logger=CustomTensorboardLogger(save_dir='.', name='lightning_logs'))
```

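For concreteness, the key rewrite could look something like this sketch, assuming Lightning's default `_step`/`_epoch` suffixes and the grouping scheme requested in the issue:

```py
def _regroup(key: str) -> str:
    # e.g. "loss/valid_step" -> "loss_step/valid", "loss/valid_epoch" -> "loss_epoch/valid"
    for suffix in ("_step", "_epoch"):
        if key.endswith(suffix) and "/" in key:
            group, _, rest = key[: -len(suffix)].partition("/")
            return f"{group}{suffix}/{rest}"
    return key

# inside log_metrics above:
# metrics = {_regroup(k): v for k, v in metrics.items()}
```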
TezRomacH commented 3 years ago

@rohitgr7 I like this! Very clever solutions :)

But hopefully it will be possible to do this out of the box in PL :)

TezRomacH commented 3 years ago
  2. Configure log_metrics inside a custom logger:

     class CustomTensorboardLogger(pl.loggers.TensorBoardLogger):
         def log_metrics(self, metrics, step):
             metrics = ...  # update/rename the metric keys here
             super().log_metrics(metrics, step)

     trainer = Trainer(..., logger=CustomTensorboardLogger(save_dir='.', name='lightning_logs'))

Did you mean a custom ModelCheckpoint here? It is the ModelCheckpoint that does not save the file the way it is supposed to.

rohitgr7 commented 3 years ago

Ok yeah, ignore the 2nd solution; it won't work with both the logger and model_checkpoint.

TezRomacH commented 3 years ago

I would like to contribute the fix, at least for the second feature. As far as I can see, it will be enough to add character escaping to the format_checkpoint_name method in ModelCheckpoint.

TezRomacH commented 3 years ago

just some workarounds:

  1. Log them separately, and lightning won't alter the keys
def training_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("loss_step/train", loss, on_step=True, on_epoch=False)
    self.log("loss_epoch/train", loss, on_step=False, on_epoch=True)
    return loss

def validation_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("loss_step/valid", loss, on_step=True, on_epoch=False)
    self.log("loss_epoch/valid", loss, on_step=False, on_epoch=True)

Will it work the same way if I log not a loss (i.e. a torch.Tensor or float) but a torchmetrics.Metric instance (see the screenshot)? Will the underlying update(), compute(), and reset() logic work correctly with this duplicated logging?

[Screenshot: the torchmetrics.Metric logging referred to above]
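
For context, a minimal fragment sketch of the pattern being asked about (the class name and metric names are made up; torchmetrics.Accuracy with task="binary" is just an example Metric):

```py
import pytorch_lightning as pl
import torchmetrics


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_acc = torchmetrics.Accuracy(task="binary")  # any Metric works here

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        self.val_acc(preds, y)
        # the same Metric object is logged under two keys; whether update()/compute()/
        # reset() behave correctly with this duplicated logging is the open question
        self.log("acc_step/valid", self.val_acc, on_step=True, on_epoch=False)
        self.log("acc_epoch/valid", self.val_acc, on_step=False, on_epoch=True)
```
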
rohitgr7 commented 3 years ago

I think it should, but I'm not 100% sure. There might be some internal re-computations, though.

rohitgr7 commented 3 years ago

For the second case, maybe this can work:

class CustomModelCheckpoint(pl.callbacks.ModelCheckpoint):
    def format_checkpoint_name(self, *args, **kwargs):
        filepath = super().format_checkpoint_name(*args, **kwargs)
        # replace `/` with something else, e.g. `_`
        # (note: this simple version also touches any `/` in the directory part of the path)
        filepath = filepath.replace("/", "_")
        return filepath

wdyt?
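
For illustration, it might be wired up roughly like this, continuing the snippet above (the monitor and filename values here are made up for the example):

```py
ckpt = CustomModelCheckpoint(
    monitor="loss_epoch/valid",
    # with the default auto_insert_metric_name=True this would normally expand to
    # "best-loss_epoch/valid=0.123", which contains the problematic "/"
    filename="best-{loss_epoch/valid:.3f}",
)
trainer = Trainer(callbacks=[ckpt])
```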

TezRomacH commented 3 years ago

wdyt?

I need to check this in my pipeline.

Hmm, could we maybe add a new parameter escaped_chars: Optional[List[str]] = None to pl.callbacks.ModelCheckpoint?

By default it would be None and no characters would be escaped, so the code would remain backward-compatible.
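
A rough sketch of what that could look like, written as a subclass here since escaped_chars is only a proposal and not an existing ModelCheckpoint argument:

```py
from typing import List, Optional

import pytorch_lightning as pl


class EscapingModelCheckpoint(pl.callbacks.ModelCheckpoint):
    """Sketch of the proposed escaped_chars behaviour (hypothetical API)."""

    def __init__(self, *args, escaped_chars: Optional[List[str]] = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.escaped_chars = escaped_chars  # None keeps today's behaviour

    def format_checkpoint_name(self, *args, **kwargs):
        filepath = super().format_checkpoint_name(*args, **kwargs)
        for ch in self.escaped_chars or []:
            filepath = filepath.replace(ch, "_")  # replace each escaped char with an underscore
        return filepath
```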