Lightning-AI / pytorch-lightning


Formatting for grouped metrics. #10097

Open TezRomacH opened 3 years ago

TezRomacH commented 3 years ago

🚀 Feature

Motivation

With Lightning, I use TensorBoard and ClearML, both of which let you group metrics under one tab, using "/" as the group separator. The screenshots from TensorBoard and ClearML below show how the metrics are grouped now.

[Screenshots: current metric grouping in TensorBoard and ClearML]

Pitch

  1. The first is that when you set on_step=True and on_epoch=True, a postfix is appended to the metric name, so in ClearML the step and epoch metrics end up grouped together. I would like to split this group into two: loss_step/<train or valid> and loss_epoch/<train or valid>. Could the name parameter in self.log support formatting, something like name="loss_{interval}/valid"? What do you think? (See the sketch after this list.)

  2. The second problem also involves formatting. When you pass a grouped metric name to filename in ModelCheckpoint, Lightning interprets the "/" as a directory separator and creates a folder :slightly_smiling_face: (see screenshot). Would it be possible to escape the slash and replace it with an underscore?

    [Screenshots: ModelCheckpoint creating nested folders from the "/" in the metric name]
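
For illustration, a minimal sketch of the current naming behaviour versus the proposed one; the `{interval}` placeholder is hypothetical and not an existing self.log feature:

```py
def validation_step(self, batch, batch_idx):
    loss = self(batch).sum()
    # current: with both flags on, Lightning appends "_step"/"_epoch", so the logger
    # receives "loss/valid_step" and "loss/valid_epoch" and ClearML groups both under "loss"
    self.log("loss/valid", loss, on_step=True, on_epoch=True)
    # proposed (hypothetical): "{interval}" would expand to "step"/"epoch", producing
    # "loss_step/valid" and "loss_epoch/valid", i.e. two separate groups
    # self.log("loss_{interval}/valid", loss, on_step=True, on_epoch=True)
    return loss
```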

Alternatives

Additional context

Similar problem, but with W&B in Lightning-Flash https://github.com/PyTorchLightning/lightning-flash/issues/818



Programmer-RD-AI commented 3 years ago

Hi, I would like to try this issue.

rohitgr7 commented 3 years ago

just some workarounds:

  1. Log them separately, and Lightning won't alter the keys:

     def training_step(self, batch, batch_idx):
         loss = self(batch).sum()
         self.log("loss_step/train", loss, on_step=True, on_epoch=False)
         self.log("loss_epoch/train", loss, on_step=False, on_epoch=True)
         return loss

     def validation_step(self, batch, batch_idx):
         loss = self(batch).sum()
         self.log("loss_step/valid", loss, on_step=True, on_epoch=False)
         self.log("loss_epoch/valid", loss, on_step=False, on_epoch=True)


  2. Configure log_metrics inside a custom logger:

```py
class CustomTensorboardLogger(pl.loggers.TensorBoardLogger):
    def log_metrics(self, metrics, step):
        metrics = ...  # update/rename the metric keys here
        super().log_metrics(metrics, step)

trainer = Trainer(..., logger=CustomTensorboardLogger(save_dir='.', name='lightning_logs'))
```

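For concreteness, the key rewrite could look something like this sketch, assuming Lightning's default `_step`/`_epoch` suffixes and the grouping scheme requested in the issue:

```py
def _regroup(key: str) -> str:
    # e.g. "loss/valid_step" -> "loss_step/valid", "loss/valid_epoch" -> "loss_epoch/valid"
    for suffix in ("_step", "_epoch"):
        if key.endswith(suffix) and "/" in key:
            group, _, rest = key[: -len(suffix)].partition("/")
            return f"{group}{suffix}/{rest}"
    return key

# inside log_metrics above:
# metrics = {_regroup(k): v for k, v in metrics.items()}
```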
TezRomacH commented 3 years ago

@rohitgr7 I like this! Very clever solutions :)

But hopefully it will be possible to do this out of the box in PL :)

TezRomacH commented 3 years ago
  2. Configure log_metrics inside a custom logger:

     class CustomTensorboardLogger(pl.loggers.TensorBoardLogger):
         def log_metrics(self, metrics, step):
             metrics = ...  # update/rename the metric keys here
             super().log_metrics(metrics, step)

     trainer = Trainer(..., logger=CustomTensorboardLogger(save_dir='.', name='lightning_logs'))

Did you mean a custom ModelCheckpoint here? It is the ModelCheckpoint that does not save the file the way it is supposed to.

rohitgr7 commented 3 years ago

Ok yeah, ignore the 2nd solution; it won't work with both the logger and model_checkpoint.

TezRomacH commented 3 years ago

I would like to contribute the fix, at least for the second feature. As far as I can see, it will be enough to add character escaping to the format_checkpoint_name method in ModelCheckpoint.

TezRomacH commented 3 years ago

just some workarounds:

  1. Log them separately, and lightning won't alter the keys
def training_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("loss_step/train", loss, on_step=True, on_epoch=False)
    self.log("loss_epoch/train", loss, on_step=False, on_epoch=True)
    return loss

def validation_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("loss_step/valid", loss, on_step=True, on_epoch=False)
    self.log("loss_epoch/valid", loss, on_step=False, on_epoch=True)

Will it work the same way if I log not a loss (i.e. a torch.Tensor or float) but a torchmetrics.Metric instance (see the screenshot)? Will the underlying update(), compute(), and reset() logic work correctly with this duplicated logging?

[Screenshot: the torchmetrics.Metric logging referred to above]
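
For context, a minimal fragment sketch of the pattern being asked about (the class name and metric names are made up; torchmetrics.Accuracy with task="binary" is just an example Metric):

```py
import pytorch_lightning as pl
import torchmetrics


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_acc = torchmetrics.Accuracy(task="binary")  # any Metric works here

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        self.val_acc(preds, y)
        # the same Metric object is logged under two keys; whether update()/compute()/
        # reset() behave correctly with this duplicated logging is the open question
        self.log("acc_step/valid", self.val_acc, on_step=True, on_epoch=False)
        self.log("acc_epoch/valid", self.val_acc, on_step=False, on_epoch=True)
```
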
rohitgr7 commented 3 years ago

I think it should, but I'm not 100% sure. There might be some internal re-computations, though.

rohitgr7 commented 3 years ago

For the second case, maybe this can work:

class CustomModelCheckpoint(pl.callbacks.ModelCheckpoint):
    def format_checkpoint_name(self, *args, **kwargs):
        filepath = super().format_checkpoint_name(*args, **kwargs)
        # replace `/` with something else, e.g. `_`
        # (note: this simple version also touches any `/` in the directory part of the path)
        filepath = filepath.replace("/", "_")
        return filepath

wdyt?
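
For illustration, it might be wired up roughly like this, continuing the snippet above (the monitor and filename values here are made up for the example):

```py
ckpt = CustomModelCheckpoint(
    monitor="loss_epoch/valid",
    # with the default auto_insert_metric_name=True this would normally expand to
    # "best-loss_epoch/valid=0.123", which contains the problematic "/"
    filename="best-{loss_epoch/valid:.3f}",
)
trainer = Trainer(callbacks=[ckpt])
```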

TezRomacH commented 3 years ago

wdyt?

I need to check this in my pipeline.

Hmm, could we maybe add a new parameter escaped_chars: Optional[List[str]] = None to pl.callbacks.ModelCheckpoint?

By default it would be None and no characters would be escaped, so the code would remain backward-compatible.
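
A rough sketch of what that could look like, written as a subclass here since escaped_chars is only a proposal and not an existing ModelCheckpoint argument:

```py
from typing import List, Optional

import pytorch_lightning as pl


class EscapingModelCheckpoint(pl.callbacks.ModelCheckpoint):
    """Sketch of the proposed escaped_chars behaviour (hypothetical API)."""

    def __init__(self, *args, escaped_chars: Optional[List[str]] = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.escaped_chars = escaped_chars  # None keeps today's behaviour

    def format_checkpoint_name(self, *args, **kwargs):
        filepath = super().format_checkpoint_name(*args, **kwargs)
        for ch in self.escaped_chars or []:
            filepath = filepath.replace(ch, "_")  # replace each escaped char with an underscore
        return filepath
```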