Open TezRomacH opened 3 years ago
Hi, I would like to try this issue.
just some workarounds:

1. Log them separately, and Lightning won't alter the keys

```py
def training_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("loss_step/train", loss, on_step=True, on_epoch=False)
    self.log("loss_epoch/train", loss, on_step=False, on_epoch=True)
    return loss

def validation_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("loss_step/valid", loss, on_step=True, on_epoch=False)
    self.log("loss_epoch/valid", loss, on_step=False, on_epoch=True)
```
2. Configure log_metrics inside CustomLogger

```py
class CustomTensorboardLogger(pl.loggers.TensorBoardLogger):
    def log_metrics(self, metrics, step):
        # update keys here, e.g. remap the suffixes into "/" groups
        metrics = {k.replace("_step", "/step"): v for k, v in metrics.items()}
        super().log_metrics(metrics, step)

trainer = Trainer(..., logger=CustomTensorboardLogger(save_dir='.', name='lightning_logs'))
```
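The key-rewriting step can be exercised on its own, without a Trainer. The mapping below (moving Lightning's `_step`/`_epoch` suffix into a `/` group) is just one illustrative choice, not the only possible remapping:

```python
def regroup(metrics: dict) -> dict:
    # Move a trailing "_step"/"_epoch" suffix into a "/" group so that
    # dashboards like Tensorboard/ClearML group them into separate tabs,
    # e.g. "loss_step" -> "loss/step" (illustrative mapping only).
    out = {}
    for key, value in metrics.items():
        for suffix in ("_step", "_epoch"):
            if key.endswith(suffix):
                key = key[: -len(suffix)] + "/" + suffix[1:]
                break
        out[key] = value
    return out

print(regroup({"loss_step": 0.5, "loss_epoch": 0.4}))
# {'loss/step': 0.5, 'loss/epoch': 0.4}
```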
@rohitgr7 I like this! Very clever solutions :)
But hopefully it will be possible to do this out of the box in PL :)
> Configure log_metrics inside CustomLogger
>
> ```py
> class CustomTensorboardLogger(pl.loggers.TensorBoardLogger):
>     def log_metrics(self, metrics, step):
>         metrics = ...  # <- update keys here
>         super().log_metrics(metrics, step)
>
> trainer = Trainer(..., logger=CustomTensorboardLogger(save_dir='.', name='lightning_logs'))
> ```
Did you mean a custom ModelCheckpoint here? Because it's the ModelCheckpoint that does not save the file exactly as it is supposed to.
ok yeah. ignore the 2nd solution, it won't work with both logger and model_checkpoint.
I would like to contribute the fix, at least for the second feature. As far as I can see, it will be enough to add character escaping to the `format_checkpoint_name` method in ModelCheckpoint.
> just some workarounds:
>
> 1. Log them separately, and lightning won't alter the keys
>
> ```py
> def training_step(self, batch, batch_idx):
>     loss = self(batch).sum()
>     self.log("loss_step/train", loss, on_step=True, on_epoch=False)
>     self.log("loss_epoch/train", loss, on_step=False, on_epoch=True)
>     return loss
>
> def validation_step(self, batch, batch_idx):
>     loss = self(batch).sum()
>     self.log("loss_step/valid", loss, on_step=True, on_epoch=False)
>     self.log("loss_epoch/valid", loss, on_step=False, on_epoch=True)
> ```
Will it work the same way if I log not a loss (i.e., a `torch.Tensor` or `float`) but a `torchmetrics.Metric` instance (see the screenshot)? Will the underlying `update()`, `compute()`, `reset()` logic work fine with this duplicated logging?
I think it should, but I'm not 100% sure. There might be some re-computations internally, though.
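A plain-Python sketch of the `update()`/`compute()`/`reset()` contract that `torchmetrics.Metric` follows may clarify why duplicated logging is safe in principle (the `MeanMetric` class below is illustrative, not the torchmetrics implementation):

```python
# Logging the same metric object under two keys re-reads the accumulated
# state; it does not update the state twice. Only update() mutates state.
class MeanMetric:
    def __init__(self):
        self.reset()

    def update(self, values):
        # accumulate state once per batch
        self.total += sum(values)
        self.count += len(values)

    def compute(self):
        # safe to call repeatedly: reads state, never mutates it
        return self.total / self.count

    def reset(self):
        self.total = 0.0
        self.count = 0

m = MeanMetric()
m.update([1.0, 3.0])
step_val = m.compute()   # 2.0
epoch_val = m.compute()  # same state read again -> still 2.0
assert step_val == epoch_val
```

The re-computation the reply mentions would be `compute()` running more than once per step, which costs time but does not change the result.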
for the second case maybe this can work

```py
class CustomModelCheckpoint(pl.callbacks.ModelCheckpoint):
    def format_checkpoint_name(self, *args, **kwargs):
        filepath = super().format_checkpoint_name(*args, **kwargs)
        filepath = filepath.replace("/", "_")  # replace `/` with something else
        return filepath
```
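The subclass above needs a running Trainer to exercise, but the escaping itself can be sketched standalone. Note that replacing every `/` would also mangle the checkpoint directory, so a safer variant (with `_` as an assumed substitute character) touches only the filename part:

```python
def escape_checkpoint_name(filepath: str, dirpath: str) -> str:
    # Keep the real checkpoint directory intact and escape "/" only in the
    # filename portion built from metric names ("_" is an assumed choice).
    name = filepath[len(dirpath):].lstrip("/")
    return f"{dirpath}/{name.replace('/', '_')}"

print(escape_checkpoint_name("ckpts/loss_epoch/valid=0.10.ckpt", "ckpts"))
# ckpts/loss_epoch_valid=0.10.ckpt
```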
wdyt?
I need to check this in my pipeline
hmm, could we maybe add a new parameter `escaped_chars: Optional[List[str]] = None` to `pl.callbacks.ModelCheckpoint`? By default it will have the value `None` and no chars will be escaped, so the code will remain backward-compatible.
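A hypothetical sketch of how such a parameter could behave — `escaped_chars` and `format_name` are proposals from this thread, not part of pytorch-lightning's API:

```python
from typing import List, Optional

def format_name(name: str, escaped_chars: Optional[List[str]] = None) -> str:
    # None -> no escaping, preserving today's behaviour (backward-compatible);
    # "_" as the replacement character is an assumption from the pitch.
    for ch in escaped_chars or []:
        name = name.replace(ch, "_")
    return name

print(format_name("loss_epoch/valid"))         # loss_epoch/valid
print(format_name("loss_epoch/valid", ["/"]))  # loss_epoch_valid
```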
🚀 Feature
Motivation
With Lightning, I use Tensorboard and ClearML, which allow you to group metrics in one tab. The "/" is used for grouping. In the screenshots from tb and ClearML you can see how they are grouped now.
Pitch

The first is that when you select `on_step=True` and `on_epoch=True`, a postfix is appended to the end, and the metrics by epoch and step are grouped into one in ClearML. I would like to split this group into two: `loss_step/<train or valid>` and `loss_epoch/<train or valid>`. Could the `name` parameter in `self.log` be formatted? What do you think? This would be something like `name="loss_{interval}/valid"`.

The second problem also involves formatting. When you pass a grouped metric to `filename` in ModelCheckpoint, Lightning will think "/" means to create a folder 🙂 See screenshot. Would it be possible to add escaping for the slash and replace it with an underscore?
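The pitched placeholder could work roughly like standard `str.format` expansion — `interval` is the placeholder name from this pitch, not a Lightning API, and today the same effect needs two explicit `self.log()` calls per metric:

```python
def expand_metric_name(template: str, interval: str) -> str:
    # Hypothetical: Lightning would substitute "step" or "epoch" for
    # {interval} depending on which aggregation is being logged.
    return template.format(interval=interval)

print(expand_metric_name("loss_{interval}/valid", "step"))   # loss_step/valid
print(expand_metric_name("loss_{interval}/valid", "epoch"))  # loss_epoch/valid
```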
Alternatives
Additional context
Similar problem, but with W&B in Lightning-Flash https://github.com/PyTorchLightning/lightning-flash/issues/818
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning.
- Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.