Open · nicoloesch opened this issue 1 year ago
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
@nicoloesch could you provide a minimal working example? I was not able to reproduce the issue with your settings. Specifically, the following code works as expected (i.e. it does not show any jump around ~12k steps):
```python
from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel
from pytorch_lightning.loggers import WandbLogger

class CustomModel(BoringModel):
    def training_step(self, batch, batch_idx):
        loss = self.step(batch)
        self.log("train_loss", loss, batch_size=10, on_step=True, on_epoch=True)
        return {"loss": loss}

logger = WandbLogger()
model = CustomModel()
# tmpdir is a pytest fixture; replace it with any writable path when running standalone
trainer = Trainer(max_steps=15000, log_every_n_steps=50, logger=logger, default_root_dir=tmpdir)
trainer.fit(model)
```
My suspicion is that there is an interaction with the size of the training and validation sets, so it may be worth including those in the example; a sketch of how they could be wired in follows.
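For instance, reusing the `trainer` and `model` from above (the training size of 200 mirrors the numbers in your report; the validation size is a guess, since the report does not specify it):

```python
from torch.utils.data import DataLoader

from pytorch_lightning.demos.boring_classes import RandomDataset

# 200 training samples at batch_size=10 -> 20 optimiser steps per epoch,
# i.e. fewer steps per epoch than log_every_n_steps=50.
train_loader = DataLoader(RandomDataset(32, 200), batch_size=10)
val_loader = DataLoader(RandomDataset(32, 40), batch_size=10)  # size is a guess

trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```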
@baskrahmer Thanks for the reply! I have recently changed over to the automated `self.log` in combination with the `torchmetrics` implementation, and also to the `WandbLogger` as opposed to the `TensorBoardLogger`. For background, I initially called the functional interface of `torchmetrics` and created my own dictionary that was then logged; most likely that is where the mistake arises (?). Since I changed over to fully automatic logging with the class interface of `torchmetrics` (a sketch of the new setup is below), I have not observed the behaviour from my initial bug report. I am not sure if you want to keep the issue open for some time (maybe a month, as I currently use it almost daily) so I can comment under it as soon as I observe the behaviour again, or if you/I close this issue and I open a new one with a working example if it occurs again?
Cheers, Nico
Bug description
Utilising automated logging with `self.log` and `self.log_dict` as described in the documentation results in a shift of the logging frequency after varying numbers of steps. This is observed in all train metrics, but only in the `_step` series. It could be exclusive to the `WandbLogger`, but it has already been observed to some extent with the `TensorBoardLogger`, as reported in #13525 and more generally in #10436.

How to reproduce the bug
Call `self.log(..., on_step=True, on_epoch=True)` in `training_step` and let it run for more than 15k steps (in my case). The logging rate is initially equal to `log_every_n_steps=50` for some iterations but jumps around wildly for others. The relevant settings are:

- `batch_size=10` (specified in `self.log(batch_size=10)`)
- `test_subjects=20`
- `samples_per_subject=10`

This equals 200 samples per epoch and 20 steps per epoch. Even if `log_every_n_steps=50`, this should then log precisely every 100 steps (as the first 50 is not met, according to my understanding) and not jump around from 2 to 200.
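A minimal sketch of the setup described above (the model, dataset, and metric name are placeholders for my actual pipeline):

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning.loggers import WandbLogger

class ReproModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # on_step=True and on_epoch=True: only the *_step series shows the jump
        self.log("train_loss", loss, batch_size=10, on_step=True, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# 20 test_subjects x 10 samples_per_subject = 200 samples
# -> 20 steps per epoch at batch_size=10
dataset = TensorDataset(torch.randn(200, 32), torch.randn(200, 1))
# WandbLogger requires a wandb login; drop logger= to fall back to TensorBoard
trainer = pl.Trainer(max_steps=20000, log_every_n_steps=50, logger=WandbLogger())
trainer.fit(ReproModel(), DataLoader(dataset, batch_size=10))
```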
Environment