Lightning-AI / pytorch-lightning


Change in logging frequency for automated logging #16821

Open · nicoloesch opened this issue 1 year ago

nicoloesch commented 1 year ago

Bug description

Utilising automated logging with self.log and self.log_dict, as described in the documentation, results in a shift of the logging frequency after a varying number of steps.

(attached screenshot: change_log_freq)

This is also observed for all train metrics, but only for the _step values. It could be specific to the WandbLogger, but it has already been observed to some extent with the TensorboardLogger, as reported in #13525 and more generally in #10436.

How to reproduce the bug

Call self.log(..., on_step=True, on_epoch=True) in training_step and let it run for more than 15k steps (in my case). The logging rate initially matches log_every_n_steps=50 for some iterations but jumps around wildly for others.

- batch_size=10 (specified via self.log(batch_size=10))
- test_subjects=20
- samples_per_subject=10

This equals 200 samples per epoch and 20 steps per epoch. Even with log_every_n_steps=50, logging should then happen precisely every 100 steps (since the first 50 is never reached within an epoch, according to my understanding) and not jump around between intervals of 2 and 200 steps.
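
For reference, a minimal sketch of the setup described above (the model, dataset, and metric name are illustrative placeholders rather than the actual code from this report):

    import torch
    import pytorch_lightning as pl
    from pytorch_lightning.loggers import WandbLogger
    from torch.utils.data import DataLoader, TensorDataset

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.mse_loss(self.layer(x), y)
            # log the per-step value and the epoch aggregate, passing the batch size explicitly
            self.log("train_loss", loss, on_step=True, on_epoch=True, batch_size=10)
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    # 20 subjects x 10 samples each = 200 samples per epoch; batch_size=10 -> 20 steps per epoch
    dataset = TensorDataset(torch.randn(200, 32), torch.randn(200, 1))
    trainer = pl.Trainer(max_steps=20000, log_every_n_steps=50, logger=WandbLogger())
    trainer.fit(LitModel(), DataLoader(dataset, batch_size=10))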

Environment

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

baskrahmer commented 1 year ago

@nicoloesch could you provide a minimum working example? I was not able to reproduce the issue with your settings. Specifically, the following code works as expected (i.e. not showing any jump at around ~12k steps):

    # imports added for completeness (assuming the public BoringModel demo path)
    from pytorch_lightning import Trainer
    from pytorch_lightning.demos.boring_classes import BoringModel
    from pytorch_lightning.loggers import WandbLogger

    class CustomModel(BoringModel):
        def training_step(self, batch, batch_idx):
            loss = self.step(batch)
            # log both the per-step value and the epoch aggregate, as in the report
            self.log("train_loss", loss, batch_size=10, on_step=True, on_epoch=True)
            return {"loss": loss}

    logger = WandbLogger()
    model = CustomModel()
    # tmpdir: a temporary directory (e.g. the pytest tmpdir fixture)
    trainer = Trainer(max_steps=15000, log_every_n_steps=50, logger=logger, default_root_dir=tmpdir)
    trainer.fit(model)

(attached screenshot of the logged metric)

My suspicion is that there is an interaction with the sizes of the training and validation sets, so maybe these could be included in the reproduction.
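
For example, a sketch (untested against this particular issue) of how the reproduction above could pin the training and validation set sizes using the same BoringModel helpers; the sizes below are placeholders:

    from torch.utils.data import DataLoader
    from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset

    class SizedModel(BoringModel):
        """Same logging as CustomModel above, but with fixed train/val dataset sizes."""

        def training_step(self, batch, batch_idx):
            loss = self.step(batch)
            self.log("train_loss", loss, batch_size=10, on_step=True, on_epoch=True)
            return {"loss": loss}

        def train_dataloader(self):
            # 200 samples / batch_size 10 -> 20 optimizer steps per epoch, as in the report
            return DataLoader(RandomDataset(32, 200), batch_size=10)

        def val_dataloader(self):
            # placeholder validation set size
            return DataLoader(RandomDataset(32, 50), batch_size=10)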

nicoloesch commented 1 year ago

@baskrahmer Thanks for the reply! I have recently switched over to automated self.log in combination with the torchmetrics class implementation, and also to the WandbLogger instead of the TensorboardLogger. For background: initially I called the functional interface of torchmetrics and built my own dictionary that was then logged - most likely that is where the mistake arose (?). Since switching over to fully automatic logging with the class interface of torchmetrics, I have not observed the behaviour from my initial bug report.

I am not sure whether you want to keep the issue open for some time (maybe a month, as I currently use it almost daily) and I will comment under it as soon as I observe the behaviour again, or whether you/I close this issue and I open a new one with a working example if it occurs again?

Cheers, Nico
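
For context, the change described in the last comment corresponds roughly to the sketch below (the metric, the binary task, and all names are illustrative assumptions; torchmetrics >= 0.11 is assumed for the task argument):

    import torch
    import torchmetrics
    import pytorch_lightning as pl

    class ManualDictLogging(pl.LightningModule):
        """Before: functional torchmetrics calls collected into a dict and logged manually."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            probs = torch.sigmoid(self.layer(x))
            loss = torch.nn.functional.binary_cross_entropy(probs, y)
            metrics = {
                "train/loss": loss,
                "train/acc": torchmetrics.functional.accuracy(probs, y.int(), task="binary"),
            }
            self.log_dict(metrics, on_step=True, on_epoch=True, batch_size=10)
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    class ClassMetricLogging(ManualDictLogging):
        """After: a torchmetrics Metric module logged directly; Lightning handles accumulation and reset."""

        def __init__(self):
            super().__init__()
            self.train_acc = torchmetrics.Accuracy(task="binary")

        def training_step(self, batch, batch_idx):
            x, y = batch
            probs = torch.sigmoid(self.layer(x))
            loss = torch.nn.functional.binary_cross_entropy(probs, y)
            self.train_acc(probs, y.int())
            self.log("train/loss", loss, on_step=True, on_epoch=True, batch_size=10)
            # logging the Metric object itself lets Lightning compute the epoch value from the metric states
            self.log("train/acc", self.train_acc, on_step=True, on_epoch=True, batch_size=10)
            return loss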