Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Why are losses different when logging from '_step' (with on_epoch=True) compared to logging from '_epoch_end'? #5539

Closed kapsner closed 3 years ago

kapsner commented 3 years ago

🐛 Bug

When logging losses from {prefix}_step with self.log("{prefix}_loss", loss, on_step=False, on_epoch=True), they differ from the losses logged in {prefix}_epoch_end using:

avg_loss = torch.stack([x["loss"] for x in outputs]).mean()
self.log(
    name="loss/" + prefix,
    value=avg_loss,
    prog_bar=False,
    logger=True,
    on_step=False,
    on_epoch=True
)

[screenshot of the logged losses]

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1Sz9kgGuMWxcAPOZ7SPcm4XYhsoWgwOFt?usp=sharing

To Reproduce

Run the code from the link (I copied it into a .py-file and ran it from commandline) and see csv-file with logged losses (under ./default/version_{int}/metrics.csv).

Expected behavior

{prefix}_loss (logged from {prefix}_step) and loss/{prefix} (logged from {prefix}_epoch_end) should be equal.

Environment

Additional context

The differences in losses seem marginal with this BoringModel; however, it is unclear to me why this happens at all. From my understanding, there should be no difference between logging losses from the steps with on_epoch=True and logging them from the epoch_end hooks.

github-actions[bot] commented 3 years ago

Hi! Thanks for your contribution, great first issue!

rohitgr7 commented 3 years ago

check the answer here: https://forums.pytorchlightning.ai/t/understanding-different-values-of-training-validation-loss-in-callback-metrics-dictionary/568/2?u=goku

kapsner commented 3 years ago

Thanks for sharing this post (and sorry for missing it beforehand).

Do I understand it correctly that the loss calculated as a weighted mean over the steps (train_loss_epoch in the forum link) is 'more trustworthy' than the unweighted avg_loss computed in the epoch_end hooks?

rohitgr7 commented 3 years ago

Ideally yes.. let's say the batch sizes are [64, 64, 64, 7]... then taking a plain (unweighted) average of the per-batch losses isn't actually correct, because the last, smaller batch gets the same weight as the full ones.
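
A minimal sketch of the difference (made-up loss values, with the batch sizes from the example above; the weighted form is roughly what Lightning's on_epoch reduction does):

import torch

# hypothetical per-batch mean losses and the batch sizes from the example above
losses = torch.tensor([0.50, 0.40, 0.45, 0.90])
batch_sizes = torch.tensor([64.0, 64.0, 64.0, 7.0])

unweighted = losses.mean()                                   # treats the 7-sample batch like a full one
weighted = (losses * batch_sizes).sum() / batch_sizes.sum()  # each loss weighted by its batch size

print(unweighted.item())  # 0.5625
print(weighted.item())    # ~0.4658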

kapsner commented 3 years ago

I have now tried to reproduce the weighted averaging in {prefix}_epoch_end, but I am still not able to reproduce the numbers that are logged from the steps.

The BoringModel above was updated as follows to calculate weighted means in the epoch-end hooks:

def _shared_epoch_end(self, outputs, prefix):
    # stack the batch sizes (one entry per step)
    batch_sizes = torch.stack([torch.tensor(
        x["batch_size"],
        dtype=torch.int16
    ) for x in outputs])

    # stack the per-step losses
    losses = torch.stack([x["loss"] for x in outputs])

    # weighted average: weight each step's loss by its batch size
    avg_loss = torch.sum(losses * batch_sizes) / torch.sum(batch_sizes)

    self.log(
        name="loss/" + prefix,
        value=avg_loss,
        prog_bar=False,
        logger=True,
        on_step=False,
        on_epoch=True
    )

def training_step(self, batch, batch_idx):
    output = self.layer(batch)
    loss = self.loss(batch, output)
    self.log(
        name="train_loss",
        value=loss,
        prog_bar=False,
        logger=True,
        on_step=False,
        on_epoch=True
    )
    return {"loss": loss, "batch_size": batch.shape[0]}

def training_step_end(self, training_step_outputs):
    return training_step_outputs

def training_epoch_end(self, outputs) -> None:
    self._shared_epoch_end(outputs, "train")

def validation_step(self, batch, batch_idx):
    output = self.layer(batch)
    loss = self.loss(batch, output)
    self.log(
        name="valid_loss",
        value=loss,
        prog_bar=False,
        logger=True,
        on_step=False,
        on_epoch=True
    )
    return {"loss": loss, "batch_size": batch.shape[0]}

def validation_epoch_end(self, outputs) -> None:
    self._shared_epoch_end(outputs, "valid")

def test_step(self, batch, batch_idx):
    output = self.layer(batch)
    loss = self.loss(batch, output)
    return {"loss": loss, "batch_size": batch.shape[0]}

def test_epoch_end(self, outputs) -> None:
    self._shared_epoch_end(outputs, "test")

However, losses calculated from {prefix}_step and {prefix}_epoch_end still differ slightly: [screenshot of the updated metrics.csv]

msank00 commented 3 years ago

@kapsner : Is there any random state which may cause these subtle differences? Just a thought.

kapsner commented 3 years ago

Well, in the BoringModel, seed_everything and deterministic=True are set, so I assume there should be no randomness. Maybe I am overlooking something.

rohitgr7 commented 3 years ago

@kapsner mind sharing a reproducible Colab notebook with the updated code? Will check.

kapsner commented 3 years ago

@rohitgr7 Thanks in advance, I have updated the notebook accordingly.

rohitgr7 commented 3 years ago

I couldn't find the solution, but after experimenting I found that the issue is with torch.dot here: https://github.com/PyTorchLightning/pytorch-lightning/blob/6926b849372fe8f7bc6d5fa8c9eb3ba856645534/pytorch_lightning/core/step_result.py#L1077

If you do (result.float() * weights.float()).sum(), which is equivalent to torch.dot here, it gives the correct result up to the last decimal, but with torch.dot there is some issue. Tried this solution too, but it's not working for me.
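
For illustration, a minimal standalone sketch (random data, not the BoringModel) comparing the two reductions; both compute the same mathematical sum, and any last-bit mismatch comes from a different floating-point accumulation order, so it may or may not show up on a given machine:

import torch

torch.manual_seed(42)
losses = torch.rand(10_000) * 100                  # fake per-step losses
weights = torch.randint(1, 65, (10_000,)).float()  # fake batch sizes used as weights

via_dot = torch.dot(losses, weights)
via_mul_sum = (losses * weights).sum()

# print both reductions and their difference (often 0.0, occasionally a few ulps)
print(via_dot.item(), via_mul_sum.item(), (via_dot - via_mul_sum).item())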

awaelchli commented 3 years ago

I suppose torch.dot flattens both tensors before doing the product. Is there something left to fix or was it resolved now by reproducing the weighted average?

tchaton commented 3 years ago

Hey @awaelchli @rohitgr7,

Batch_size is attached there: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py#L341

I don't remember adding a test for weighted average. We should add a test to make sure it works fine.

Best, T.C

rohitgr7 commented 3 years ago

There is no problem with batch_size here... the problem is with the dot product torch.dot, as I explained above.. most of the time it's correct, but sometimes the results are off in the last decimal places. For example, if your batch_size=1 then torch.dot(losses, batches) should equal losses.sum()... but sometimes torch.dot gives something like 12345.0 where torch.sum gives 12345.25000... which leads to these minor differences in the final results.

tchaton commented 3 years ago

Hey @rohitgr7,

What I meant is: we should add a test to make sure we properly reduce with weighted_mean, independently of torch.dot.

Best, T.C

rohitgr7 commented 3 years ago

@tchaton even if we do that with some random tensors, we still can't ensure that it will pass all the time, because the mismatch happens very rarely. Maybe we should not use torch.dot at all in the weighted_mean function? WDYT?
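
For reference, a hedged sketch of the alternative rohitgr7 suggests: a standalone weighted_mean helper (hypothetical, not the actual step_result.py code) that replaces torch.dot with an elementwise multiply followed by a sum:

import torch

def weighted_mean(result: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # reduce per-step values to one epoch value, weighting each step by its batch size;
    # elementwise multiply + sum instead of torch.dot, per the suggestion above
    weights = weights.to(result.dtype)
    return (result * weights).sum() / weights.sum()

# sanity check: with all weights equal to 1 this matches a plain mean
values = torch.tensor([1.25, 2.75, 3.50])
print(weighted_mean(values, torch.ones(3)))  # tensor(2.5000)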