Closed kapsner closed 3 years ago
Hi! Thanks for your contribution, great first issue!
Thanks for sharing this post (and sorry for missing it beforehand).
Do I understand correctly that the loss calculated as a weighted mean from the steps (train_loss_epoch
from the forum link) is 'more trustworthy' than the unweighted avg_loss from the epoch_ends?
Ideally, yes. Let's say the batch sizes are [64, 64, 64, 7]; then taking a plain average of the per-batch losses isn't actually correct.
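The point about uneven batch sizes can be sketched numerically (all values below are made up for illustration, not taken from the thread):

```python
# With batch sizes [64, 64, 64, 7], a plain mean of per-batch losses
# over-weights the small final batch; a size-weighted mean does not.
import torch

batch_sizes = torch.tensor([64.0, 64.0, 64.0, 7.0])
batch_losses = torch.tensor([0.50, 0.40, 0.30, 0.90])  # mean loss of each batch

unweighted = batch_losses.mean()
weighted = (batch_losses * batch_sizes).sum() / batch_sizes.sum()
# unweighted is 0.525, while the true per-sample mean is about 0.418:
# the loss of the 7-sample batch counts as much as that of a 64-sample batch.
```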
I have now tried to replicate the weighted averaging in {prefix}_epoch_end, but I am still not able to reproduce the numbers logged from the steps.
The BoringModel above was updated as follows to calculate weighted means on epoch end:
def _shared_epoch_end(self, outputs, prefix):
    # stack the per-step batch sizes into one tensor
    batch_sizes = torch.stack([
        torch.tensor(x["batch_size"], dtype=torch.int16)
        for x in outputs
    ])
    # stack the per-step losses
    losses = torch.stack([x["loss"] for x in outputs])
    # weighted average of the losses, weighted by batch size
    avg_loss = torch.sum(losses * batch_sizes) / torch.sum(batch_sizes)
    self.log(
        name="loss/" + prefix,
        value=avg_loss,
        prog_bar=False,
        logger=True,
        on_step=False,
        on_epoch=True
    )

def training_step(self, batch, batch_idx):
    output = self.layer(batch)
    loss = self.loss(batch, output)
    self.log(
        name="train_loss",
        value=loss,
        prog_bar=False,
        logger=True,
        on_step=False,
        on_epoch=True
    )
    return {"loss": loss, "batch_size": batch.shape[0]}

def training_step_end(self, training_step_outputs):
    return training_step_outputs

def training_epoch_end(self, outputs) -> None:
    self._shared_epoch_end(outputs, "train")

def validation_step(self, batch, batch_idx):
    output = self.layer(batch)
    loss = self.loss(batch, output)
    self.log(
        name="valid_loss",
        value=loss,
        prog_bar=False,
        logger=True,
        on_step=False,
        on_epoch=True
    )
    return {"loss": loss, "batch_size": batch.shape[0]}

def validation_epoch_end(self, outputs) -> None:
    self._shared_epoch_end(outputs, "valid")

def test_step(self, batch, batch_idx):
    output = self.layer(batch)
    loss = self.loss(batch, output)
    return {"loss": loss, "batch_size": batch.shape[0]}

def test_epoch_end(self, outputs) -> None:
    self._shared_epoch_end(outputs, "test")
However, losses calculated from {prefix}_step and {prefix}_epoch_end still differ slightly:
@kapsner: Is there any random state
which may cause these subtle differences? Just a thought.
Well, in the BoringModel, seed_everything
and deterministic=True
are set, so I assume there should be no randomness. Maybe I am overlooking something.
@kapsner mind sharing a reproducible Colab notebook with the updated code? Will check.
Couldn't find the solution, but after experimenting I found that the issue is with torch.dot
here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/6926b849372fe8f7bc6d5fa8c9eb3ba856645534/pytorch_lightning/core/step_result.py#L1077
If you do (result.float() * weights.float()).sum(),
which is mathematically equivalent to the torch.dot
call here, it gives the correct result up to the last decimal, but with torch.dot
there is some issue. Tried this solution too, but it's not working for me.
I suppose torch.dot flattens both tensors before doing the product. Is there something left to fix or was it resolved now by reproducing the weighted average?
Hey @awaelchli @rohitgr7,
The batch_size is attached here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py#L341
I don't remember adding a test for weighted average. We should add a test to make sure it works fine.
Best, T.C
There is no problem with batch_size
here; the problem is with the dot product torch.dot,
as I explained above. Most of the time it's correct, but sometimes the results are off in the last decimal places. For example, if your batch_size=1,
then torch.dot(losses, batches) == losses.sum()
should hold, but sometimes torch.dot
returns something like 12345.0
when the result actually should be 12345.25000
(what torch.sum gives), which leads to these minor differences in the final results.
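The two reductions being compared above can be sketched as follows (made-up random data; the thread's report is that the two occasionally disagree in the last float32 digits, even though they are mathematically identical):

```python
import torch

torch.manual_seed(0)
losses = torch.rand(1000) * 1000                  # made-up per-step losses
weights = torch.randint(1, 65, (1000,)).float()   # made-up batch sizes

# reduction via dot product, as in step_result.py
via_dot = torch.dot(losses, weights) / weights.sum()
# reduction via explicit elementwise product and sum
via_sum = (losses * weights).sum() / weights.sum()

# The two should agree within float32 rounding; bit-exact equality is
# not guaranteed, which is exactly the discrepancy reported here.
assert torch.allclose(via_dot, via_sum)
```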
Hey @rohitgr7,
What I meant is: we should add a test to make sure we properly reduce with weighted_mean, independently of torch.dot.
Best, T.C
@tchaton even if we do that with some random tensors, we still can't ensure that it will pass every time, because the issue occurs very rarely. Maybe we should not use torch.dot
at all in the weighted_mean
function? WDYT?
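A dot-free weighted_mean along the lines suggested here might look like this (a hypothetical sketch, not Lightning's actual implementation):

```python
import torch

def weighted_mean(result: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # Reduce with an explicit elementwise product instead of torch.dot,
    # sidestepping the reported last-digit discrepancy.
    weights = weights.to(result.dtype)
    return (result * weights).sum() / weights.sum()
```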
🐛 Bug
When logging losses from {prefix}_step with
self.log("{prefix}_loss", loss, on_step=False, on_epoch=True)
, they are different from the losses logged in {prefix}_epoch_end.
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1Sz9kgGuMWxcAPOZ7SPcm4XYhsoWgwOFt?usp=sharing
To Reproduce
Run the code from the link (I copied it into a .py-file and ran it from commandline) and see csv-file with logged losses (under ./default/version_{int}/metrics.csv).
Expected behavior
{prefix}_loss (logged from {prefix}_step) and loss/{prefix} (logged from {prefix}_epoch_end) should be equal.
Environment
Additional context
The differences in losses seem marginal using this BoringModel; however, it is unclear to me why they happen at all. From my understanding, there should be no difference between logging losses from the steps with
on_epoch=True
and logging them from the epoch_ends.