File: train.py
Function: train_one_epoch(...)

Description:
With the loss function's reduction set to 'mean' (the default) and the DataLoader's drop_last set to False (the default), calling loss.backward() without accounting for the potentially smaller final batch gives that batch a disproportionate effect on the parameter update (and, more generally, on the optimizer behaviour): because the loss is averaged per batch and every batch triggers one optimizer step, each sample in the small final batch carries batch_size / final_batch_size times the weight of a sample in a full batch. For example, with batch_size=32 and 100 samples, the final batch holds 4 samples, each carrying 8x the weight of a sample in a full batch.
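To make the effect concrete, below is a minimal, hedged sketch of the loop shape being described; the dataset, model, optimizer, and criterion are illustrative placeholders, not the actual contents of train.py.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical setup: 100 samples with batch_size=32, so the final batch holds only 4.
dataset = TensorDataset(torch.randn(100, 10), torch.randn(100, 1))
loader = DataLoader(dataset, batch_size=32)             # drop_last=False (default)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()                                # reduction='mean' (default)

for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)   # averaged over the current batch, whatever its size
    loss.backward()
    optimizer.step()                # the 4-sample batch still drives a full-magnitude step
```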
Proposed solutions:
1. Multiply the loss by x.size(0) / val_loader.batch_size before calling loss.backward() (i.e. before the gradient computation); see the sketch after this list. The drawback is that this factor equals 1, i.e. is a no-op, for every batch except the last.
2. Apply an opposite, compensating parameter update after the batch-iterating loop to counteract the final update. However, for optimizers less trivial than plain SGD, optimizer.step() does more than update parameters from the gradient (momentum buffers and adaptive statistics also change), so the step cannot simply be undone.
3. Require drop_last to be True, which discards the incomplete final batch and bypasses the issue entirely (at the cost of never training on those samples within an epoch).
4. Require the reduction method to be set to 'sum'. This, however, is highly inconvenient, as it makes the learning rate and the batch size co-dependent (the magnitude of each update then scales with the batch size); that is exactly why 'mean' is usually preferred for easier parameterization.
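For reference, a minimal sketch of solution 1 under the same illustrative setup as above (not the actual train.py code); the nominal batch size is read from the loader's batch_size attribute, mirroring the x.size(0) / val_loader.batch_size expression proposed in item 1.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 10), torch.randn(100, 1))
loader = DataLoader(dataset, batch_size=32)             # final batch: 4 samples
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()                                # reduction='mean'

for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    # Re-weight the mean loss by actual vs. nominal batch size.
    # The factor is 1.0 for every full batch and 4/32 = 0.125 for the final one,
    # so the final batch's contribution becomes proportional to its size.
    loss = loss * (x.size(0) / loader.batch_size)
    loss.backward()
    optimizer.step()

# Solution 3 for comparison: drop the incomplete final batch altogether.
# loader = DataLoader(dataset, batch_size=32, drop_last=True)
```

Under plain SGD this scaling makes every sample contribute equally to the epoch's updates; with adaptive optimizers (e.g. Adam) the correction is only approximate, since the scaled gradient still flows through per-batch running statistics.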