Closed Dewald928 closed 4 years ago
@Dewald928 Thanks for reporting this.
After a quick run on that notebook ("examples/lrfinder_mnist.ipynb"), I found that running `range_test()` with a `val_loader` finishes just as quickly as running without one, which is definitely abnormal. I'll keep investigating.
UPDATE: Currently, it seems there is something wrong in commit 52c189a.
OK, I figured out why `range_test()` runs just as quickly as it does without a `val_loader`.
In `range_test()`, the following loop works normally on the first iteration:
```python
# @LRFinder.range_test()
# ...
for iteration in tqdm(range(num_iter)):
    # Train on batch and retrieve loss
    loss = self._train_batch(
        train_iter,
        accumulation_steps,
        non_blocking_transfer=non_blocking_transfer,
    )
    if val_loader:
        loss = self._validate(
            val_iter, non_blocking_transfer=non_blocking_transfer
        )
# ...
```
However, `val_iter._iterator` has run out of values after that first iteration and is never reset on the following ones. Hence `self._validate()` doesn't do anything and just returns its default output: 0.0 (accumulated as `running_loss` in that method). Therefore, the `loss` returned by `self._train_batch()` is overwritten with 0.0, which is then fed into the loss smoothing and best-loss tracking below:
```python
# Track the best loss and smooth it if smooth_f is specified
if iteration == 0:
    self.best_loss = loss
else:
    if smooth_f > 0:
        loss = smooth_f * loss + (1 - smooth_f) * self.history["loss"][-1]
    if loss < self.best_loss:
        self.best_loss = loss
```
And that's why the lr-loss curve flattens out, as in the result you provided.
I'll make a patch for it later.
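To make the failure mode concrete, here is a minimal, PyTorch-free sketch. A plain list stands in for the `val_loader`, and `validate()` mimics what `_validate()` does (loop over the iterator, accumulate a running loss, return 0.0 if the loop body never runs). `ResettingIter` is a hypothetical illustration of the kind of fix needed, not the exact patch:

```python
# Minimal reproduction of the exhausted-iterator bug described above.
def validate(val_iter):
    """Mimics LRFinder._validate(): accumulate a loss over val_iter."""
    running_loss = 0.0
    for batch_loss in val_iter:  # zero iterations if val_iter is exhausted
        running_loss += batch_loss
    return running_loss


class ResettingIter:
    """Hypothetical sketch of a fix (not the exact patch): rebuild the
    underlying iterator whenever a new for-loop starts, so every
    validation pass sees the data again."""

    def __init__(self, iterable):
        self.iterable = iterable
        self._iterator = iter(iterable)

    def __iter__(self):
        # A for-loop calls __iter__ first; restart from the beginning here.
        self._iterator = iter(self.iterable)
        return self

    def __next__(self):
        return next(self._iterator)


batches = [1.0, 2.0, 3.0]  # stand-in batch losses for a tiny val_loader

bare = iter(batches)
print(validate(bare))  # 6.0 -- first pass consumes everything
print(validate(bare))  # 0.0 -- exhausted: the loop body never runs

wrapped = ResettingIter(batches)
print(validate(wrapped))  # 6.0
print(validate(wrapped))  # 6.0 -- the wrapper restarts on each pass
```

The second `validate(bare)` call returning 0.0 with no error is exactly why the bug is silent: an exhausted iterator simply yields nothing, so the default `running_loss` is returned.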
Hey, man. Any progress on this issue?
@NaleRaphael fixed the issue in #60, which has been merged to the master branch.
I'm closing this issue. Thanks for reporting it, and feel free to reopen if needed.
Yeah, that was my bad; I didn't see those commits. After updating the package, the evaluation with the validation loader takes much longer to finish, so I guess it works now.
I copied your example notebook to Colab and ran the code without changing anything, but the validation loss I get goes flat, which is clearly wrong when compared to your example. I've also seen this with my other networks: the loss just goes flat.
You can see my results from colab and your example in the figures below.
EDIT: If I replace `val_iter` with `val_loader` inside `loss = self._validate(...)`, it does seem to "work" as I'd expect. So there seems to be a mistake somewhere in how `val_iter` is iterated.
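For what it's worth, a likely reason that replacement appears to work: a container-style iterable (a list, or a PyTorch `DataLoader`) hands out a fresh iterator every time a `for` loop asks for one, while a bare iterator object can only be consumed once. A tiny illustration with a plain list standing in for the loader:

```python
loader = [10, 20, 30]    # container iterable, standing in for a DataLoader

# Passing the container: each for-loop calls iter(loader) and gets a
# fresh iterator, so every "validation pass" sees all the data.
totals = []
for _ in range(2):
    total = 0
    for x in loader:
        total += x
    totals.append(total)
print(totals)  # [60, 60]

# Passing one shared iterator (like val_iter): the second pass finds it
# already exhausted and silently sums nothing.
it = iter(loader)
totals = []
for _ in range(2):
    total = 0
    for x in it:
        total += x
    totals.append(total)
print(totals)  # [60, 0]
```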