davidtvs / pytorch-lr-finder

A learning rate range test implementation in PyTorch
MIT License

Validation loader flat loss #59

Closed Dewald928 closed 4 years ago

Dewald928 commented 4 years ago

I copied your example notebook to Colab and ran the code without changing anything. However, the validation loss I get goes flat, which is clearly wrong when compared to your example. I also see the same behavior with my other networks: the loss just goes flat.

You can see my results from colab and your example in the figures below.

EDIT: If I replace val_iter with val_loader inside loss = self._validate(...), it does seem to "work" as I'd expect. So there seems to be a mistake somewhere in how val_iter is iterated.

[Figures: lr-loss curves from my Colab run and from the example notebook]
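For reference, the workaround mentioned in the edit above amounts to a change along these lines inside range_test() (just a sketch of what I changed, not necessarily the proper fix; the surrounding code may differ slightly):

# Workaround sketch inside LRFinder.range_test(): pass the DataLoader itself
# instead of the (exhausted) val_iter wrapper to _validate().
if val_loader:
    loss = self._validate(
        val_loader, non_blocking_transfer=non_blocking_transfer
    )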
NaleRaphael commented 4 years ago

@Dewald928 Thanks for reporting this.

After a quick run of that notebook ("examples/lrfinder_mnist.ipynb"), I found that running range_test() with val_loader is just as quick as running it without val_loader, which is clearly abnormal. I'll keep investigating.

UPDATE: Currently, it seems there is something wrong in commit 52c189a.

NaleRaphael commented 4 years ago

OK, I figured out why range_test() with a val_loader runs as quickly as it does without one.

In range_test(), the following loop works normally at the first iteration:

# @LRFinder.range_test()
# ...
for iteration in tqdm(range(num_iter)):
    # Train on batch and retrieve loss
    loss = self._train_batch(
        train_iter,
        accumulation_steps,
        non_blocking_transfer=non_blocking_transfer,
    )
    if val_loader:
        loss = self._validate(
            val_iter, non_blocking_transfer=non_blocking_transfer
        )
# ...

However, val_iter._iterator has run out of values after that first iteration and is never reset on subsequent iterations. As a result, self._validate() does nothing and simply returns its default output of 0.0 (the initial value of running_loss in that method). The loss returned by self._train_batch() is therefore overwritten with 0.0, which is then fed into the following code:

# Track the best loss and smooth it if smooth_f is specified
if iteration == 0:
    self.best_loss = loss
else:
    if smooth_f > 0:
        loss = smooth_f * loss + (1 - smooth_f) * self.history["loss"][-1]
    if loss < self.best_loss:
        self.best_loss = loss

And that's why the lr-loss curve goes flat, as in the result you provided.
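To make the failure mode concrete, here is a minimal standalone sketch (plain Python, no LRFinder code) of how an exhausted iterator silently yields nothing when reused, so a running loss stays at its initial 0.0:

# Minimal illustration (not library code): an exhausted iterator yields nothing
# on reuse, so a validation loop driven by it never runs and returns the initial 0.0.
data = [1.0, 2.0, 3.0]          # stand-in for per-batch validation losses
val_iter = iter(data)

def fake_validate(batches):
    running_loss = 0.0
    for x in batches:           # zero iterations once `batches` is exhausted
        running_loss += x
    return running_loss

print(fake_validate(val_iter))  # 6.0 -- first call consumes the iterator
print(fake_validate(val_iter))  # 0.0 -- iterator is exhausted, loop body never runs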

I'll make a patch for it later.
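One possible shape for such a patch (just a sketch with illustrative names, not necessarily what the merged fix will look like) is a thin wrapper whose __iter__ always hands back a fresh iterator over the underlying DataLoader, so every validation pass starts from the beginning:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative only -- the class name and design here are hypothetical, not the actual patch.
class RestartableValIter:
    def __init__(self, data_loader):
        self.data_loader = data_loader

    def __iter__(self):
        # A new iterator per validation pass, so nothing is left half-consumed.
        return iter(self.data_loader)

# Usage sketch: iterating twice yields the full validation set both times.
ds = TensorDataset(torch.arange(6).float().unsqueeze(1), torch.zeros(6))
val_iter = RestartableValIter(DataLoader(ds, batch_size=2))
print(sum(1 for _ in val_iter))  # 3 batches
print(sum(1 for _ in val_iter))  # 3 batches again, instead of 0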

ivanpanshin commented 4 years ago

Hey, man. Any progress on this issue?

davidtvs commented 4 years ago

@NaleRaphael fixed the issue in #60, which has been merged into the master branch.

I'm closing this issue. Thanks for reporting it, and feel free to reopen if needed.

ivanpanshin commented 4 years ago

Yeah, that was my bad; I didn't see those commits. I updated the package, and now the evaluation with the validation loader takes much longer to finish, so I guess it works now.