davidtvs / pytorch-lr-finder

A learning rate range test implementation in PyTorch
MIT License

Lr-finder with multiple inputs, outputs and losses #78

Closed YashRunwal closed 7 months ago

YashRunwal commented 3 years ago

Hello,

Firstly, thank you for this wonderful library. I have a model which expects 2 inputs. I am working with 2 kinds of images, one of size (512, 1536) and the other of size (128, 384). Therefore, my train_loader contains 2 inputs and one target of shape (128, 384, 16). My model has 4 prediction heads and hence is trained using 4 losses for different purposes.

So my collate_fn for the data loader looks like this:

import torch


def detection_collate(batch):
    """Custom collate fn for dealing with batches of images that have a different
    number of associated object annotations (bounding boxes).
    Arguments:
        batch: (tuple) A tuple of tensor images and lists of annotations
    Return:
        A tuple containing:
            1) (tensor) batch of images stacked on their 0 dim
            2) (list of tensors) annotations for a given image are stacked on
                                 0 dim
    """
    targets = []
    imgs = []
    deps = []
    for sample in batch:
        imgs.append(sample[0])
        deps.append(sample[1])
        targets.append(sample[2])
    return torch.stack(imgs, 0), torch.stack(deps, 0), torch.stack(targets, 0)

As mentioned, there are 4 different losses: Custom Heatmap (Focal) loss, SmoothL1, SmoothL1, BCE loss.

The forward method of the model expects 2 inputs. A small snippet is shown below:

 def forward(self, x, dep=None, target=None):
        # Backbone: ResNet18, x is image size: (512, 1536)

Here, the targets are the labels, so to speak.

In this case, how do I go about finding the best learning rate using lr-finder? Note that I can only use batch_size=2 because of computational limitations.

NaleRaphael commented 3 years ago

Hi @YashRunwal!

In this scenario, maybe you can try using lr-finder with gradient accumulation to simulate a larger batch for training. This functionality is integrated into this library via nvidia/apex, so you can try it out. (We have a pull request for migrating the mixed-precision training and gradient accumulation functionality from nvidia/apex to PyTorch's native AMP, but I'm sorry that I haven't had enough time to keep working on it recently.)
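
For reference, the general idea of gradient accumulation looks roughly like the sketch below. This is plain PyTorch rather than this library's API, and names such as accumulation_steps, train_loader, model, criterion, and optimizer are placeholders:

# Minimal sketch of plain gradient accumulation to simulate a larger
# effective batch size. All names here are placeholders.
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(train_loader):
    outputs = model(inputs)
    # Average the loss over the virtual batch
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()  # gradients accumulate across the small batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per virtual batch
        optimizer.zero_grad()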

Regarding the setup for data loading with multiple inputs and back-propagating multiple losses, you can check out this comment to see whether it helps. The basic idea is to create a wrapper to deal with custom inputs.

And if anything is confusing or you run into further problems implementing it, please feel free to let me know!

YashRunwal commented 3 years ago

I have used gradient accumulation before: I backpropagated the gradients after 64 steps (simulating a batch size of 64). But let me check out how to use lr-finder with this. I will get back to you in case I need any help. Thanks for replying promptly, I really appreciate it.

YashRunwal commented 3 years ago

@NaleRaphael Hi, so I followed that link and this. The link you mentioned only had one loss function in the model, so I checked a few issues and stumbled upon the aforementioned link. A snippet of the training script looks like this:

def train_snippet():
  for epoch in range(0, max_epochs):
     for iter_i, (images, dep_images, targets) in enumerate(train_loader):
         cls_loss, txty_loss, twth_loss, _, _, dep_loss = model(images, dep_images, target=targets)

Can I just use total_loss = cls_loss + txty_loss + twth_loss + dep_loss and then pass this total_loss as the criterion to the AccumulationLRFinder?

criterion= total_loss
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

lr_finder = AccumulationLRFinder(
    model, optimizer, criterion, device="cuda", 
    accumulation_steps=accumulation_steps
)

Or is it important to create a custom loss wrapper? In my case that is very difficult, as the loss depends on outputs of the backbone (I am using an encoder-decoder network) and other variables too.

NaleRaphael commented 3 years ago

Can I just use total_loss = cls_loss + txty_loss + twth_loss + dep_loss and then pass this total_loss as the criterion to the AccumulationLRFinder?

~Yes, you can do that, it basically works like the custom wrapper for multiple loss functions.~

(updated) Oh, sorry, I got it wrong. You might still need to create a class as a wrapper for the multiple losses. Since the criterion passed to AccumulationLRFinder should be a torch.nn.Module instance whose forward() / __call__() will be invoked, I currently cannot think of an alternative approach that would let us just pass an already-calculated loss into it.

But there is one thing you might have to check: if there are significant differences in magnitude among those losses, you might also want to add weights to them to ensure the overall loss isn't dominated by one of them all the time.
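
For illustration, a minimal sketch of such a wrapper might look like the following. MultiLossWrapper, loss_fns, and weights are made-up names, not part of this library:

import torch.nn as nn


class MultiLossWrapper(nn.Module):
    """Hypothetical criterion that combines several loss functions.

    `loss_fns` is a list of callables, each taking (outputs, targets);
    `weights` are optional coefficients to balance their magnitudes.
    """

    def __init__(self, loss_fns, weights=None):
        super().__init__()
        self.loss_fns = loss_fns
        self.weights = weights if weights is not None else [1.0] * len(loss_fns)

    def forward(self, outputs, targets):
        return sum(w * fn(outputs, targets)
                   for w, fn in zip(self.weights, self.loss_fns))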

YashRunwal commented 3 years ago

Hi,

Okay, I will try to create the wrapper. I think I also have to make some changes to the model: currently it returns the losses, and for this wrapper it should instead return the prediction heads, which can then be passed to the wrapper to calculate the losses. I will do this and get back to you, probably tomorrow.

You are right, one loss (txty_loss) dominates. How do I add weights to the losses, though?

NaleRaphael commented 3 years ago

Regarding the weights for the losses, you might need to decide them after finding out the possible value range of each loss function; otherwise it is hard to choose proper ones. Once you know the value ranges, you can try a weighting approach like assigning a coefficient to each loss while keeping the sum of those coefficients equal to 1. It is a similar concept to a weighted binary cross-entropy loss.
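
For example (the coefficients below are purely illustrative and reuse the loss names from your earlier snippet):

# Illustrative only: coefficients chosen so that they sum to 1
w_cls, w_txty, w_twth, w_dep = 0.4, 0.2, 0.2, 0.2
total_loss = (w_cls * cls_loss + w_txty * txty_loss
              + w_twth * twth_loss + w_dep * dep_loss)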

YashRunwal commented 3 years ago

@NaleRaphael Hi,

Sorry I couldn't reply yesterday, I was busy with something else. I have made changes to the model and created a custom loss wrapper, but now the problem is with the ModelWrapper class.

class MyTrainDataLoaderIter(TrainDataLoaderIter):
    def inputs_labels_from_batch(self, batch_data):
        # Batch contains: sig, dep, target
        *desired_data, target = batch_data   # desired_data: sig, dep
        return desired_data, target

class ModelWrapper(nn.Module):
    def __init__(self, model):
        super(ModelWrapper, self).__init__()
        self.model = model

    def forward(self, data):
        sig, dep, target = data
        return self.model(sig, dep, target)

As seen from the detection_collate function in the question, I have 3 inputs to the model. When I check the output of the MyTrainDataLoaderIter class as shown below, the output is correct.

test = next(iter(MyTrainDataLoaderIter(train_loader)))
len(test)  # 2
len(test[0])  # 2  # sig, dep
len(test[1])  # 1 # target

But in the ModelWrapper class, when I check len(data), it should contain 2 variables (*desired_data, target), matching the output of the custom data loader shown above. However, it contains only sig and dep (the ones unpacked in the forward function of ModelWrapper) and doesn't contain the variable target. My model expects 3 inputs: sig, dep, target. The output of the DataLoader is correct, but the data variable of ModelWrapper isn't.

Any ideas?

NaleRaphael commented 3 years ago

Hi @YashRunwal, no worry, just take your time!

Regarding the situation you mentioned: does that mean the model performs some operations with the target, which is also used to calculate the loss, when model.forward() is invoked? If so, maybe you can duplicate the reference to the target variable in the inputs_labels_from_batch method so that desired_data also contains target, i.e.

class MyTrainDataLoaderIter(TrainDataLoaderIter):
    def inputs_labels_from_batch(self, batch_data):
        # Batch contains: sig, dep, target
        *desired_data, target = batch_data   # desired_data: sig, dep

        # Keep the output a 2-element tuple.
        # In this case, the output would be `(sig, dep, target), target`
        return (*desired_data, target), target

# Or in a simpler way:
class MyTrainDataLoaderIter(TrainDataLoaderIter):
    def inputs_labels_from_batch(self, batch_data):
        *desired_data, target = batch_data   # desired_data: sig, dep
        return batch_data, target

And note that the DataLoaderIter classes are designed to follow the conventional PyTorch training pipeline, i.e.

for i, batch in enumerate(data_loader_iter):
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_func(outputs, targets)    # <- here

    loss.backward()
    optimizer.step()
    # ...

So if the loss has to be calculated inside model.forward(), you might want to avoid it being calculated again at the line marked in the snippet above. To achieve that, you can simply create a wrapper for the loss function and let it do nothing in forward().

YashRunwal commented 3 years ago

@NaleRaphael Thanks for replying so quickly. Yes, as you guessed, the loss is calculated inside the forward function of the model. So I created a wrapper for the loss as shown below; however, I am still getting an error.

import torch.nn as nn


class CustomLossFunction(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, outputs):
        # Intentionally returns nothing; the model already computes the losses.
        pass


def find_lr(device, desired_batch_size=32, real_batch_size=1):
    model = load_model()
    criterion = CustomLossFunction()

    # accumulation_steps = desired_batch_size // real_batch_size
    # lr_finder = AccumulationLRFinder(
    #     model, optimizer, criterion, device=device,
    #     accumulation_steps=accumulation_steps
    # )

    trainloader_wrapper = dataset_loader()
    model_wrapper = ModelWrapper(model).to(device)

    optimizer = optim.Adam(model_wrapper.parameters(),
                           lr=1e-3,
                           weight_decay=5e-4)
    lr_finder = LRFinder(model_wrapper.to(device), optimizer, criterion, device=device)
    lr_finder.range_test(trainloader_wrapper, end_lr=1, num_iter=10, step_mode='exp', start_lr=1e-5)

    lr_finder.plot()
    lr_finder.reset()

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
find_lr(device)

I am trying the plain LRFinder first, and later I will clone the Acc_grad branch to use AccumulationLRFinder. The error I am getting is:

Traceback (most recent call last):
  File "D:/FH-AACHEN/Thesis/labelImg_test_Annotation/src/lr_finder.py", line 128, in <module>
    find_lr(device)
  File "D:/FH-AACHEN/Thesis/labelImg_test_Annotation/src/lr_finder.py", line 121, in find_lr
    lr_finder.range_test(trainloader_wrapper, end_lr=1, num_iter=10, step_mode='exp', start_lr=1e-5)
  File "F:\anaconda\envs\detectron2\lib\site-packages\torch_lr_finder\lr_finder.py", line 320, in range_test
    non_blocking_transfer=non_blocking_transfer,
  File "F:\anaconda\envs\detectron2\lib\site-packages\torch_lr_finder\lr_finder.py", line 381, in _train_batch
    loss /= accumulation_steps
TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int'

But I am not sure why it expects me to use accumulation_steps. Note that I am using batch_size=1. I think I am missing something and can't figure it out :(

Edit: I checked the source code, and in the range_test function I can see that the accumulation_steps argument defaults to 1. I get the same error for batch_size=2. If I increase the batch size any further, I run into a CUDA out-of-memory error :(

NaleRaphael commented 3 years ago

Oh, I forgot that the loss calculated after the model.forward() step is required for computing the accumulated loss inside LRFinder._train_batch(). Since the wrapper for the loss function returns nothing, it leads to the error TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int' when it tries to divide the loss (which is None here) by accumulation_steps. So changing batch_size won't solve this issue.

In this case, the easier way to get it working might be to create a custom LRFinder class that overrides _train_batch(). And if I understand correctly, the following constraints are the ones we currently have to meet, right?

  1. loss will be calculated inside model.forward()
  2. loss will be returned from model.forward()

If so, you might want to rewrite _train_batch() like this:

def _train_batch(self, train_iter, accumulation_steps, non_blocking_transfer=True):
    # ...
    for i in range(accumulation_steps):
        inputs, labels = next(train_iter)
        inputs, labels = self._move_to_device(
            inputs, labels, non_blocking=non_blocking_transfer
        )

        outputs = self.model(inputs)    # assume that `loss` will be included in `outputs`

        # ----- no need to change the code above -----
        # Loss calculation by the loss function outside the `model.forward()` is removed here.

        # Unpack outputs to retrieve loss here
        # Note that the left-hand side values depend on the order of output in your `model.forward()`
        # Here we assume that the `loss` calculated in `model.forward()` will be returned as the last variable
        *_, loss = outputs

        # ----- no need to change the following code -----

        # Loss should be averaged in each step
        loss /= accumulation_steps
        # ...

But if loss.backward() is invoked inside model.forward(), things get more complicated, since there is some code for the gradient accumulation mechanism inside _train_batch(). Generally, we would suggest moving the loss calculation out of model.forward(), since it's usually good practice to leave the model responsible only for predicting (as it will be used in inference mode once it is trained).
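
For illustration, the suggested split might look roughly like the sketch below; the variable names follow the earlier train_snippet() and are only illustrative:

# Sketch of the suggested split: the model only predicts, and the loss is
# computed and backpropagated outside model.forward().

# model.forward() returns only the prediction heads (no loss calculation):
outputs = model(images, dep_images)

# the criterion (a wrapper around your loss functions) computes the loss outside:
loss = criterion(outputs, targets)

loss.backward()
optimizer.step()
optimizer.zero_grad()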

YashRunwal commented 3 years ago

Yes, currently I have the following inside the forward function of the model:

# All losses
if not self.custom:
    cls_loss, txty_loss, twth_loss, iou_loss, iou_aware_loss = loss.loss(
        pred_cls=cls_pred,
        pred_txty=txty_pred,
        pred_twth=twth_pred,
        pred_iou=iou_pred,
        pred_iou_aware=iou_aware_pred,
        label=target,
        num_classes=self.num_classes
    )
    return cls_loss, txty_loss, twth_loss, iou_loss, iou_aware_loss

i.e. I have a separate script that calculates the losses based on the heads, and these losses are then returned from the forward function.

In the same forward function I have a boolean called trainable. This self.trainable variable is set to False during model.eval() mode while training. With this argument set to True or False, I can either get the losses during the training stage (which are then backpropagated) or the predictions during the eval stage (which are used to calculate the mAP).

For example, this is a small snippet from the training script:

 for iter_i, (images, dep_img, targets) in enumerate(train_loader):
            # images = images.view(1, 3, 512, 1536)
            # Warmup strategy
            if warmup:
                if epoch < 1:
                    tmp_lr = base_lr * pow((iter_i + epoch * epoch_size) * 1. / (1 * epoch_size), 4)
                    set_lr(optimizer, tmp_lr)
            images = images.to(device)
            targets = targets.to(device)
            dep_img = dep_img.to(device)

            # Freeze BN Layers
            model.apply(freeze_BN_layers)

            # Forward function and compute losses
            cls_loss, txty_loss, twth_loss, iou_loss, iou_aware_loss, depth_loss = model(images, dep_img ,
                                                                                         target=targets)

So considering your advice, I wrote this:

class CustomLrFinder(LRFinder):
    def __init__(self,
                 model,
                 optimizer,
                 criterion,
                 device=None,
                 memory_cache=True,
                 cache_dir=None):
        # Let the base class set up history, state caching, etc.
        super().__init__(model, optimizer, criterion, device=device,
                         memory_cache=memory_cache, cache_dir=cache_dir)

    def _train_batch(self, train_iter, accumulation_steps, non_blocking_transfer=True):
        self.model.train()
        total_loss = None  # for late initialization

        self.optimizer.zero_grad()
        for i in range(accumulation_steps):
            inputs, labels = next(train_iter)
            inputs, labels = self._move_to_device(
                inputs, labels, non_blocking=non_blocking_transfer
            )

            # Forward pass
            outputs = self.model(inputs)
            # loss = self.criterion(outputs, labels)

            cls_loss, txty_loss, twth_loss, iou_loss, iou_aware_loss, depth_loss = outputs

            # Loss should be averaged in each step
            total_loss = cls_loss + txty_loss + twth_loss + iou_loss + iou_aware_loss + depth_loss
            total_loss /= accumulation_steps

            # Backward pass

        return total_loss.item()

What do you think?

NaleRaphael commented 3 years ago

Yes, that should work. But you also need to call loss.backward() inside _train_batch(), otherwise the weights in the model won't be updated. Or will loss.backward() also be invoked in model.forward(), depending on the trainable argument?

YashRunwal commented 3 years ago

No, it won't, you are correct. I have to call it here inside the wrapper.

Do I not need to add self.optimizer.step() at the end of the for loop?

One more thing: do you remember we discussed weighting the losses when one of them has more influence? For me, the txty_loss has the most influence. I searched a lot about this but couldn't find anything, so I opened a topic on the PyTorch forum (https://discuss.pytorch.org/t/adding-weights-to-the-losses/129908), but no luck as of yet.

Do you think it would be a huge issue if I don't weight my losses? :)

I will also try to find the lr and get back to you asap.

NaleRaphael commented 3 years ago

Got it, hope it works soon! And just take your time, we are not in a hurry :)

And yes, self.optimizer.step() should be added too. But self.optimizer.zero_grad() is already invoked in the first few lines of _train_batch(), so you don't need to call it again. (You can think of _train_batch() as the function that organizes all the steps needed to process one batch of inputs; see the sketch after this paragraph.)
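
For reference, a sketch of what the completed override could look like, loosely following the library's original _train_batch() (where loss.backward() runs once per accumulation step and optimizer.step() once after the loop); the unpacked loss names come from your snippet above:

def _train_batch(self, train_iter, accumulation_steps, non_blocking_transfer=True):
    self.model.train()
    total_loss = None  # for late initialization

    self.optimizer.zero_grad()
    for i in range(accumulation_steps):
        inputs, labels = next(train_iter)
        inputs, labels = self._move_to_device(
            inputs, labels, non_blocking=non_blocking_transfer
        )

        # Forward pass; the model returns its losses directly
        outputs = self.model(inputs)
        cls_loss, txty_loss, twth_loss, iou_loss, iou_aware_loss, depth_loss = outputs

        # Combine and average over the accumulation steps
        loss = cls_loss + txty_loss + twth_loss + iou_loss + iou_aware_loss + depth_loss
        loss /= accumulation_steps

        # Backward pass for this accumulation step
        loss.backward()

        total_loss = loss if total_loss is None else total_loss + loss

    self.optimizer.step()
    return total_loss.item()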

As for the weighting of the losses, I would say that if the overall loss is always dominated by one of them, the other losses won't help the model learn the objectives they represent. Hence I would consider loss weighting an issue you want to take care of.

And since this kind of weighting is meant for combining different kinds of losses rather than for a single loss with imbalanced targets (e.g. imbalanced classes in multi-class classification), you can just multiply those losses by some coefficients, like:

total_loss = w_a*cls_loss + w_b*txty_loss + w_c*twth_loss + w_d*iou_loss + w_e*dep_loss

However, it really depends on the task you want to solve, and the weights themselves have to be determined experimentally. The simplest way is to find out the possible value range of each loss and use those ranges to decide the weights, which is essentially a normalization of each loss.
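
As a sketch (the magnitudes below are made up), one could scale each loss by the inverse of its typical value observed over a few iterations:

# Illustrative only: scale each loss by the inverse of its typical magnitude
# so that no single term dominates the sum. The values are made up.
typical = {"cls": 2.0, "txty": 40.0, "twth": 5.0, "iou": 1.0, "dep": 3.0}
w = {k: 1.0 / v for k, v in typical.items()}

total_loss = (w["cls"] * cls_loss + w["txty"] * txty_loss + w["twth"] * twth_loss
              + w["iou"] * iou_loss + w["dep"] * dep_loss)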

This is an interesting topic, maybe you might want to check out some papers related to multi-task learning or loss function design? :) e.g.

YashRunwal commented 3 years ago

@NaleRaphael So I tried num_iter=10 and I received a blank plot. So I increased the num_iter to 50 and I get the following: image

Doesn't this plot look a bit weird? Why is the loss on Y-axis so high?

NaleRaphael commented 3 years ago

The blank plot is likely caused by the default argument skip_start=10 of LRFinder.plot(), which excludes the first 10 data points from the plotted record. See also: https://github.com/davidtvs/pytorch-lr-finder/blob/acc5e7ee7711a460bf3e1cc5c5f05575ba1e1b4b/torch_lr_finder/lr_finder.py#L440-L448

But in the figure you posted, there might be more points in front of the steepest point that were excluded before being plotted. You can set skip_start=0 and re-plot to see whether the loss curve is actually steep over the first 10 points.

As for the high loss, you might have to check the model output at each iteration. I guess it could also be affected by the lack of training warm-up while applying LRFinder to your model.

Also, the LRFinder instance has a field called history (lr_finder.history); you can re-plot the figure from it and check whether there is any further problem in the recorded values.
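
For example, assuming the history dict holds parallel "lr" and "loss" lists (which is how this library records them), a quick manual re-plot could look like:

import matplotlib.pyplot as plt

lrs = lr_finder.history["lr"]
losses = lr_finder.history["loss"]

plt.plot(lrs, losses)   # no points skipped, unlike the default skip_start=10
plt.xscale("log")
plt.xlabel("Learning rate")
plt.ylabel("Loss")
plt.show()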

NaleRaphael commented 3 years ago

And just as a guess: if the txty_loss you use involves some pixel-level calculation, that might be why it always dominates the overall loss. In that case, you might want to normalize it by the width and height of the image. Otherwise, it might also be a reason why you get such a high loss in the first few iterations.

YashRunwal commented 3 years ago

Wow, so much information to digest. I think I need some time to read and understand everything. Haha! Thank you! Also,

  1. You guessed it correctly. The dataset is extremely imbalanced and I will definitely take a look at those links you have sent.
  2. I changed skip_start to 0 and no, the loss curve isn't steep over the first 10 points (plot below).
  3. txty_loss is related to pixels, yes. It regresses the center coordinates of a bounding box. It is actually normalized, but I will take another look, thank you. I think I need to train with more samples; I have only 2700 training images. I will also try the warmup strategy; I am using this for training anyway.
  4. Does this repo help with understanding the number of epochs the model must be trained for? From reading a few blogs and papers, it is the hyperparameter that frustrates me the most.
  5. I will also try to use gradient accumulation with LRFinder and then post the results here, or ask here in case of any issue.

image

Here I set start_lr to 1e-6 and it recommends LR=1.25e-6. When I set start_lr to 1e-5, it recommends LR=1.25e-5. But I've also read somewhere that the lower the learning rate, the more epochs the model should be trained for. Or am I just imagining it? :(

NaleRaphael commented 3 years ago

It was nothing, I'm glad it helps!

  1. I changed skip_start to 0 and no, the loss curve isn't steep over the first 10 points (plot below)

Yeah, but it would be more precise to check it by calculating the gradient of the filtered loss, just like this line does: https://github.com/davidtvs/pytorch-lr-finder/blob/acc5e7ee7711a460bf3e1cc5c5f05575ba1e1b4b/torch_lr_finder/lr_finder.py#L505 Nevertheless, it's still only a recommended point; you can pick another learning rate based on that result.

  1. Does this repo help with understanding the number of epochs the model must be trained for? From reading a few blogs and papers, it is the hyperparameter that frustrates me the most.

Unfortunately, no. Actually, most training implementations rely on an early-stopping strategy. That means you run your model on the validation set every epoch (or every N iterations) to see whether it is starting to overfit the training set, and you stop training once the validation loss has stopped decreasing for a number of epochs/iterations or even starts to increase. At that point, we assume the model is the best one we can get.
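
As an illustration only, a patience-based early-stopping loop usually looks something like this (train_one_epoch, evaluate, max_epochs, and the data loaders are placeholders):

import torch

best_val_loss = float("inf")
patience = 10            # how many epochs without improvement to tolerate
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)       # placeholder training step
    val_loss = evaluate(model, val_loader)     # placeholder validation step

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss stopped improving; assume the saved model is the best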

However, there is also a paper that indirectly discusses this strategy. It's called the deep double descent phenomenon; you can check out the section on "epoch-wise double descent". From what I know, it shows that the early-stopping strategy might not be suitable for some cases. But since we don't have the resources to explore all possible solutions, early stopping is still a good way to train a model for now.

Therefore, regarding the question for setting the number of epochs to train, this might be the thing you want to know.

YashRunwal commented 3 years ago

@NaleRaphael Hi, Sorry for the late reply. Had to take a break ;)

Isn't early stopping used when the model starts to overfit? I believe my model is underfitting. Presumably I need more data, but still.

NaleRaphael commented 3 years ago

Hi @YashRunwal, no worries, we all have other things to do; just take your time!

Yes, the early-stopping strategy is used to prevent overfitting. But since you mentioned that your model might be underfitting, you might want to try some more data augmentation techniques if it's hard to collect more training data. albumentations is a library I would recommend.

Regarding data augmentation, I think it should take no extra effort to use together with LRFinder. In any case, you can update this thread if you run into any problem related to LRFinder.
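
For example, a typical albumentations pipeline for detection-style data looks roughly like the sketch below (the transforms shown are standard ones from the library; the exact parameters and bbox format depend on your dataset):

import albumentations as A

# A small augmentation pipeline; bbox_params keeps the boxes in sync
# with the transformed image.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]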

davidtvs commented 7 months ago

Closing due to inactivity