Loss is None while training

matteosodano commented 3 years ago

I ran a training on the SUNRGBD dataset and got the error "Loss is None". Inspecting the code, it seems it can only be caused by the loss function in ESANet/src/utils.py, specifically here:

    number_of_pixels_per_class = torch.bincount(targets.flatten().type(self.dtype), minlength=self.num_classes)
    divisor_weighted_pixel_sum = torch.sum(number_of_pixels_per_class[1:] * self.weight)  # without void
    losses.append(torch.sum(loss_all) / divisor_weighted_pixel_sum)

My assumption is that divisor_weighted_pixel_sum can be 0 with some very 'unlucky' random cropping.
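A minimal sketch of that failure mode, to make the assumption concrete. It is not the ESANet code: the shapes, the uniform class weights, and the way void pixels are ignored (labels shifted by one, `ignore_index=-1`) are assumptions made for illustration. It only shows that a batch whose targets are all void (label 0) gives a divisor of 0 and therefore a NaN loss.

```python
import torch
import torch.nn.functional as F

num_classes_with_void = 4                      # hypothetical: void (0) + 3 real classes
weight = torch.ones(num_classes_with_void - 1) # hypothetical class weights, void excluded

logits = torch.randn(2, num_classes_with_void - 1, 4, 4)  # made-up network output
targets = torch.zeros(2, 4, 4, dtype=torch.long)          # every pixel is void

# Assumption: void is ignored by shifting labels by one and ignoring index -1.
loss_all = F.cross_entropy(
    logits, targets - 1, weight=weight, reduction="none", ignore_index=-1
)

number_of_pixels_per_class = torch.bincount(
    targets.flatten(), minlength=num_classes_with_void
)
divisor_weighted_pixel_sum = torch.sum(number_of_pixels_per_class[1:] * weight)  # without void

# Both numerator and divisor are zero, so the floating-point division is 0 / 0.
loss = torch.sum(loss_all) / divisor_weighted_pixel_sum
print(divisor_weighted_pixel_sum.item(), torch.isnan(loss).item())
```

If this is what happens, any check that requires a finite loss would fail on that batch.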

The following modification seems to solve the problem:

    divisor_weighted_pixel_sum = torch.sum(number_of_pixels_per_class[1:] * self.weight).clamp(min=1e-5)  # without void
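A small sketch of what the clamp changes, using made-up pixel counts and uniform weights (both hypothetical, not taken from ESANet). With an all-void batch the divisor becomes 1e-5 instead of 0, so the loss is 0 rather than NaN; with a normal batch the divisor is far above 1e-5 and is left unchanged.

```python
import torch

weight = torch.ones(3)  # hypothetical class weights, void excluded

def weighted_divisor(number_of_pixels_per_class: torch.Tensor) -> torch.Tensor:
    # Index 0 is void and is skipped; clamp keeps the divisor strictly positive.
    return torch.sum(number_of_pixels_per_class[1:] * weight).clamp(min=1e-5)

# All-void batch: 32 void pixels, no pixel of any real class.
void_counts = torch.tensor([32, 0, 0, 0])
loss_all = torch.zeros(32)  # assumption: ignored pixels contribute zero loss
void_loss = torch.sum(loss_all) / weighted_divisor(void_counts)

# Normal batch: the clamp has no effect on the divisor.
normal_counts = torch.tensor([10, 12, 6, 4])
normal_divisor = weighted_divisor(normal_counts)

print(void_loss.item(), normal_divisor.item())
```

An alternative would be to skip such a batch entirely instead of clamping; which is preferable depends on how the training loop handles a zero loss.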

Let me know if you ever experienced something similar, or if you have a better fix.

danielS91 commented 3 years ago

We have never faced this problem. The factor for random scaling is chosen between 1.0 and 1.4, so it's quite unlikely to pick a batch full of void. Which dataset and batch size do you use for training?

matteosodano commented 3 years ago

I was using the SUNRGBD dataset with default parameters (thus, batch_size = 8). It was the very first run I did with the code, so I did not modify anything. I suspected the cropping because it happened at a random epoch, so it should not be caused by a corrupted image or something similar.

TUI-NICR / ESANet

Loss is None while training #26