PRBonn / lidar-bonnetal

Semantic and Instance Segmentation of LiDAR point clouds for autonomous driving
http://semantic-kitti.org
MIT License

Validation loss increases, while Validation accuracy is also increasing #64

Closed · NagarajDesai1 closed this issue 2 years ago

NagarajDesai1 commented 4 years ago

Hello @tano297, I am trying to train the network on an NVIDIA Titan RTX 24 GB GPU, and I find that both val_loss and val_acc are increasing, while train_loss is decreasing and train_acc is increasing as expected.

I have used the darknet53 config without any changes.

Please suggest what could be the reason for this behavior.

(Screenshots from 2020-09-17: training and validation loss/accuracy curves)

NagarajDesai1 commented 4 years ago

Hello @jbehley, do you have any suggestions as to what could be the issue?

I notice that the loss for training and validation is updated differently in the train and validate functions. Could this be the issue?

kosmastsk commented 3 years ago

Any progress on this?

jbehley commented 3 years ago

Sorry for the long delay in answering.

Is this your own data? If so, you might want to increase the training set to prevent overfitting to the training data. Data augmentation could also help, like adding noise to the points.
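A minimal sketch of what such point augmentation could look like (my own illustration, not the repo's augmentation code; `augment_points` and its parameters are hypothetical):

```python
# Simple point cloud augmentations: Gaussian jitter plus a random yaw rotation.
# This is a sketch, not the augmentation used in lidar-bonnetal.
import numpy as np

def augment_points(points, sigma=0.01, max_yaw=np.pi):
    """points: (N, 3) array of x, y, z coordinates."""
    # Add small Gaussian noise to every coordinate.
    jittered = points + np.random.normal(0.0, sigma, size=points.shape)

    # Rotate the whole scan by a random angle around the z-axis.
    yaw = np.random.uniform(-max_yaw, max_yaw)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return jittered @ rot.T

# Usage: augmented = augment_points(scan_xyz)  # scan_xyz is an (N, 3) array
```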

NagarajDesai1 commented 3 years ago

Hi @jbehley,

thanks for your reply.

No, this training was performed on the SemanticKITTI dataset. The code is also the same as in the git repo, and I am using the same train/validation split as you.

I have no idea why the validation loss increases. Please help.

NagarajDesai1 commented 3 years ago

Any progress on this?

@kosmastsk did you also face a similar issue?

kosmastsk commented 3 years ago

@NagarajDesai1 yes, that's the behavior I am seeing while training a model with both the KITTI dataset and my own data. I haven't reached a high number of epochs yet to see whether that will change. I am also using the default Darknet53 configuration.

jbehley commented 3 years ago

Hmm, I don't know if this behavior is "normal", since I did not train the network myself. What is good is that the validation IoU and accuracy are still increasing (though the validation loss is not decreasing, which might have different reasons). The decrease in the training loss seems normal.

It would be nice if you could provide feedback after more epochs. As I said, the fact that IoU and accuracy are still increasing seems good. I would also have a look at the actual validation set results (I think some images are plotted), which can be activated via a flag.

But maybe something with the logging of the validation loss is off...

kosmastsk commented 3 years ago

The images from the validation set look like the segmentation result is good, for an accuracy of around 83%. I will report again when the training process finishes.

NagarajDesai1 commented 3 years ago

Any update on this?

kosmastsk commented 3 years ago

From my side, as far as I remember, the issue still existed even after several epochs, but I do not think it directly affected the training process. It is probably something related to the logging of the validation loss, as @jbehley mentioned, but I didn't investigate this further.

iris0329 commented 3 years ago

Actually, I also met this issue when re-training DarkNet and SalsaNext here.

I think it is a consequence of how the cross-entropy loss is computed.

Think about the matrix A before the argmax function: its size is (h, w, class_num). After applying argmax to A along the class dimension, we get a new matrix B of size (h, w, 1).

The increasing IoU means that matrix B is becoming more accurate. However, the cross-entropy loss is computed on matrix A.

For example:

Suppose there are two pixels and the ground truth is [2, 0]. At the beginning, matrix A is [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]; after applying argmax, we get the prediction [2, 2], a 50% accuracy rate.

As training continues, matrix A becomes [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]; after argmax, the prediction is [2, 0], so the accuracy is now 100%, but the confidence on each correct class is only 0.5, so the cross-entropy stays comparatively high. If the correct-class probabilities flatten further, or a few remaining wrong pixels become confidently wrong, the loss can even increase while the argmax accuracy keeps improving.
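A minimal numeric sketch of this effect (my own made-up probabilities, not taken from the training code), where one pixel becoming confidently wrong raises the mean cross-entropy even though more pixels are classified correctly:

```python
# Cross-entropy is computed from the full probability vectors, while
# accuracy/IoU only look at the argmax, so the two can move in opposite directions.
import torch
import torch.nn.functional as F

def ce_and_acc(probs, target):
    # nll_loss expects log-probabilities, so take the log of the given probs.
    loss = F.nll_loss(torch.log(probs), target)
    acc = (probs.argmax(dim=1) == target).float().mean()
    return loss.item(), acc.item()

target = torch.tensor([2, 0, 1])  # ground-truth class per pixel

# Early in training: only the first pixel is predicted correctly.
early = torch.tensor([[0.10, 0.20, 0.70],
                      [0.30, 0.30, 0.40],
                      [0.40, 0.30, 0.30]])

# Later: a second pixel flips to correct, but the third becomes
# confidently wrong, so the mean cross-entropy goes up anyway.
late = torch.tensor([[0.10, 0.20, 0.70],
                     [0.40, 0.30, 0.30],
                     [0.95, 0.04, 0.01]])

print(ce_and_acc(early, target))  # ~ (0.92, 0.33)
print(ce_and_acc(late,  target))  # ~ (1.50, 0.67)
```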

So it seems that, as training progressed, the predicted probabilities tended to flatten out across the categories. I still don't know how to solve this; if anyone has a suggestion, I would be very grateful.

iris0329 commented 3 years ago

I found some logs I wrote last year.

The training setup is slightly different from the original setting; it uses only the points in the [-90°, 90°] front view.

It can be seen that this is a common issue for at least SalsaNext, SqueezeSegV3-21, RangeNet21++, and RangeNet53++.

(Plot: validation loss curves for the models listed above)

Only the val-loss curves remain; I lost the val-mIoU logs.

jbehley commented 3 years ago

Sorry for the late reply.

You should also look at the mIoU; I would suggest computing it after each epoch. The distribution of the validation set is different, which leads to different behavior in loss space. When the mIoU on the validation set is not decreasing, everything is fine.

Maybe also plot the IoU per class and log some (fixed) images with the semantics; these are usually much better for detecting if something goes wrong. Usually the larger classes get better first, then the smaller ones...

From my experience, CE loss is not a good proxy for the final performance.
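As a reference point for the per-class IoU suggestion, here is a minimal sketch of computing it from a confusion matrix (my own illustration, not the repo's actual evaluation code; `per_class_iou` is a hypothetical helper):

```python
# Per-class IoU = TP / (TP + FP + FN), computed from a confusion matrix.
import torch

def per_class_iou(pred, target, num_classes):
    """pred, target: flat 1-D tensors of class indices."""
    # Confusion matrix: rows = ground truth, cols = prediction.
    idx = target * num_classes + pred
    conf = torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

    tp = conf.diag().float()
    fp = conf.sum(dim=0).float() - tp   # predicted as c but actually something else
    fn = conf.sum(dim=1).float() - tp   # actually c but predicted as something else
    return tp / (tp + fp + fn + 1e-15)  # epsilon avoids division by zero; mIoU = mean over valid classes

# Example: 3 classes, a handful of "pixels"
pred   = torch.tensor([2, 0, 1, 1, 0])
target = torch.tensor([2, 0, 1, 0, 0])
print(per_class_iou(pred, target, num_classes=3))
```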

iris0329 commented 3 years ago

Hi, @jbehley ,

Thanks for your reply.

Yes, actually, although the val loss is increasing, the val mIoU is increasing at the same time, and the mIoU curve flattens out towards the end.

I also noticed that CE loss is not a good choice, so in our work we use a multi-class focal loss, and it brings a performance increase.
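A minimal sketch of a multi-class focal loss (my own illustration, not the exact loss used in that work; `focal_loss` and its defaults are hypothetical):

```python
# Focal loss down-weights well-classified pixels via the (1 - p_t)^gamma factor,
# so confident correct predictions contribute little to the loss.
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=None):
    """logits: (N, C), target: (N,) with class indices, alpha: optional (C,) class weights."""
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, target, weight=alpha, reduction="none")  # per-sample CE
    p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1).exp()     # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Usage
logits = torch.randn(5, 20)              # e.g. 20 classes
target = torch.randint(0, 20, (5,))
print(focal_loss(logits, target))
```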