Hello @jbehley, do you have any suggestions as to what could be the issue?
I notice that the loss is updated differently in the train and validate functions. Could this be the issue?
Any progress on this?
Sorry for the long delay with the answer.
Is this your own data? If yes, you might want to increase the training set to prevent overfitting to the training data. Data augmentation could also help, like adding noise to the points.
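For what it's worth, here is a minimal sketch of the kind of point-noise augmentation suggested above, assuming the scan is an (N, 3) numpy array of xyz coordinates; the function name and sigma value are just placeholders, not something from the repo:

```python
import numpy as np

def jitter_points(points, sigma=0.01, clip=0.05):
    """Add clipped Gaussian noise to each coordinate of an (N, 3) xyz array."""
    noise = np.clip(sigma * np.random.randn(*points.shape), -clip, clip)
    return points + noise

# Example: perturb a scan before projecting it to the range image.
scan = np.random.rand(1000, 3).astype(np.float32)  # stand-in for a real LiDAR scan
augmented = jitter_points(scan)
```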
Hi @jbehley,
thanks for your reply.
No, this training was performed on the SemanticKITTI dataset. Code is also the same as in the git repo. I am also using the train/validation split as used by you.
I have no idea why the validation loss increases. Please help.
Any progress on this?
@kosmastsk did you also face a similar issue?
@NagarajDesai1 yes, that's the behavior I am seeing while training a model with both the KITTI dataset and my own data. I still haven't reached a high number of epochs to see if that will change. I am also using the default Darknet53 configuration.
hmmm, I don't know if this behavior is "normal" since I did not train the network. What is good is that the validation IoU and accuracy are still increasing (even though the validation loss is not decreasing, which might have different reasons). The decrease in the training loss seems normal.
Would be nice if you could provide feedback with more epochs. As I said, that IoU and accuracy are still increasing seems good. I would also have a look at the actual validation set results (I think there are some images plotted), which can be activated via a flag.
But maybe something with the logging of the validation loss is off...
The images from the validation set results suggest the segmentation is good for an accuracy of around 83%. But I will report again when the training process finishes.
Any update on this?
From my side, as far as I remember, that issue still existed even after several epochs, but I do not think it directly affected the training process. Probably it is something related to the logging of the validation loss, as @jbehley mentioned, but I didn't look into it further.
Actually, I also ran into this issue when re-training DarkNet and SalsaNext here.
I think it is a problem caused by Cross-Entropy Loss.
Think about Matrix_A before the argmax function; its size is (h, w, class_num). After applying the argmax function to Matrix_A, we get a new Matrix_B of size (h, w, 1).
The increasing IoU means that Matrix_B becomes more accurate. However, the cross-entropy loss is computed on Matrix_A.
For example, suppose there are two pixels and the ground truth is [2, 0].
At the beginning, Matrix_A is [[0.05, 0.05, 0.90], [0.30, 0.30, 0.40]]; after applying the argmax function we get the prediction [2, 2], a 50% accuracy rate, with a mean cross-entropy of about 0.65.
As training continues, Matrix_A becomes [[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]]; after applying the argmax function the prediction is [2, 0], so the accuracy is higher (100%), but the mean cross-entropy rises to about 0.92 because the correct classes are now predicted with lower confidence.
So it seems that, as training progressed, the predictions tended to average out across all categories. But I still don't know how to solve it. If anyone has a suggestion about it, I would be very grateful.
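To make the arithmetic concrete, here is a small PyTorch sketch (numbers chosen purely for illustration) that computes the cross-entropy on the probabilities and the accuracy on the argmax for the two-pixel example above:

```python
import torch
import torch.nn.functional as F

target = torch.tensor([2, 0])  # ground-truth classes for two pixels

# Per-pixel class probabilities (rows of "Matrix_A") before and after more training.
before = torch.tensor([[0.05, 0.05, 0.90],
                       [0.30, 0.30, 0.40]])
after = torch.tensor([[0.30, 0.30, 0.40],
                      [0.40, 0.30, 0.30]])

for name, probs in [("before", before), ("after", after)]:
    # Cross-entropy on the probabilities (NLL of their log), as used for training.
    loss = F.nll_loss(torch.log(probs), target)
    # Accuracy on the argmax ("Matrix_B"), as used for the reported metrics.
    acc = (probs.argmax(dim=1) == target).float().mean()
    print(f"{name}: loss={loss.item():.3f}, accuracy={acc.item():.0%}")

# before: loss=0.655, accuracy=50%
# after:  loss=0.916, accuracy=100%
```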
I found some logs I wrote last year.
The training set is somewhat different from the original setting: it uses the point cloud of the [-90°, 90°] front view.
It can be seen that this is a common issue for at least SalsaNext, SqueezeSegv3-21, RangeNet21++, and RangeNet53++.
Only the val-loss curves remain; I lost the val-mIoU logs.
Sorry for the late reply.
You should also look at the mIoU; I would suggest computing it after each epoch. The distribution of the validation set is different, which leads to different behavior in loss space. When the mIoU on the validation set is not decreasing, everything is fine.
Maybe also plot the IoU per class and log some (fixed) images with the semantics; these are usually much better for detecting when something goes wrong. Usually the larger classes get better first, then the smaller ones...
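As an illustration of the per-class IoU suggestion, here is a minimal sketch that accumulates a confusion matrix over the validation set and derives the IoU per class from it; the tensor names and class count are assumptions, not the repo's actual evaluator:

```python
import torch

num_classes = 20  # e.g. the SemanticKITTI single-scan setup

def per_class_iou(conf):
    """IoU per class from a (C, C) confusion matrix with rows = ground truth."""
    tp = conf.diag().float()
    fp = conf.sum(dim=0).float() - tp
    fn = conf.sum(dim=1).float() - tp
    # Classes absent from both pred and gt get IoU 0 here; mask them for mIoU if needed.
    return tp / (tp + fp + fn).clamp(min=1)

def accumulate(conf, pred, gt):
    """Add one batch of flattened prediction/ground-truth labels to the confusion matrix."""
    idx = gt * num_classes + pred
    conf += torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf

conf = torch.zeros(num_classes, num_classes, dtype=torch.long)
# for pred, gt in validation_batches: conf = accumulate(conf, pred, gt)
# iou = per_class_iou(conf); miou = iou.mean()
```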
From my experience, CE loss is not a good proxy for performance in the end.
Hi, @jbehley ,
Thanks for your reply.
Yes, although the val loss is increasing, the val mIoU increases at the same time, and the mIoU curve flattens out towards the end.
I also noticed that CE loss is not a good choice, so in our work we use a multi-class focal loss, which brings a performance increase.
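For reference, a minimal sketch of a multi-class focal loss in PyTorch; the gamma value and the lack of class weights are assumptions here, and this is not necessarily the exact formulation used in the work mentioned above:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Multi-class focal loss: per-pixel cross-entropy scaled by (1 - p_t)^gamma.

    logits: (N, C) raw scores, target: (N,) class indices.
    """
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)                    # probability assigned to the true class
    return (((1.0 - pt) ** gamma) * ce).mean()

# Example: logits from a flattened (H*W, C) range-image prediction.
logits = torch.randn(6, 20)
target = torch.randint(0, 20, (6,))
print(focal_loss(logits, target))
```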
Hello @tano297, I am trying to train the network on an NVIDIA Titan RTX 24 GB GPU. I find that both val_loss and val_acc are increasing, while train_loss is decreasing and train_acc is increasing as expected.
I have used darknet53 config without any changes.
Please suggest what could be the reason for this behavior.