Training on my own dataset but get loss nan

explainingai-code / FasterRCNN-PyTorch

This repo implements simple Faster RCNN model in PyTorch with all the essential components.


Training on my own dataset but get loss nan #1

Open joanChen0212 opened 3 months ago

joanChen0212 commented 3 months ago

Hello, I have a question I’d like to ask. I tried training this model on my own dataset, but the loss often becomes NaN after 0-4 epochs. I’ve tried reducing the learning rate and applying gradient clipping, but neither resolves the issue. Could you please offer me some advice? Thank you.

explainingai-code commented 3 months ago

Hi @joanChen0212, can you let me know whether you were using train.py or train_torchvision_frcnn.py for your training? If you haven't tried train_torchvision_frcnn.py yet, could you please give it a try and let me know whether you see the NaN problem there as well for your dataset?

DINHQuangDung1999 commented 3 months ago

Hi, I think you should check the targets after they have been normalized and rescaled. I ran into the same situation when using train.py to train on DOTA/DIOR, and the cause was target boxes with identical x_max/x_min or y_max/y_min, i.e. boxes with zero width or zero height. Here is a snippet which I used to tackle the issue:

Sorry I do not know how to put it in a nicer format
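The snippet itself is not preserved in this copy of the thread. As a rough illustration of the kind of check described above, and not the commenter's actual code, here is a minimal sketch that drops degenerate boxes before they reach the model. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples in pixel coordinates; the function names and the `min_size` threshold are hypothetical. Zero-size boxes are a plausible source of NaN because Faster R-CNN regression targets are usually computed from the log of the box width and height.

```python
import math


def is_valid_box(box, min_size=1.0):
    """Return True if the box is finite and at least min_size wide and tall.

    Assumes box = (x_min, y_min, x_max, y_max); min_size is a hypothetical
    threshold in the same units as the coordinates.
    """
    if not all(math.isfinite(v) for v in box):
        return False
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) >= min_size and (y_max - y_min) >= min_size


def filter_degenerate_boxes(boxes, labels, min_size=1.0):
    """Keep only the (box, label) pairs whose box passes is_valid_box."""
    kept = [(b, l) for b, l in zip(boxes, labels) if is_valid_box(b, min_size)]
    kept_boxes = [b for b, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_boxes, kept_labels


# Example: the second box has zero width, the third zero height,
# and the fourth contains a NaN coordinate.
boxes = [(10, 10, 50, 60), (20, 30, 20, 80), (5, 5, 40, 5), (0, 0, float("nan"), 10)]
labels = [1, 2, 3, 4]
clean_boxes, clean_labels = filter_degenerate_boxes(boxes, labels)
```

A check like this would typically run in the dataset's loading code, after any resizing or normalization, since rounding during rescaling can collapse a thin box to zero width or height. An image left with no valid boxes may need to be skipped entirely.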

joanChen0212 commented 3 months ago

@explainingai-code Thank you for your reply. I encountered this issue while using train.py to train on the Cityscapes dataset. After some investigation, I found that the NaN values were caused by two images, aachen_000130_000019_leftImg8bit and aachen_000131_000019_leftImg8bit. After deleting these two files, the issue was resolved. Thank you very much for your outstanding work.

@DINHQuangDung1999 Thank you for your reply. I will check these two problematic data entries again to see whether they had the same issue. Thank you for sharing.
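The thread does not say how the two images were identified. One possible way to narrow it down, sketched here under assumptions, is to run the training data through the model one image at a time, record each image's scalar loss (for example via `loss.item()` in PyTorch), and list the images whose loss is not finite. The helper name and the image names below are hypothetical.

```python
import math


def find_non_finite_losses(per_image_losses):
    """Return the names of images whose recorded loss is NaN or infinite.

    per_image_losses is an iterable of (image_name, loss_value) pairs,
    where loss_value is a plain Python float.
    """
    return [name for name, loss in per_image_losses if not math.isfinite(loss)]


# Example with made-up values: two of the four images produce a bad loss.
recorded = [
    ("img_a", 0.83),
    ("img_b", float("nan")),
    ("img_c", 1.27),
    ("img_d", float("inf")),
]
bad_images = find_non_finite_losses(recorded)
```

Images flagged this way can then be inspected for degenerate or out-of-range annotations before deciding whether to fix or remove them.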