Open joanChen0212 opened 3 months ago
Hi @joanChen0212,
Can you let me know whether you were using train.py or train_torchvision_frcnn.py for your training? If you haven't tried train_torchvision_frcnn.py yet, can you please also give that a try and let me know whether you see this nan problem with your dataset there as well.
Hi, I think you should check the targets after they are normalized and rescaled. I also ran into this when using train.py to train on DOTA/DIOR, and the cause was target boxes where x_max equals x_min or y_max equals y_min (zero width or height). Here is a snippet which I used to tackle the issue.
Sorry, I do not know how to put it in a nicer format.
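In essence it just drops degenerate boxes (and their labels) before they reach the loss. A rough sketch, assuming the boxes are an (N, 4) tensor in [x_min, y_min, x_max, y_max] format with a matching (N,) labels tensor (adapt the names to your dataset class):

```python
import torch

def filter_degenerate_boxes(boxes, labels):
    # boxes: (N, 4) tensor in [x_min, y_min, x_max, y_max] format,
    # taken *after* the normalization / rescaling step.
    # labels: (N,) tensor of class ids aligned with boxes.
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    # Zero-width/height boxes give log(0) box-regression targets, which
    # turn the loss into nan, so keep only strictly positive boxes.
    keep = (widths > 0) & (heights > 0)
    return boxes[keep], labels[keep]
```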
@explainingai-code Thank you for your reply. I encountered this issue while using train.py to train on the Cityscapes dataset. After some investigation, I found that the problem occurred with the images aachen_000130_000019_leftImg8bit and aachen_000131_000019_leftImg8bit, which caused NaN values to appear. After deleting these two files, the issue was resolved. Thank you very much for your outstanding work.
@DINHQuangDung1999 Thank you for your reply. I will check these two problematic data entries again to see if they were caused by the same issue. Thank you for sharing.
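To check, I plan to scan the dataset as training sees it and flag any sample whose boxes end up with zero or negative width or height. A rough sketch (the tuple unpacking and the 'bboxes' key are assumptions, adjust them to the actual dataset class):

```python
import torch

def find_degenerate_samples(dataset):
    # Walk the dataset exactly as training would (i.e. after any resize /
    # normalization done in __getitem__) and collect indices whose target
    # boxes have zero or negative width or height.
    bad_indices = []
    for idx in range(len(dataset)):
        sample = dataset[idx]
        target = sample[1]            # assumes (image, target, ...) ordering
        boxes = target['bboxes']      # assumes an (N, 4) [x1, y1, x2, y2] tensor
        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        if (widths <= 0).any() or (heights <= 0).any():
            bad_indices.append(idx)
    return bad_indices
```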
Hello, I have a question I'd like to ask. I am trying to train this on my own dataset, but the loss often starts showing nan after 0-4 epochs. I've tried reducing the learning rate and applying gradient clipping, but neither seems to resolve the issue. Could you please offer me some advice? Thank you.
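For reference, the gradient clipping I tried looks roughly like this; a sketch with hypothetical names for the model, optimizer, and loader (not my exact loop), assuming the model returns a dict of loss terms in training mode as the torchvision detection models do:

```python
import torch

# Hypothetical names: model, optimizer and train_loader stand in for
# whatever the training script builds.
for images, targets in train_loader:
    optimizer.zero_grad()
    losses = model(images, targets)     # dict of RPN / ROI head loss terms
    total_loss = sum(losses.values())
    total_loss.backward()
    # Rescale gradients so their global norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```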