FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

793761775 commented 1 year ago

Excuse me, I have set up the environment and the dataset. When I start training, it fails with "FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged." Can you please tell me how to solve this problem? Is it a learning rate issue?

Went-Liang commented 1 year ago

Could you please provide the training log? It may be caused by the start iteration of stage 2 or the setting of the sampling threshold.

793761775 commented 1 year ago

[Screenshot: 2023-07-28 11-33-09] Excuse me, this is the problem I had when running the code. The command I used was "python train_net.py --dataset-dir /home/yangyz/UnSniffer/VOC --num-gpus 1 --config-file VOC-Detection/faster-rcnn/UnSniffer.yaml --random-seed 0 --resume". I did not change the file "UnSniffer.yaml".

YH-2023 commented 1 year ago

@793761775 Did you manage to solve it?

rohit901 commented 1 year ago

I got the same error when training the model on the COCO dataset. It appeared at iteration 12k, which is the starting iteration of VOS.

balazon commented 12 months ago

This could be related: I got infs and NaNs when running inference on my custom images. The root cause is probably the box scaling and clipping, which produced predicted boxes with zero width or height.

predicted_boxes.scale(scale_x, scale_y)
predicted_boxes.clip(result.image_size)

https://github.com/Went-Liang/UnSniffer/blob/dad023f7089b4b63c33723ef2ca0860782185bb8/detection/inference/inference_utils.py#L149-L150

In my case, predicted_boxes contains coordinates that are either

1. already scaled, or
2. larger than the input image size for some reason.

I think the first case applies. That means we scale the boxes up a second time, which pushes their coordinates beyond the image boundaries. After clipping, both x (or y) coordinates of such a box collapse to the same value: the image width (or height). This later causes torch_ncut_detection's pairwise function to produce all-zero rows, and the reciprocal in _ncut_relabel converts those zeros to infs. d2 here contains those infs: https://github.com/Went-Liang/UnSniffer/blob/dad023f7089b4b63c33723ef2ca0860782185bb8/detection/inference/ncut_torch.py#L74

The matmul then produces NaNs, and torch.linalg.eig(A) crashes with: Intel MKL ERROR: Parameter 3 was incorrect on entry to DGEBAL.
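To make the failure chain concrete, here is a minimal pure-Python sketch of it. It does not reproduce the code in inference_utils.py or ncut_torch.py: the helper names (scale_box, clip_box, reciprocal), the image size, the box, and the scale factor are all made up for illustration, and the box width stands in for whatever quantity ends up zero in the pairwise step. It only assumes that clipping takes an image size in (height, width) order, as detectron2's Boxes.clip does.

```python
import math

def scale_box(box, scale_x, scale_y):
    """Scale an (x1, y1, x2, y2) box by per-axis factors."""
    x1, y1, x2, y2 = box
    return (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)

def clip_box(box, image_size):
    """Clamp a box to the image; image_size is (height, width)."""
    height, width = image_size
    x1, y1, x2, y2 = box

    def clamp(value, upper):
        return min(max(value, 0.0), upper)

    return (clamp(x1, width), clamp(y1, height),
            clamp(x2, width), clamp(y2, height))

def reciprocal(value):
    """Tensor-like reciprocal: 1/0 yields inf instead of raising."""
    return math.inf if value == 0 else 1.0 / value

# Hypothetical numbers: a 500x375 image and a box that is already
# expressed in output-image coordinates.
image_size = (375, 500)                # (height, width)
box = (420.0, 100.0, 480.0, 200.0)

rescaled = scale_box(box, 1.6, 1.6)    # scaled a second time by mistake
clipped = clip_box(rescaled, image_size)
box_width = clipped[2] - clipped[0]    # both x coordinates were clamped to 500

inv = reciprocal(box_width)            # zero becomes inf
product = inv * 0.0                    # inf * 0 is NaN, as in a matmul with zeros

print(clipped, box_width, inv, product)
```

Once a single NaN reaches the matrix passed to the eigen-decomposition, a LAPACK-level failure like the DGEBAL error above is a plausible outcome.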

A workaround for inference that worked for me was to turn off resizing by setting MIN_SIZE_TEST to 0 in UnSniffer.yaml, here: https://github.com/Went-Liang/UnSniffer/blob/dad023f7089b4b63c33723ef2ca0860782185bb8/detection/configs/VOC-Detection/faster-rcnn/UnSniffer.yaml#L17

It would be better, though, to skip the scaling when the output is already scaled. If it isn't already scaled, we probably need to filter out boxes with zero width or height.
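The filtering idea could look roughly like the sketch below. It is written in plain Python on lists of (x1, y1, x2, y2) tuples so that it runs without any dependencies; the function names and sample values are hypothetical and are not part of UnSniffer. In the real pipeline the same mask could probably be obtained from detectron2's Boxes.nonempty(threshold) and used to index the predictions before they reach the ncut step, but I have not tested that against this repository.

```python
def nonempty_mask(boxes, threshold=0.0):
    """One bool per box: True if both width and height exceed threshold."""
    return [
        (x2 - x1) > threshold and (y2 - y1) > threshold
        for x1, y1, x2, y2 in boxes
    ]

def filter_degenerate(boxes, scores, threshold=0.0):
    """Drop boxes with zero width or height, together with their scores."""
    keep = nonempty_mask(boxes, threshold)
    kept_boxes = [box for box, flag in zip(boxes, keep) if flag]
    kept_scores = [score for score, flag in zip(scores, keep) if flag]
    return kept_boxes, kept_scores

# Hypothetical predictions after clipping to a 500x375 image.
boxes = [
    (10.0, 10.0, 50.0, 60.0),      # valid box
    (500.0, 160.0, 500.0, 320.0),  # zero width after clipping
    (0.0, 375.0, 40.0, 375.0),     # zero height after clipping
]
scores = [0.9, 0.8, 0.7]

kept_boxes, kept_scores = filter_degenerate(boxes, scores)
print(kept_boxes, kept_scores)
```

A small positive threshold (for example one pixel) may be safer than 0.0 if near-degenerate boxes also cause numerical trouble.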
