yiningzeng closed this issue 4 years ago
That's not the end of training. The training has stopped because your model has diverged to NaN or infinite values.
Does that mean there is something wrong with my custom dataset?
I can only say there is something wrong in your training -- which is a combination of your dataset, your model and your configurations.
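One way to narrow down which of those is at fault is to catch the first loss term that goes non-finite. A minimal, generic PyTorch sketch (this is not detectron2's built-in check; `loss_dict`, `batched_inputs`, and `iteration` are illustrative names):

```python
import torch

def check_losses_finite(loss_dict, iteration):
    """Raise with a readable message as soon as any loss term goes NaN/Inf."""
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise FloatingPointError(
                f"Loss '{name}' became {value} at iteration {iteration}; "
                "check the input batch and consider lowering the learning rate."
            )

# Illustrative use inside a training loop:
#   loss_dict = model(batched_inputs)          # e.g. {"loss_cls": ..., "loss_box_reg": ...}
#   check_losses_finite(loss_dict, iteration)  # fail fast, before backward()
#   sum(loss_dict.values()).backward()
```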
Thanks, I will check.
I got the same error when training on the COCO2017 dataset. P.S. I have not modified any config or code.
I've also noticed the same issue when training out of the box for LVIS Instance Segmentation (specifically mask_rcnn_R_101_FPN_1x.yaml). The only modification I made was changing IMS_PER_BATCH from 16 -> 4.
> I got the same error when training on the COCO2017 dataset. P.S. I have not modified any config or code.

Hi @yiningzeng, could you reopen this issue?
> The only modification I made was changing IMS_PER_BATCH from 16 -> 4.
That definitely sounds like a modification that could lead to this issue.
> I got the same error when training on the COCO2017 dataset. P.S. I have not modified any config or code.
If you run into this issue with unmodified config and code, please include details following the issue template, with full command and full logs.
I would like to say that `assert torch.isfinite(deltas).all().item()` is very sensitive to hyperparameter changes such as learning rate and batch size. Halving the learning rate (0.02 -> 0.01) solved this problem for me.
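For anyone landing here: the comments above suggest the failure is tied to shrinking the batch without touching the learning rate. Below is a minimal sketch of adjusting both together with detectron2's config API, assuming the linear scaling rule; the config path and the batch size of 4 are illustrative values taken from this thread, not a recommendation:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Config path is illustrative; use the baseline you are actually training.
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml")
)

# The default recipe pairs IMS_PER_BATCH = 16 with BASE_LR = 0.02.
# When shrinking the batch (fewer GPUs / less memory), scale the LR down
# by the same factor instead of keeping the default.
new_ims_per_batch = 4  # value from the comments above
cfg.SOLVER.BASE_LR *= new_ims_per_batch / cfg.SOLVER.IMS_PER_BATCH
cfg.SOLVER.IMS_PER_BATCH = new_ims_per_batch
```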
> If you run into this issue with unmodified config and code, please include details following the issue template, with full command and full logs.

I did indeed change the number of GPUs from 8 to 4, which may have led to this error.
> I did indeed change the number of GPUs from 8 to 4, which may have led to this error.

I changed the datasets and set GPU = 5, IMS_PER_BATCH = 40 to run on 6 GPUs. It works normally.
Correct me if that is not the case, but it seems to me that everyone who encounters this error has made changes to the default training settings, and many have also fixed it after tuning the settings a bit more. This does not sound like a detectron2 problem, therefore closing.