upsx opened this issue 2 years ago (status: Open)
I have the same issue.
Try using a smaller lr; it may be effective.
#813 says it is because of NaN values in the output, so reducing the lr might work. Can NaN values happen because of a too-small lr?
A large lr sometimes results in NaN values, so you can reduce the lr to solve the problem.
My problem is the same as #813, but the error logs don't print fully, which misled me into focusing on the dynamic_k_matching() function.
Thanks. I added some checks with torch.isnan(x).sum().item() and now I see that pair_wise_ious_loss, cls_preds, obj_preds, and bboxes_preds_per_image were NaN. self.basic_lr_per_img = 0.001 / 64.0 while my batch size is 1. I'll try lower batch sizes.
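For reference, a minimal sketch of that kind of check (the helper name and the exact placement inside get_assignments are assumptions, not the actual YOLOX code):

```python
import torch

def assert_finite(name: str, x: torch.Tensor) -> None:
    # Hypothetical helper: counts NaNs the same way as above, but raises
    # early with a readable message instead of letting a later CUDA kernel
    # hit a device-side assert.
    nan_count = torch.isnan(x).sum().item()
    if nan_count > 0:
        raise ValueError(f"{name} contains {nan_count} NaN values")

# Example placement inside get_assignments (an assumption):
# assert_finite("pair_wise_ious_loss", pair_wise_ious_loss)
# assert_finite("cls_preds", cls_preds)
# assert_finite("obj_preds", obj_preds)
# assert_finite("bboxes_preds_per_image", bboxes_preds_per_image)
```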
The final lr = self.basic_lr_per_img * batch_size, so your lr is already small. Changing the model's head layers may also lead to a NaN loss, so you could look for other causes.
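For concreteness, a sketch of the lr scaling described above, using the values quoted in this thread (the variable names are illustrative, not the exact exp-config code):

```python
# Linear lr scaling as described above: the final lr grows with batch size.
basic_lr_per_img = 0.001 / 64.0   # value quoted in the comment above
batch_size = 1                    # the reporter's batch size

lr = basic_lr_per_img * batch_size
print(lr)  # 1.5625e-05 -- already very small at batch size 1
```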
log:
```
2021-11-20 20:33:42.464 | ERROR | yolox.core.launch:_distributed_worker:219 - An error has been caught in function '_distributed_worker', process 'MainProcess' (38881), thread 'MainThread' (140370955055296):
Traceback (most recent call last):
  File "/home/shuxin/tracking/ByteTrack/yolox/models/yolo_head_de_nostem_addse.py", line 362, in get_losses
    ) = self.get_assignments(  # noqa
  File "/home/shuxin/miniconda3/envs/fairmot/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/shuxin/tracking/ByteTrack/yolox/models/yolo_head_de_nostem_addse.py", line 599, in get_assignments
    ) = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)
  File "/home/shuxin/tracking/ByteTrack/yolox/models/yolo_head_de_nostem_addse.py", line 718, in dynamic_k_matching
    cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False
RuntimeError: CUDA error: device-side assert triggered
During handling of the above exception, another exception occurred: .......
```
(loguru's inline variable annotations trimmed; at the failure, num_gt was 166 and gt_idx was 0)
problem:
I train the yolox_s network on my dataset with input_size=(608, 1088). The above error in yolo_head.py -> dynamic_k_matching() always happens, and I have no idea how to solve it. Can anyone help me? Thanks @FateScript
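A general way to localize a device-side assert like the one in the log (a standard PyTorch debugging technique, not something confirmed for this specific issue): force synchronous kernel launches so the error is raised at the real call site, e.g. the topk call in dynamic_k_matching, instead of at a later unrelated op.

```python
import os

# Must be set before torch initializes CUDA (i.e. before the first CUDA call).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # subsequent CUDA errors now point at the actual failing line
```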