Megvii-BaseDetection / YOLOX

YOLOX is a high-performance anchor-free YOLO, exceeding YOLOv3~v5, with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO support. Documentation: https://yolox.readthedocs.io/
Apache License 2.0

an error in yolo_head.py -> def dynamic_k_matching() #927

Open upsx opened 2 years ago

upsx commented 2 years ago

log:

2021-11-20 20:33:42.464 | ERROR | yolox.core.launch:_distributed_worker:219 - An error has been caught in function '_distributed_worker', process 'MainProcess' (38881), thread 'MainThread' (140370955055296):

Traceback (most recent call last):

  File "/home/shuxin/tracking/ByteTrack/yolox/models/yolo_head_de_nostem_addse.py", line 362, in get_losses
    ) = self.get_assignments(  # noqa

  File "/home/shuxin/miniconda3/envs/fairmot/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)

  File "/home/shuxin/tracking/ByteTrack/yolox/models/yolo_head_de_nostem_addse.py", line 599, in get_assignments
    ) = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)

  File "/home/shuxin/tracking/ByteTrack/yolox/models/yolo_head_de_nostem_addse.py", line 718, in dynamic_k_matching
    cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False

RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred: .......

problem:

I train the yolox_s network on my dataset with input_size=(608, 1088), and the above error in yolo_head.py -> dynamic_k_matching() always happens. I have no idea how to solve it. Can anyone help me? Thanks @FateScript
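As a side note, a NaN-filled cost matrix is one common way to end up with a device-side assert at the topk call shown in the traceback. Below is a minimal, hypothetical guard (not part of YOLOX) that would surface that condition as a readable Python error at the same call site:

```python
import torch

def checked_topk(cost_row, k):
    # Hypothetical guard mirroring the failing call in dynamic_k_matching():
    # raise a clear Python error when the cost values are NaN or k is invalid,
    # instead of an opaque CUDA device-side assert.
    if torch.isnan(cost_row).any():
        raise ValueError(
            f"cost row contains NaN (shape={tuple(cost_row.shape)}); "
            "the predictions feeding the loss are probably NaN"
        )
    if k < 1 or k > cost_row.numel():
        raise ValueError(f"invalid k={k} for a cost row of size {cost_row.numel()}")
    return torch.topk(cost_row, k=k, largest=False)

# Illustrative call site, matching the line in the traceback:
# _, pos_idx = checked_topk(cost[gt_idx], dynamic_ks[gt_idx].item())
```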

mahdiabdollahpour commented 2 years ago

I have the same issue.

upsx commented 2 years ago

Try using a smaller lr; it may be effective.

mahdiabdollahpour commented 2 years ago

#813 says it's because of NaN values in the output, so reducing the lr might work. Can NaN values also happen because of a too-small lr?

upsx commented 2 years ago

A large lr sometimes results in NaN values, so you can reduce the lr to solve the problem.

upsx commented 2 years ago

My problem is the same as #813, but the error logs don't print in full, which misled me into focusing on the dynamic_k_matching() function.
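For reference, one reason the traceback points at the wrong place is that CUDA kernels run asynchronously, so a device-side assert is often reported from a later, unrelated Python frame. A standard PyTorch/CUDA debugging switch (not YOLOX-specific) is to force synchronous launches so the error surfaces at the real failing op; a minimal sketch:

```python
import os

# Must be set before CUDA is initialized (i.e. before the first tensor is
# moved to the GPU), e.g. at the very top of the training entry point.
# Equivalently, export it in the shell when launching training:
#   CUDA_LAUNCH_BLOCKING=1 python tools/train.py ...
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```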

mahdiabdollahpour commented 2 years ago

Thanks. I added some if checks with torch.isnan(x).sum().item(), and now I see that pair_wise_ious_loss, cls_preds, obj_preds, and bboxes_preds_per_image were NaN. self.basic_lr_per_img = 0.001 / 64.0 while my batch size is 1. I'll try lower batch sizes.
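A minimal sketch of the kind of check described above (the helper name is made up; the tensor names follow the comment, and where exactly it is called inside get_assignments is up to you):

```python
import torch

def report_nan(name, tensor):
    # Count NaN entries and print a short warning; return the count so the
    # caller can decide whether to skip the sample or stop for debugging.
    n = torch.isnan(tensor).sum().item()
    if n:
        print(f"[NaN check] {name}: {n} NaN values, shape={tuple(tensor.shape)}")
    return n

# Illustrative usage before the matching step (names from the comment above):
# for name, t in [("cls_preds", cls_preds), ("obj_preds", obj_preds),
#                 ("bboxes_preds_per_image", bboxes_preds_per_image),
#                 ("pair_wise_ious_loss", pair_wise_ious_loss)]:
#     report_nan(name, t)
```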

upsx commented 2 years ago

The final lr = self.basic_lr_per_img * batch_size, so your lr is already small. Changing the model's head layers can also lead to NaN loss, so you may want to look for other causes.
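For anyone who does need to lower it, here is a sketch of overriding the per-image learning rate in a custom experiment file, assuming the standard YOLOX Exp pattern (the value is illustrative, not a recommendation):

```python
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        # The optimizer lr is basic_lr_per_img * batch_size, so halving this
        # value halves the effective lr at any batch size.
        # Illustrative value only; tune it for your own dataset.
        self.basic_lr_per_img = 0.0005 / 64.0
```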