[Open] Astroooh opened this issue 2 years ago
Start Train
Epoch 51/300:  10%|▉ | 29/304 [00:25<02:24, 1.90it/s, loss=6.71, lr=0.00028]
/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [37,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
.......
/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [22,0,0], thread: [28,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [22,0,0], thread: [29,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [22,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [22,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "/home/dell/projects/bubbliiiingv2022.5.8/yolox-pytorch-main/train.py", line 508, in <module>
    fit_one_epoch(model_train, model, ema, yolo_loss, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir, local_rank)
  File "/home/dell/projects/bubbliiiingv2022.5.8/yolox-pytorch-main/utils/utils_fit.py", line 39, in fit_one_epoch
    loss_value = yolo_loss(outputs, targets)
  File "/home/dell/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dell/projects/bubbliiiingv2022.5.8/yolox-pytorch-main/nets/yolo_training.py", line 94, in forward
    return self.get_losses(x_shifts, y_shifts, expanded_strides, labels, torch.cat(outputs, 1))
  File "/home/dell/projects/bubbliiiingv2022.5.8/yolox-pytorch-main/nets/yolo_training.py", line 161, in get_losses
    gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg_img = self.get_assignments(
  File "/home/dell/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/dell/projects/bubbliiiingv2022.5.8/yolox-pytorch-main/nets/yolo_training.py", line 231, in get_assignments
    num_fg, gt_matched_classes, pred_ious_this_matching, matched_gt_inds = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)
  File "/home/dell/projects/bubbliiiingv2022.5.8/yolox-pytorch-main/nets/yolo_training.py", line 348, in dynamic_k_matching
    _, pos_idx = torch.topk(cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Epoch 51/300:  10%|▉ | 29/304 [00:26<04:12, 1.09it/s, loss=6.71, lr=0.00028]
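Note that `input_val >= zero && input_val <= one` appears to be the input-range assert in PyTorch's binary cross-entropy CUDA kernel (Loss.cu), and because CUDA errors are reported asynchronously the traceback may point at a later call such as `torch.topk`. The log's own suggestion is to rerun with CUDA_LAUNCH_BLOCKING=1 so the assert surfaces at the real call site. A minimal sketch, assuming training is launched via `python train.py`, is to set the variable before anything touches CUDA:

```python
# Sketch: force synchronous CUDA kernel launches so the device-side assert is
# reported at the call that actually triggered it. Must run before CUDA is
# initialized, e.g. at the very top of train.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and start training) only after the variable is set
```

Equivalently, launch training as `CUDA_LAUNCH_BLOCKING=1 python train.py`.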
Is the classes path correct?
No problem there. I was using the SGD optimizer before and training ran without this issue.
This problem is most likely a class index exceeding num_classes; you may have modified something yourself.
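One quick way to rule that out is to scan the training annotation file and confirm every class id lies in [0, num_classes). A minimal sketch, assuming the bubbliiiing-style annotation format where each line is `image_path x1,y1,x2,y2,cls x1,y1,x2,y2,cls ...`; the file name and num_classes below are placeholders to replace with your own values:

```python
# Sketch: flag any annotation whose class id falls outside [0, num_classes).
# The annotation path, line format and num_classes are assumptions - adjust
# them to your own setup.
num_classes = 20                    # value used in train.py / length of your classes file
annotation_path = "2007_train.txt"  # hypothetical training annotation file

with open(annotation_path, encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        for box in line.split()[1:]:          # boxes follow the image path
            cls_id = int(box.split(",")[4])   # x_min,y_min,x_max,y_max,class_id
            if not 0 <= cls_id < num_classes:
                print(f"line {line_no}: class id {cls_id} outside [0, {num_classes})")
```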
Did you manage to solve it?
Not solved yet.
0 0 That is surprising.
I'm running into the same problem; it appears after training for a while.
Personally, I think it is caused by the network's input becoming NaN at some point during training.
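If NaN values really are the trigger, one way to confirm it is to check the raw predictions for NaN/Inf right before the loss is computed, and optionally enable autograd anomaly detection. A minimal sketch, assuming `outputs` is the list of per-level prediction tensors that `fit_one_epoch` passes to `yolo_loss`:

```python
import torch

# Optional and slow - debugging only: report the op that produced NaN/Inf gradients.
torch.autograd.set_detect_anomaly(True)

def assert_finite(outputs, iteration):
    """Raise immediately if any prediction tensor contains NaN or Inf."""
    for level, out in enumerate(outputs):
        if not torch.isfinite(out).all():
            raise RuntimeError(
                f"non-finite predictions in output level {level} at iteration {iteration}"
            )

# Hypothetical usage inside the training loop, just before
# `loss_value = yolo_loss(outputs, targets)`:
#     assert_finite(outputs, iteration)
```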