bubbliiiing / yolov7-pytorch

This is a YOLOv7 library that can be used to train on your own datasets.
GNU General Public License v3.0
861 stars 150 forks

Error after training for a few epochs #72

Open weitaoO0 opened 1 year ago

weitaoO0 commented 1 year ago

The model trains fine at first, but after 4 or 5 epochs the following happens. What could be the cause?

| 0/62 [00:00<?, ?it/s<class 'dict'>
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
  File "c:/Users/Admin/Desktop/yolov7/train_all_snr.py", line 776, in <module>
    train(SNR,0,50)
  File "c:/Users/Admin/Desktop/yolov7/train_all_snr.py", line 557, in train
    fit_one_epoch(model_train, model, ema, yolo_loss, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir, local_rank)
  File "c:\Users\Admin\Desktop\yolov7\utils\utils_fit.py", line 93, in fit_one_epoch
    loss_value = yolo_loss(outputs, targets, images)
  File "c:\Users\Admin\Desktop\yolov7\nets\yolo_training.py", line 103, in __call__
    bs, as_, gjs, gis, targets, anchors = self.build_targets(predictions, targets, imgs)
  File "c:\Users\Admin\Desktop\yolov7\nets\yolo_training.py", line 377, in build_targets
    if (anchor_matching_gt > 1).sum() > 0:
RuntimeError: CUDA error: device-side assert triggered
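The assertion in the log above is the GPU-side bounds check on an index: every index into a dimension of length `size` must satisfy `-size <= index < size`. Because CUDA errors are reported asynchronously, the Python line named in the traceback is not necessarily the one that produced the bad index; running with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or on the CPU, usually gives a more precise location. As a minimal sketch in plain Python, the same condition can be checked before indexing. The helper name `find_out_of_bounds` and the sample values are hypothetical and not part of the repository.

```python
def find_out_of_bounds(indices, size):
    """Return (position, value) pairs that violate -size <= index < size.

    This mirrors the condition in the CUDA assertion; it is an
    illustrative helper, not code from yolov7-pytorch.
    """
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not (-size <= idx < size)
    ]


# Hypothetical indices into a dimension of length 3.
bad = find_out_of_bounds([0, 2, -3, 3, -4], size=3)
print(bad)  # [(3, 3), (4, -4)]
```

An empty result means every index is valid for that dimension; a non-empty one shows which entries would trigger the assertion.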

bubbliiiing commented 1 year ago

I'll look into it again. I recall having dealt with this before...

bubbliiiing commented 1 year ago

Do you have a screenshot of the exact error? I don't know how to reproduce it on my side.

weitaoO0 commented 1 year ago

This is the piece of code that fails. This screenshot is all I have, because the error does not occur every time. image

weitaoO0 commented 1 year ago

image

weitaoO0 commented 1 year ago

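I suspect the problem is in this line of code image I added a small constant inside the logarithm and the program stopped raising the error, but the loss computed during training suddenly became very large image Why is that?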
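One possible explanation, offered as an assumption because the line in question is only visible in the screenshot: if the argument of the logarithm reaches 0, the logarithm is negative infinity, and adding a small constant turns that into a finite value of large magnitude, which can appear as a sudden jump in the loss. A minimal sketch in plain Python; the function name `safe_neg_log` and `eps = 1e-7` are illustrative choices, not the values used in the thread.

```python
import math


def safe_neg_log(p, eps=1e-7):
    """Negative log of p with a small constant added to avoid log(0).

    Illustrative only: the actual loss term in the thread is not shown.
    Note that math.log(0.0) raises ValueError, whereas torch.log of a
    zero tensor returns -inf.
    """
    return -math.log(p + eps)


for p in (0.5, 1e-3, 0.0):
    # The closer p is to 0, the larger the resulting term becomes.
    print(p, safe_neg_log(p))
```

With these values the term for `p = 0.0` is more than twenty times the term for `p = 0.5`, so a few such entries can dominate an averaged loss.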

bubbliiiing commented 1 year ago

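Could you share the dataset? I'd like to train on it myself, because quite a few people have run into this problem and I don't know the cause.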

bubbliiiing commented 1 year ago

Did you enable fp16?

weitaoO0 commented 1 year ago

Did you enable fp16?

No.

weitaoO0 commented 1 year ago

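My problem is resolved for now. The network sometimes outputs NaN and inf values during validation. Strangely, this only happens during validation, never during training, and even then only occasionally.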
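The thread does not say how the non-finite outputs were handled. One way to detect them before they reach the loss is a finiteness check, sketched here in plain Python. The function name `all_finite` and the nested-list input are assumptions for illustration; with PyTorch tensors, `torch.isfinite` serves the same purpose.

```python
import math


def all_finite(values):
    """Return True if every number in a (possibly nested) list is finite."""
    for v in values:
        if isinstance(v, (list, tuple)):
            if not all_finite(v):
                return False
        elif not math.isfinite(v):
            # Catches NaN as well as +inf and -inf.
            return False
    return True


# Hypothetical validation outputs containing one NaN.
outputs = [[0.1, 2.0], [float("nan"), 1.0]]
if not all_finite(outputs):
    print("non-finite output detected; skipping this batch")
```

Skipping or logging the affected batch is only one option; it hides the symptom and does not explain why the network produces such values during validation.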

bubbliiiing commented 1 year ago

This one... sigh, trying to solve this problem myself is really frustrating.

bubbliiiing commented 1 year ago

I can't reproduce it.