bubbliiiing / yolact-pytorch

This is a yolact-pytorch repository that can be used to train on your own dataset.
MIT License

Error when training on the COCO dataset #5

Closed Moris-Zhan closed 2 years ago

Moris-Zhan commented 2 years ago

```
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Loss.cu:111: block: [40,0,0], thread: [48,0,0] Assertion `input_val >= zero && input_val <= one` failed.
(the same assertion is repeated for threads [49,0,0] through [63,0,0])

Epoch 1/50:  64%|██████████████████████████████████████▋ | 9308/14658 [52:18<30:03, 2.97it/s, lr=0.0001, total_loss=nan]
Traceback (most recent call last):
  File "train.py", line 192, in <module>
    epoch_step, epoch_step_val, gen, gen_val, end_epoch, Cuda)
  File "D:\WorkSpace\JupyterWorkSpace\yolact-pytorch-main\utils\utils_fit.py", line 34, in fit_one_epoch
    losses = multi_loss(outputs, targets, masks_gt, num_crowds)
  File "C:\Users\Leyan\anaconda3\envs\tensorflow\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\WorkSpace\JupyterWorkSpace\yolact-pytorch-main\nets\yolact_training.py", line 176, in forward
    losses['M'] = self.lincomb_mask_loss(positive_bool, pred_masks, pred_proto, mask_gt, anchor_max_box, anchor_max_index)
  File "D:\WorkSpace\JupyterWorkSpace\yolact-pytorch-main\nets\yolact_training.py", line 280, in lincomb_mask_loss
    pos_coef = pred_masks[i, positive_bool[i]]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
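
For context: the assertion at Loss.cu:111 comes from the CUDA kernel behind `binary_cross_entropy` when an input value falls outside [0, 1] (a NaN also fails that check), which matches the `total_loss=nan` shown in the progress bar. Below is a minimal sketch of how one might catch this on the host before the kernel fires; the helper name `check_bce_input` is made up for illustration and is not part of this repo.

```python
import os
# Surface CUDA errors at the failing call instead of a later, unrelated one.
# Must be set before CUDA is initialized, e.g. at the very top of train.py.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def check_bce_input(x: torch.Tensor, name: str = "pred") -> None:
    """Hypothetical debug helper: verify a tensor fed to binary_cross_entropy
    is finite and within [0, 1], raising on the host instead of a GPU assert."""
    bad = torch.isnan(x) | (x < 0) | (x > 1)
    if bad.any():
        raise ValueError(f"{name}: {int(bad.sum())} values are NaN or outside [0, 1]")

# Example: a NaN sneaks into a "probability" tensor.
probs = torch.tensor([0.2, float("nan"), 0.9])
try:
    check_bce_input(probs, "mask_probs")
except ValueError as e:
    print(e)   # mask_probs: 1 values are NaN or outside [0, 1]
```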

bubbliiiing commented 2 years ago

Your classes_path is probably set incorrectly.

Moris-Zhan commented 2 years ago

I'm using the original 'coco_classes.txt'.

Moris-Zhan commented 2 years ago

(image attached) After checking further, I found that NaN appears in P3. May I ask what could cause this?
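
For anyone hitting the same thing, a minimal way to pinpoint which feature level first goes NaN is to check each output with `torch.isnan` / `torch.isinf`; the names below (`report_nan`, `p3`, `p4`) are placeholders for illustration, not this repo's actual variables.

```python
import torch

def report_nan(tensors, names):
    """Placeholder debug helper: report which feature maps contain NaN or Inf."""
    for t, name in zip(tensors, names):
        n_nan = int(torch.isnan(t).sum())
        n_inf = int(torch.isinf(t).sum())
        if n_nan or n_inf:
            print(f"{name}: {n_nan} NaN, {n_inf} Inf values")

# Example with dummy feature maps (P3 deliberately contains a NaN):
p3 = torch.randn(1, 256, 69, 69)
p3[0, 0, 0, 0] = float("nan")
p4 = torch.randn(1, 256, 35, 35)
report_nan([p3, p4], ["P3", "P4"])   # prints: P3: 1 NaN, 0 Inf values
```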

bubbliiiing commented 2 years ago

I've tried it... it shouldn't be that problem, since I can train normally... Are you using COCO2017?

Moris-Zhan commented 2 years ago

Yes, but I also got the same result on COCO2014. It looks like the gradients explode during training, and the cause is that the loss became NaN in the previous training step. (image attached)
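
Since the failure mode here is a non-finite loss propagating into the weights, one generic mitigation (not something this repo does by default) is to skip steps with a NaN/Inf loss and clip the gradient norm before the optimizer step. A minimal sketch with a stand-in model:

```python
import torch

model = torch.nn.Linear(8, 2)                      # stand-in for the YOLACT network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

x, y = torch.randn(4, 8), torch.randn(4, 2)
loss = torch.nn.functional.mse_loss(model(x), y)   # stand-in for the summed YOLACT losses

# Skip the step entirely if the loss is already NaN/Inf so it cannot corrupt
# the weights, and clip the global gradient norm to tame exploding gradients.
if torch.isfinite(loss):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
else:
    print("non-finite loss, skipping this update")
```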

bubbliiiing commented 2 years ago

Maybe try adjusting the learning rate 0 0. My weights were converted from another format and were never fully trained by me.

Moris-Zhan commented 2 years ago

Then maybe something went wrong in the format conversion. Or could I take a look at your environment? Because I also ran into inf in bbox_loss. (image attached)

bubbliiiing commented 2 years ago

0 0 What format did you convert...?

bubbliiiing commented 2 years ago

The main thing is that my computer can't handle training on COCO... so I haven't tried it. Small datasets work fine... so I never looked into it.

Moris-Zhan commented 2 years ago

I later tracked down two problems on my side. One is the inf in bbox_loss, which I fixed with an eps term. The other is the NaN in lincomb_mask_loss; adding an eps after center_size to avoid division by zero fixed it.
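
A rough sketch of the two guards described above, assuming a simplified box encoding without variance scaling; `EPS`, `encode_boxes`, and the commented normalization lines are illustrative, not the repo's exact code.

```python
import torch

EPS = 1e-6  # small constant; the exact value is a judgment call

def center_size(boxes: torch.Tensor) -> torch.Tensor:
    """Convert (x1, y1, x2, y2) boxes to (cx, cy, w, h)."""
    return torch.cat(((boxes[:, :2] + boxes[:, 2:]) / 2,    # cx, cy
                      boxes[:, 2:] - boxes[:, :2]), dim=1)  # w, h

def encode_boxes(gt_boxes: torch.Tensor, anchors_cwh: torch.Tensor) -> torch.Tensor:
    """Eps-guarded encoding (simplified): adding EPS before the division and
    the log keeps bbox_loss from producing inf/NaN on degenerate boxes with
    zero width or height."""
    gt_cwh = center_size(gt_boxes)
    g_cxcy = (gt_cwh[:, :2] - anchors_cwh[:, :2]) / (anchors_cwh[:, 2:] + EPS)
    g_wh = torch.log((gt_cwh[:, 2:] + EPS) / (anchors_cwh[:, 2:] + EPS))
    return torch.cat((g_cxcy, g_wh), dim=1)

# In lincomb_mask_loss, the per-instance loss is normalized by the matched box's
# width and height taken from center_size; adding EPS there avoids the division
# by zero that produced the NaN, along the lines of:
#   gt_wh = center_size(anchor_max_box[i])[:, 2:] + EPS
#   mask_loss = mask_loss.sum(dim=(0, 1)) / gt_wh[:, 0] / gt_wh[:, 1]

# Tiny usage example with one degenerate ground-truth box (zero width):
gt = torch.tensor([[0.1, 0.1, 0.1, 0.3]])        # x1 == x2
anchors = torch.tensor([[0.1, 0.2, 0.2, 0.2]])   # (cx, cy, w, h)
print(encode_boxes(gt, anchors))                 # finite values, no inf from log(0)
```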

bubbliiiing commented 2 years ago

I'll take a look once I'm done with this busy stretch~~

bubbliiiing commented 2 years ago

Two eps terms, right?

Moris-Zhan commented 2 years ago

Yes.