chensnathan / YOLOF

You Only Look One-level Feature (YOLOF), CVPR2021, Detectron2
MIT License

RuntimeError: CUDA error: device-side assert triggered #25

Closed: lyon-v closed this issue 3 years ago

lyon-v commented 3 years ago

The problem: my dataset has only one category, so I changed the config and ran the code. Training sometimes runs for several iterations and then hits this error.

Traceback (most recent call last):
  File "./tools/w_train.py", line 270, in <module>
    args=(args,),
  File "/home/wuliang/cvprojects/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "./tools/w_train.py", line 257, in main
    return trainer.train()
  File "/home/wuliang/cvprojects/detectron2/detectron2/engine/defaults.py", line 485, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/wuliang/cvprojects/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/wuliang/cvprojects/detectron2/detectron2/engine/defaults.py", line 495, in run_step
    self._trainer.run_step()
  File "/home/wuliang/cvprojects/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/wuliang/anaconda3/envs/pyt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wuliang/cvprojects/YOLOF/yolof/modeling/yolof.py", line 295, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/wuliang/cvprojects/YOLOF/yolof/modeling/yolof.py", line 404, in losses
    pred_class_logits[valid_idxs],
RuntimeError: CUDA error: device-side assert triggered

lyon-v commented 3 years ago

I inspected gt_label: its maximum value was 4294967295. I don't know what caused this.
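A device-side assert raised from an indexing or loss call usually means an index or class label was out of range, which fits a label of 4294967295. A minimal sketch of a range check that could be run on the labels just before the loss is computed follows. The helper name `find_bad_labels` is hypothetical, and the valid range is an assumption (foreground classes `0..num_classes-1`, `num_classes` as background, `-1` as ignore), not something confirmed in this thread.

```python
import torch


def find_bad_labels(gt_classes, num_classes, ignore_value=-1):
    """Return positions whose label is neither in [0, num_classes] nor ignore_value.

    Assumption: num_classes itself is a valid (background) label and
    ignore_value marks anchors that are skipped by the loss.
    """
    # Move to CPU so the check itself cannot trigger a device-side assert.
    labels = gt_classes.detach().cpu()
    in_range = (labels >= 0) & (labels <= num_classes)
    ok = in_range | (labels == ignore_value)
    return torch.nonzero(~ok).flatten().tolist()


# Single-class example containing the corrupted value reported above.
labels = torch.tensor([0, 1, -1, 4294967295], dtype=torch.int64)
print(find_bad_labels(labels, num_classes=1))
```

Calling this right before the failing line and logging the returned positions would show which anchors carry a corrupted label, rather than failing later inside the CUDA kernel.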

chensnathan commented 3 years ago

This issue occurs when using cuda9. We recommend using cuda10 for training.

lyon-v commented 3 years ago

But I'm using cuda10.2 for training

chensnathan commented 3 years ago

Does this error occur every time or only randomly? And what have you modified?

lyon-v commented 3 years ago

This error occurs randomly. In "YOLOF/yolof/modeling/yolof.py", around line 404, there is the statement 'gt_classes[src_idx] = target_classes_o'. I printed information about 'gt_classes' and 'target_classes_o'. When the error occurs, 'gt_classes.dtype' is None, but the dtype of 'target_classes_o' is fine (int64). I don't know why; it looks like a numeric overflow, because 4294967295 is 2^32 - 1, the largest unsigned 32-bit value. Only 4294967295 and -4294967295 appear here.
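As a quick check of the arithmetic, 4294967295 equals 2^32 - 1, which is also what a signed 32-bit -1 becomes when its bytes are reinterpreted as an unsigned 32-bit integer. The sketch below only illustrates that relationship using the standard library. Whether such a reinterpretation actually happens in this training run is an assumption, and it would not by itself explain the value -4294967295.

```python
import struct

UINT32_MAX = 2**32 - 1
print(UINT32_MAX)

# Pack -1 as a signed 32-bit integer, then read the same four bytes
# back as an unsigned 32-bit integer.
reinterpreted = struct.unpack("<I", struct.pack("<i", -1))[0]
print(reinterpreted == UINT32_MAX)
```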

lyon-v commented 3 years ago

The only thing I modified is the config value _C.MODEL.YOLOF.DECODER.NUM_CLASSES. My workaround is:

    gt_classes[src_idx] = torch.where(gt_classes[src_idx] == 4294967295, 0, gt_classes[src_idx])
    gt_classes[src_idx] = torch.where(gt_classes[src_idx] == -4294967295, -1, gt_classes[src_idx])
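The same replacement can be sketched as a standalone function using boolean-mask assignment, which avoids passing Python scalars to `torch.where` (some older PyTorch releases may not accept them). The function name `sanitize_labels` and the constant `SENTINEL` are hypothetical, and the replacement values 0 and -1 are taken from the workaround above. Note that this hides the corrupted values rather than addressing whatever produces them.

```python
import torch

SENTINEL = 4294967295  # 2**32 - 1, the corrupted value reported in this thread


def sanitize_labels(gt_classes, fg_value=0, ignore_value=-1):
    """Return a copy of gt_classes with the two corrupted values replaced."""
    cleaned = gt_classes.clone()
    cleaned[cleaned == SENTINEL] = fg_value        # 4294967295 -> 0
    cleaned[cleaned == -SENTINEL] = ignore_value   # -4294967295 -> -1
    return cleaned


labels = torch.tensor([0, 1, SENTINEL, -SENTINEL, -1], dtype=torch.int64)
print(sanitize_labels(labels).tolist())
```

Mapping 4294967295 to class 0 is only reasonable for a single-class dataset like the one described here; with several classes the original label cannot be recovered this way.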