Status: Closed (lyon-v closed this issue 3 years ago)
I printed gt_label and its maximum value was 4294967295. I don't know what happened.
This issue occurs when using cuda9. We recommend using cuda10 for training.
But I'm using cuda10.2 for training
Does this error occur every time or only randomly? And what have you modified?
This error occurs randomly. In the file "YOLOF/yolof/modeling/yolof.py", at about line 404, there is the statement 'gt_classes[src_idx] = target_classes_o'. I printed 'gt_classes' and 'target_classes_o'. When the error occurs, 'gt_classes.dtype' is None, but the dtype of 'target_classes_o' is fine (int64). I don't know why. It looks like a numeric overflow, because 4294967295 is 2^32 - 1. Only the values 4294967295 and -4294967295 appear here.
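To illustrate the overflow suspicion above: 4294967295 is exactly the largest unsigned 32-bit value, and it is also what the bit pattern of a signed 32-bit -1 reads as when interpreted as unsigned. The following is a minimal sketch in plain Python. It only demonstrates the arithmetic and is not a confirmed diagnosis of what happens inside YOLOF; the helper name `reinterpret_int32_as_uint32` is hypothetical, and this does not by itself explain the -4294967295 value.

```python
import struct

# Largest value representable in an unsigned 32-bit integer.
UINT32_MAX = 2**32 - 1


def reinterpret_int32_as_uint32(value: int) -> int:
    """Pack `value` as a signed 32-bit int, then read the same 4 bytes as unsigned."""
    return struct.unpack("<I", struct.pack("<i", value))[0]


print(UINT32_MAX)                       # 4294967295
print(reinterpret_int32_as_uint32(-1))  # 4294967295
```

If -1 is used as a label somewhere, a signed/unsigned mix-up of this kind would be one possible source of the value seen in gt_classes.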
I only modified the config value _C.MODEL.YOLOF.DECODER.NUM_CLASSES. My workaround is:
gt_classes[src_idx] = torch.where(gt_classes[src_idx] == 4294967295, 0, gt_classes[src_idx])
gt_classes[src_idx] = torch.where(gt_classes[src_idx] == -4294967295, -1, gt_classes[src_idx])
The problem: there is only one category in my dataset, so I changed the config and ran the code. Sometimes it trains for several iterations and then hits this error.
Traceback (most recent call last):
File "./tools/w_train.py", line 270, in <module>
args=(args,),
File "/home/wuliang/cvprojects/detectron2/detectron2/engine/launch.py", line 82, in launch
main_func(args)
File "./tools/w_train.py", line 257, in main
return trainer.train()
File "/home/wuliang/cvprojects/detectron2/detectron2/engine/defaults.py", line 485, in train
super().train(self.start_iter, self.max_iter)
File "/home/wuliang/cvprojects/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/wuliang/cvprojects/detectron2/detectron2/engine/defaults.py", line 495, in run_step
self._trainer.run_step()
File "/home/wuliang/cvprojects/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
loss_dict = self.model(data)
File "/home/wuliang/anaconda3/envs/pyt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(input, **kwargs)
File "/home/wuliang/cvprojects/YOLOF/yolof/modeling/yolof.py", line 295, in forward
pred_logits, pred_anchor_deltas)
File "/home/wuliang/cvprojects/YOLOF/yolof/modeling/yolof.py", line 404, in losses
pred_class_logits[valid_idxs],
RuntimeError: CUDA error: device-side assert triggered
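A device-side assert raised during loss computation is often caused by a class label outside the range the loss expects. Because CUDA errors are reported asynchronously, the line shown in the traceback may not be the operation that actually failed. One way to narrow this down is to check the labels on the CPU before they reach the loss. Below is a minimal sketch of such a check in plain Python. The function name `find_bad_labels` is hypothetical, and the assumed label convention (0 to num_classes - 1 for foreground, num_classes for background, -1 for ignored anchors) should be verified against the YOLOF code.

```python
from typing import Iterable, List, Tuple


def find_bad_labels(
    labels: Iterable[int], num_classes: int, ignore_label: int = -1
) -> List[Tuple[int, int]]:
    """Return (position, value) pairs for labels outside the expected range.

    Assumption: valid labels are 0..num_classes (num_classes meaning
    background) plus `ignore_label` for ignored anchors.
    """
    bad = []
    for pos, value in enumerate(labels):
        in_range = 0 <= value <= num_classes
        if not in_range and value != ignore_label:
            bad.append((pos, value))
    return bad


# Example with a single-category dataset (num_classes=1).
labels = [0, 1, -1, 4294967295, 0]
print(find_bad_labels(labels, num_classes=1))  # [(3, 4294967295)]
```

With a tensor, the labels could be passed as `gt_classes.flatten().tolist()` just before the loss call, so that a bad value is reported with its position instead of surfacing later as a CUDA assert.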