WongKinYiu / yolor

implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks (https://arxiv.org/abs/2105.04206)
GNU General Public License v3.0

RuntimeError: Trying to create tensor with negative dimension #260

Open Rusteam opened 2 years ago

Rusteam commented 2 years ago

Hi there,

While training, I get the following error at the test stage after some number of epochs:

Traceback (most recent call last):
  File "/usr/src/app/pipelines/yolor/../../src/models/yolor/train.py", line 537, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "/usr/src/app/pipelines/yolor/../../src/models/yolor/train.py", line 336, in train
    results, maps, times = test.test(opt.data,
  File "/usr/src/app/src/models/yolor/test.py", line 134, in test
    output = non_max_suppression(inf_out, conf_thres=conf_thres, iou_thres=iou_thres)
  File "/usr/src/app/src/models/yolor/utils/general.py", line 341, in non_max_suppression
    i = torch.ops.torchvision.nms(boxes, scores, iou_thres)
  File "/usr/local/lib/python3.9/dist-packages/torch/_ops.py", line 142, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: Trying to create tensor with negative dimension -726820594: [-726820594]

My env:

torch=='1.12.0.dev20220314+cu102'
torchvision=='0.13.0.dev20220314+cu102'
python 3.9.10
YiCheno commented 2 years ago

Hi there,

I got the same error message as you with batch_size=2. I think the error is related to batch_size, because when I changed it to batch_size=3 the error disappeared.

I don't know the full cause of the error, but I can train on my dataset this way. If I find the cause, I will post it here.

Hope this helps.

My env:

python 3.7
torch==1.7.0+cu101 
torchvision==0.8.1+cu101 
torchaudio==0.7.0

GPU: RTX2080Ti 11G
Rusteam commented 2 years ago

I'm not sure it is about batch size, because it happens after some number of epochs. Say it has been training and testing fine for 15 epochs, and then it suddenly throws this error.

Also it feels that the value is a box coordinate and it should not be that high.

YiCheno commented 1 year ago

Update: I debugged the code and found the cause of this error in ./utils/general.py. Around lines 320 to 350 of that file you can see the following code:

320     # Box (center x, center y, width, height) to (x1, y1, x2, y2)
321     box = xywh2xyxy(x[:, :4])
... ...

347     # Batched NMS
348     c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
349     boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
350     i = torch.ops.torchvision.nms(boxes, scores, iou_thres)

If you debug the code while training your model, you will see at line 350 that the boxes tensor is very large. boxes (line 350) and box (line 321) are float32 and float16 tensors on the GPU, so I think this is where the error occurs.
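For readers unfamiliar with lines 348 to 349, here is a minimal pure-Python sketch of the class-offset trick used for batched NMS. It is an illustration, not the repository's code: the helper names iou and offset_by_class are hypothetical, and max_wh=4096 is an assumed value. Shifting every box by class index times max_wh moves each class into its own region of the coordinate space, so boxes of different classes can never overlap and a single NMS call handles all classes at once.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def offset_by_class(box, cls, max_wh=4096):
    """Shift a box by cls * max_wh (hypothetical helper; max_wh is an assumed value)."""
    shift = cls * max_wh
    return tuple(v + shift for v in box)


box = (10.0, 10.0, 50.0, 50.0)

# The same box predicted for two different classes no longer overlaps after the offset.
different_classes = iou(offset_by_class(box, 0), offset_by_class(box, 1))

# The same box predicted twice for one class still overlaps fully.
same_class = iou(offset_by_class(box, 2), offset_by_class(box, 2))

print(different_classes, same_class)
```

The offset only changes which boxes can suppress each other. It does not reduce how many boxes are passed to NMS, and that count is what the conf_thres change affects.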

My solution: I changed the default value of conf_thres at line 35 of ./test.py, as follows:

31    def test(data,
32             weights=None,
33             batch_size=16,
34             imgsz=640,
35             conf_thres=0.001,
36             iou_thres=0.6,  # for NMS
37             save_json=False,
38             single_cls=False,
39             augment=False,
40             verbose=False,
41             model=None,
42             dataloader=None,
43             save_dir=Path(''),  # for saving images
44             save_txt=False,  # for auto-labelling
45             save_conf=False,
46             plots=True,
47             log_imgs=0):  # number of logged images

# After modification.

31    def test(data,
32             weights=None,
33             batch_size=16,
34             imgsz=640,
35             conf_thres=0.01,
36             iou_thres=0.6,  # for NMS
37             save_json=False,
38             single_cls=False,
39             augment=False,
40             verbose=False,
41             model=None,
42             dataloader=None,
43             save_dir=Path(''),  # for saving images
44             save_txt=False,  # for auto-labelling
45             save_conf=False,
46             plots=True,
47             log_imgs=0):  # number of logged images

This change eliminates the error for me. Hope this helps you. @Rusteam

Rusteam commented 1 year ago

Did it help?

YiCheno commented 1 year ago

Yes, this method helped me.

I trained YOLOR on my own dataset. My dataset is for small-object detection and I changed YOLOR's architecture, which is why the model produced a lot of boxes. Raising conf_thres removes many of those boxes. You should adjust conf_thres according to your dataset and model architecture. It is worth noting that you should not let the number of boxes become too small, since the model uses them. If you want specific values for these parameters, you can refer to the official example.
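To make the effect of conf_thres concrete, here is a small pure-Python sketch. It assumes that candidates are kept when their confidence is strictly greater than the threshold. The scores are made-up illustration values and count_candidates is a hypothetical helper, not a function from the repository.

```python
def count_candidates(scores, conf_thres):
    """Count predictions whose confidence exceeds conf_thres (hypothetical helper)."""
    return sum(1 for s in scores if s > conf_thres)


# Made-up confidences; a real model emits thousands of low-confidence candidates.
scores = [0.0005, 0.002, 0.004, 0.008, 0.02, 0.3, 0.9]

kept_default = count_candidates(scores, 0.001)  # original default in test.py
kept_raised = count_candidates(scores, 0.01)    # value after the modification

print(kept_default, kept_raised)
```

With a model that emits many low-confidence boxes, the same comparison on real predictions shows how much a tenfold increase of the threshold shrinks the input to NMS. The right value depends on the dataset and architecture, as noted above.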