Error when training for CUDA 11.6

tailtNinjavanVN commented 2 years ago

Hi bro, thanks for your nice project!

This is the training command I tested with your configs:

python polygon_train.py --weights "" --cfg polygon_yolov5s_ucas.yaml  \
--data polygon_ucas.yaml --hyp hyp.ucas.yaml --img-size 1024 \
--epochs 1000 --batch-size 1 --noautoanchor --polygon --cache

However, I got this error during training:

Traceback (most recent call last):
  File "polygon_train.py", line 551, in <module>
    train(hyp, opt, device, tb_writer, polygon=opt.polygon)
  File "polygon_train.py", line 312, in train
    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
  File "/home/tailt/Workspace/PolygonObjectDetection/polygon-yolov5/utils/loss.py", line 258, in __call__
    tcls, tbox, indices, anchors = self.build_targets(p, targets)  # targets for computing loss
  File "/home/tailt/Workspace/PolygonObjectDetection/polygon-yolov5/utils/loss.py", line 385, in build_targets
    indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))  # image, anchor, grid indices
RuntimeError: result type Float can't be cast to the desired output type long int

My environment:

Ubuntu 20.04
CUDA 11.6
torch 1.13.0.dev20220619+cu116
torchvision 0.14.0.dev20220619+cu116

How I installed torch and torchvision:

pip install torch torchvision --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu116

I hope the project can be fixed to work with torch + torchvision (CUDA 11.6) in the future. At the moment, I think this error is difficult to fix!

Thanks in advance for your reply!

Hezhexi2002 commented 2 years ago

I installed CUDA 11.6 on Windows 10 and it works fine, with only a warning. Maybe try it again on Windows 10.

pocca2048 commented 2 years ago

It is not due to the CUDA version but to the torch version! See https://github.com/ultralytics/yolov5/issues/8405
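For anyone hitting the same traceback, here is a minimal sketch of one possible workaround. It assumes the failure comes from clamping the integer (long) grid indices in place with bounds taken from the float `gain` tensor, as line 385 of utils/loss.py in the traceback does. The helper name `clamp_grid_indices`, the length of `gain`, and the sample values are made up for illustration and are not from the repository.

```python
import torch


def clamp_grid_indices(gi, gj, gain):
    """Clamp long grid indices in place, using plain Python int bounds.

    Converting the float entries of `gain` to int first means the in-place
    clamp never has to promote the long tensors to float.
    """
    max_x = int(gain[2]) - 1  # grid width - 1
    max_y = int(gain[3]) - 1  # grid height - 1
    gi.clamp_(0, max_x)
    gj.clamp_(0, max_y)
    return gi, gj


# Hypothetical values: a float gain tensor and long index tensors,
# mirroring the dtypes named in the error message.
gain = torch.ones(7)
gain[2:4] = torch.tensor([128.0, 128.0])
gi = torch.tensor([-1, 5, 200])
gj = torch.tensor([0, 127, 128])

gi, gj = clamp_grid_indices(gi, gj, gain)
print(gi.dtype, gi.tolist(), gj.tolist())
```

In loss.py this would amount to passing int(gain[3]) - 1 and int(gain[2]) - 1 as the upper bounds to clamp_. Installing an older stable torch release instead of the nightly build may also avoid the error, but I have not verified that with this repository.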

XinzeLee / PolygonObjectDetection

Error when training for CUDA 11.6 #26