WongKinYiu / yolor

Implementation of the paper "You Only Learn One Representation: Unified Network for Multiple Tasks" (https://arxiv.org/abs/2105.04206)
GNU General Public License v3.0

Training on Multiple GPUs Error #19

OrjwanZaafarani opened this issue 3 years ago (status: Open)

OrjwanZaafarani commented 3 years ago

How can I fix this error when training on multiple GPUs?

Traceback (most recent call last):
  File "train.py", line 537, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 288, in train
    loss, loss_items = compute_loss(pred, targets.to(device), model)  # loss scaled by batch_size
  File "/yolor/utils/loss.py", line 66, in compute_loss
    tcls, tbox, indices, anchors = build_targets(p, targets, model)  # targets
  File "/yolor/utils/loss.py", line 145, in build_targets
    r = t[None, :, 4:6] / anchors[:, None]  # wh ratio
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
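
For context, this error means two tensors in a single operation live on different devices. A minimal sketch that reproduces the same RuntimeError (assuming a CUDA device is available; the names only mirror the traceback, not the actual loss.py code):

    import torch

    # Minimal reproduction of the same error class (assumes a CUDA device):
    # any binary op mixing a cuda:0 tensor with a CPU tensor fails with
    # "Expected all tensors to be on the same device".
    t = torch.rand(8, 6, device="cuda:0")   # stands in for targets.to(device)
    anchors = torch.rand(3, 2)              # stands in for anchors left on the CPU
    r = t[None, :, 4:6] / anchors[:, None]  # raises the RuntimeError from the traceback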
WongKinYiu commented 3 years ago

Are you using DDP training as shown in https://github.com/WongKinYiu/yolor#training?

OrjwanZaafarani commented 3 years ago

I used this command:

    python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --batch-size 16 --img 1280 1280 --data coco.yaml --cfg cfg/yolor_p6.cfg --weights '' --device 0,1 --sync-bn --name yolor_p6 --hyp hyp.scratch.1280.yaml --epochs 300
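
A likely cause, given the traceback, is that the anchor tensor is created on the CPU while the targets are moved to cuda:0 in compute_loss. A minimal sketch of the usual device-alignment workaround (an assumption, not the repository's confirmed patch; the names mirror the traceback and the snippet also runs on a CPU-only machine):

    import torch

    # Sketch of the usual fix for this error class, applied to the line
    # from the traceback: move the anchors onto the same device as the
    # targets before dividing.
    t = torch.rand(8, 6, device="cuda:0" if torch.cuda.is_available() else "cpu")
    anchors = torch.rand(3, 2)                  # e.g. anchors built on the CPU
    anchors = anchors.to(t.device)              # one-line device alignment
    r = t[None, :, 4:6] / anchors[:, None]      # wh ratio now computed on one device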