WongKinYiu / PyTorch_YOLOv4

PyTorch implementation of YOLOv4

RuntimeError while training, tensor type ambiguity in validation code #75

Closed Derpimort closed 4 years ago

Derpimort commented 4 years ago

Trying to train the network on a custom object detection dataset. I'm running it in a docker container on a single Tesla V100. Steps:

  1. Install mish-cuda and all requirements from pip.
  2. Copy yolov4l-mish.yaml to the parent directory and change nc to 3.
  3. Run python train.py --data ../data/data.yaml --cfg ../yolov4l-mish.yaml --img-size 480 --batch-size 16 --device 0 --cache-images --weights ''

It completes one full training loop and then throws the following error in the validation loop.

Traceback (most recent call last):
  File "train.py", line 468, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 346, in train
    save_dir=log_dir)
  File "/workspace/Safety-Surveillance/PyTorch_YOLOv4/test.py", line 90, in test
    inf_out, train_out = model(img, augment=augment)  # inference and training outputs
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/Safety-Surveillance/PyTorch_YOLOv4/models/yolo.py", line 99, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/workspace/Safety-Surveillance/PyTorch_YOLOv4/models/yolo.py", line 119, in forward_once
    x = m(x)  # run
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/Safety-Surveillance/PyTorch_YOLOv4/models/common.py", line 141, in forward
    return self.cv7(self.act(self.bn(torch.cat((y1, y2), dim=1))))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
    self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 2016, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'weight'; but type torch.cuda.FloatTensor does not equal torch.cuda.HalfTensor (while checking arguments for cudnn_batch_norm)
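
For context, the same kind of mismatch can be reproduced outside the repo in a few lines (a minimal sketch, not the repo's code): FP16 BatchNorm weights being fed an FP32 activation.

import torch

# Minimal standalone sketch (not the repo's code): BatchNorm weights in FP16,
# incoming activation still in FP32; same mismatch as the traceback above.
bn = torch.nn.BatchNorm2d(8).cuda().half()      # weight/bias become torch.cuda.HalfTensor
x = torch.randn(1, 8, 16, 16, device='cuda')    # activation stays torch.cuda.FloatTensor
y = bn(x)  # raises a RuntimeError about 'input' vs 'weight' tensor types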

Some helpful info:

CUDA Version: 10.2
Driver Version: 410.79
PyTorch Version: 1.6.0
torchvision Version: 0.7.0

Device detected and Namespace output:

Using CUDA Apex device0 _CudaDeviceProperties(name='Tesla V100-DGXS-32GB', total_memory=32478MB)

Namespace(batch_size=16, bucket='', cache_images=True, cfg='../yolov4l-mish.yaml', data='../data/data.yaml', device='0', epochs=300, evolve=False, hyp='', img_size=[480, 480], local_rank=-1, multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, sync_bn=False, total_batch_size=16, weights='', world_size=1)
StarBurstStream0 commented 4 years ago

I got the same issue. How could this happen?

Derpimort commented 4 years ago

I "kind of" fixed it, more of a workaround because now I can't use the half precision advantage, by commenting out the following lines https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/8f006d351bf1ac888239cfeaf6fcd4a31eb866ca/test.py#L51-L52

From what I've gathered, at some point the inputs are implicitly converted from half back to float, even though the test code does convert them to half precision. I thought it was happening in the author's code and tried to change the following lines and add more explicit conversions... https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/699acbfea3fa8773d72ea4f2f71120858b6a2435/test.py#L77-L89

But no luck so far; I'll update if I get it working with half precision.
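
For reference, the explicit conversion I experimented with looked roughly like this (a sketch with my own variable handling, not the repo's exact loop):

param_dtype = next(model.parameters()).dtype                      # FP16 if model.half() was called
img = img.to(device, non_blocking=True).to(param_dtype) / 255.0   # force the batch to match the weights
with torch.no_grad():
    inf_out, train_out = model(img, augment=augment)              # still hit the same dtype error for me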

WongKinYiu commented 4 years ago

It's because PyTorch 1.6 contains native AMP. I used PyTorch 1.5.1 and installed Apex for AMP training.
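
If you want to stay on PyTorch 1.6, one option is to keep the model in FP32 and wrap the eval forward pass in native AMP autocast instead of calling model.half() (a sketch, not something the repo does yet):

model.float()  # keep the weights in FP32; autocast downcasts per op where it is safe
with torch.no_grad(), torch.cuda.amp.autocast():
    inf_out, train_out = model(img, augment=augment)  # BatchNorm runs in FP32, convs in FP16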

Derpimort commented 4 years ago

Oh ok, makes sense. Great repo btw, it really helped a lot.