Open adamfarquhar opened 3 years ago
My thoughts: `learner.fp16` is causing every internal function to receive tensors of different dtypes.
I recommend using the PyTorch Lightning trainer and having a go with `fp16`, since I am quite sure that torchvision allows mixed precision training for FRCNN.
The error stack from `rpn` is definitely unrelated to torchvision. I think it is a `fastai` issue that mishandles these tensors.
A minimal example is here:

```python
import torch, torchvision

device = torch.device('cuda')
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()
model.to(device)
input = [torch.rand(3, 300, 400, device=device)]
boxes = torch.rand((5, 4), dtype=torch.float32, device=device)
boxes[:, 2:] += boxes[:, :2]
target = [{"boxes": boxes,
           "labels": torch.zeros(5, dtype=torch.int64, device=device),
           "image_id": 4,
           "area": torch.zeros(5, dtype=torch.float32, device=device),
           "iscrowd": torch.zeros((5,), dtype=torch.int64, device=device)}]
# use automatic mixed precision
with torch.cuda.amp.autocast():
    loss_dict = model(input, target)
    losses = sum(loss for loss in loss_dict.values())
# perform backward outside of autocast context manager
losses.backward()
```
Here too, FRCNN calls the `rpn.py` code, which contains these functions.
Also, the dtype doesn't matter for `box_iou` or any other box operation, since they expect a tensor of any floating-point dtype.
Can you train once with Lightning and see? If the error still persists, let me know; the PyTorch code above will definitely work fine, though.
I can confirm that the same issue exists using PyTorch Lightning.
Steps to reproduce:
Add `precision=16` to the `pl.Trainer` args.
@ai-fast-track it's very likely a lot of effort to fix this. Shall I close?
@ai-fast-track wdyt?
🐛 Bug
Describe the bug
When I try to train a Faster RCNN model using IceVision and fastai, it fails with the RuntimeError:
Expected object of scalar type c10::Half but got scalar type float for argument 'other'
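For context, this class of error generally shows up when an operation that requires matching dtypes receives one half-precision and one single-precision tensor. The snippet below is only a generic sketch of that situation on CPU with toy tensors; it is not the IceVision or fastai code path, the helper `try_mm` is hypothetical, and the exact error wording differs between PyTorch versions.

```python
import torch

a = torch.ones(2, 2, dtype=torch.float16)  # half-precision operand
b = torch.ones(2, 2, dtype=torch.float32)  # single-precision operand


def try_mm(x, y):
    # Hypothetical helper: return (result, None) on success
    # or (None, error message) if the matmul raises.
    try:
        return torch.mm(x, y), None
    except RuntimeError as err:
        return None, str(err)


# Mixed dtypes: matrix multiplication rejects the mismatch.
result, message = try_mm(a, b)
print(message)

# Casting both operands to a common dtype avoids the error.
fixed, fixed_message = try_mm(a.float(), b)
print(fixed)
```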
To Reproduce
Steps to reproduce the behavior:
- `faster_rcnn.fastai.learner` under full precision
- `learn.to_fp16()`

Expected behavior
Training as normal.