facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.

RuntimeError: "SigmoidFocalLoss_forward" not implemented for 'Half' #1048

Open dedoogong opened 5 years ago

dedoogong commented 5 years ago

🐛 Bug

I can train the standard Faster/Mask R-CNN with R-50-FPN in FP16, but training fails with RetinaNet (the sigmoid focal loss is the problem).

I think there are two options:

  1. Implement a Half-based version of the CUDA kernel.
  2. Use apex to handle the FP32/FP16 conversion during the forward/backward pass (see the sketch below).
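For option 2, apex's amp API can be told to always run a custom op in FP32. A minimal sketch, assuming the compiled extension exposes the op as maskrcnn_benchmark._C.sigmoid_focalloss_forward (the exact binding name may differ in your build):

from apex import amp
from maskrcnn_benchmark import _C

# Must be called before amp.initialize(model, optimizer, ...): amp then wraps
# the op so Half inputs are cast up to FP32 before the kernel runs, instead of
# failing with "not implemented for 'Half'".
amp.register_float_function(_C, "sigmoid_focalloss_forward")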

I found that the current Sigmoid Focal Loss CUDA code doesn't support FP16 (Half). For example, I could train the normal Faster R-CNN R-50-FPN, but RetinaNet R-50-FPN (which uses the sigmoid focal loss) failed on FP16 (using the O1 opt level).
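For context, DTYPE "float16" enables apex amp at the O1 level in tools/train_net.py, roughly like this (paraphrased; check the repo for the exact lines):

# Initialize mixed-precision training
use_mixed_precision = cfg.DTYPE == "float16"
amp_opt_level = "O1" if use_mixed_precision else "O0"
model, optimizer = amp.initialize(model, optimizer, opt_level=amp_opt_level)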

When I use the CPU version of Sigmoid Focal Loss, it runs fine for both the original maskrcnn-benchmark RetinaNet models and FCOS's RetinaNet-style models.

BUT the estimated training time is almost one month(!), and after some iterations only the sigmoid focal loss (loss_cls) becomes NaN:

2019-08-16 17:42:57,158 maskrcnn_benchmark.trainer INFO: Start training
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
2019-08-16 17:45:22,396 maskrcnn_benchmark.trainer INFO: eta: 30 days, 6:08:34 iter: 20 loss: 4.3898 (5.7831) loss_centerness: 0.6876 (0.7018) loss_cls: 1.0846 (1.0754) loss_reg: 2.5395 (4.0059) time: 8.5586 (7.2618) data: 0.0029 (0.0810) lr: 0.003333 max mem: 6133
2019-08-16 17:46:51,064 maskrcnn_benchmark.trainer INFO: eta: 24 days, 8:41:46 iter: 40 loss: 3.3856 (4.6105) loss_centerness: 0.6687 (0.6854) loss_cls: 1.0689 (1.0852) loss_reg: 1.6177 (2.8399) time: 3.0172 (5.8476) data: 0.0025 (0.0420) lr: 0.003333 max mem: 6133
[loss scale keeps halving on almost every step: 512.0, 256.0, ..., 0.03125; each overflow message appears twice in the original log]
2019-08-16 17:48:29,613 maskrcnn_benchmark.trainer INFO: eta: 23 days, 1:59:50 iter: 60 loss: nan (nan) loss_centerness: 0.6633 (0.6788) loss_cls: nan (nan) loss_reg: 1.5505 (2.3987) time: 4.5355 (5.5409) data: 0.0030 (0.0291) lr: 0.003333 max mem: 6133
[loss scale continues halving down to 2.9802322387695312e-08]
2019-08-16 17:49:48,975 maskrcnn_benchmark.trainer INFO: eta: 21 days, 10:39:17 iter: 80 loss: nan (nan) loss_centerness: 0.6606 (0.6748) loss_cls: nan (nan) loss_reg: 1.5613 (2.1857) time: 3.1732 (5.1477) data: 0.0024 (0.0225) lr: 0.003333 max mem: 6133
[loss scale continues halving down to 2.842170943040401e-14]
2019-08-16 17:51:34,847 maskrcnn_benchmark.trainer INFO: eta: 21 days, 13:32:36 iter: 100 loss: nan (nan) loss_centerness: 0.6631 (0.6728) loss_cls: nan (nan) loss_reg: 1.5312 (2.0573) time: 7.3970 (5.1769) data: 0.0026 (0.0186) lr: 0.003333 max mem: 6133
[loss scale continues halving down to 2.710505431213761e-20]
2019-08-16 17:52:37,750 maskrcnn_benchmark.trainer INFO: eta: 20 days, 3:39:49 iter: 120 loss: nan (nan) loss_centerness: 0.6604 (0.6710) loss_cls: nan (nan) loss_reg: 1.5168 (1.9788) time: 2.9350 (4.8383) data: 0.0030 (0.0161) lr: 0.003333 max mem: 6133
[loss scale continues halving down to 3.3881317890172014e-21]
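For reference, the same per-element loss can be expressed with stock PyTorch ops, which do accept Half and compute the cross-entropy term stably via log-sum-exp. A sketch only, written against the (N, num_classes) logits / integer targets layout the CUDA kernel expects; upcasting to FP32 inside the loss is still the safer choice under amp:

import torch
import torch.nn.functional as F

def sigmoid_focal_loss_pytorch(logits, targets, gamma=2.0, alpha=0.25):
    # logits: (N, C) raw scores; targets: (N,) int64 labels where 0 is
    # background and 1..C are foreground classes (the kernel's convention).
    num_classes = logits.shape[1]
    class_range = torch.arange(1, num_classes + 1, device=logits.device)
    t = (targets.unsqueeze(1) == class_range.unsqueeze(0)).to(logits.dtype)
    p = torch.sigmoid(logits)
    # Numerically stable BCE term: -log(p_t), elementwise over (N, C).
    ce = F.binary_cross_entropy_with_logits(logits, t, reduction="none")
    p_t = p * t + (1 - p) * (1 - t)
    alpha_t = alpha * t + (1 - alpha) * (1 - t)
    return alpha_t * (1 - p_t) ** gamma * ce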

To Reproduce

After installing this repo, I just run the following command:

python3 -m torch.distributed.launch --nproc_per_node=$NGPUS /home/ktai01/maskrcnn-benchmark/tools/train_net.py --config-file /home/ktai01/maskrcnn-benchmark/configs/retinanet/retinanet_R-50-FPN_P5_1x.yaml MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 1000 DTYPE "float16"

Traceback (most recent call last):
  File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 191, in <module>
    main()
  File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 184, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 85, in train
    arguments,
  File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 71, in do_train
    loss_dict = model(images, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 131, in forward
    return self._forward_train(anchors, box_cls, box_regression, targets)
  File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 138, in _forward_train
    anchors, box_cls, box_regression, targets
  File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/loss.py", line 77, in __call__
    labels
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/layers/sigmoid_focal_loss.py", line 68, in forward
    loss = loss_func(logits, targets, self.gamma, self.alpha)
  File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/layers/sigmoid_focal_loss.py", line 19, in forward
    logits, targets, num_classes, gamma, alpha
RuntimeError: "SigmoidFocalLoss_forward" not implemented for 'Half' (operator() at /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/SigmoidFocalLoss_cuda.cu:139)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fdc14f1b441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fdc14f1ad7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x4917f (0x7fdbae35c17f in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: SigmoidFocalLoss_forward_cuda(at::Tensor const&, at::Tensor const&, int, float, float) + 0x606 (0x7fdbae35c7f5 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: SigmoidFocalLoss_forward(at::Tensor const&, at::Tensor const&, int, float, float) + 0x64 (0x7fdbae32cb44 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x28fcf (0x7fdbae33bfcf in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x25291 (0x7fdbae338291 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
[frames #7-#63: Python interpreter frames, omitted]

(the same traceback is printed a second time by the other worker process, differing only in shared-library load addresses)

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', '/home/ktai01/maskrcnn-benchmark/tools/train_net.py', '--local_rank=0', '--config-file', '/home/ktai01/maskrcnn-benchmark/configs/retinanet/retinanet_R-50-FPN_P5_1x.yaml', 'MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN', '1000', 'DTYPE', 'float16']' returned non-zero exit status 1.

stanstarks commented 5 years ago

Hi, @dedoogong ,

I think you can convert the logits to float32 before the loss computation and convert back afterwards in the forward function:

orig_type = logits.dtype
logits = logits.type(torch.float32)  # upcast so the FP32-only kernel accepts the input
loss = loss_func(logits, targets, self.gamma, self.alpha)
return loss.sum().type(orig_type)    # cast the reduced loss back to the original dtype

https://github.com/facebookresearch/maskrcnn-benchmark/blob/24c8c90efdb7cc51381af5ce0205b23567c3cd21/maskrcnn_benchmark/layers/sigmoid_focal_loss.py#L61-L69
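Dropping that into the linked forward would look roughly like this (a sketch following the structure of sigmoid_focal_loss.py at the permalink above; not tested):

import torch
from torch import nn

class SigmoidFocalLoss(nn.Module):
    def __init__(self, gamma, alpha):
        super(SigmoidFocalLoss, self).__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, targets):
        # Same dispatch as the original file.
        loss_func = sigmoid_focal_loss_cuda if logits.is_cuda else sigmoid_focal_loss_cpu
        orig_type = logits.dtype
        # Run the kernel in FP32 regardless of the amp cast...
        loss = loss_func(logits.float(), targets, self.gamma, self.alpha)
        # ...and hand the summed loss back in the caller's dtype.
        return loss.sum().to(orig_type)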