Sense-X / TSD

1st place models in Google OpenImage Detection Challenge 2019
Apache License 2.0

An error occurred while using fp16 #2

Closed: Dragonsson closed this issue 4 years ago

Dragonsson commented 4 years ago

Traceback (most recent call last):
  File "tools/train.py", line 151, in <module>
    main()
  File "tools/train.py", line 147, in main
    meta=meta)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/apis/train.py", line 165, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/work/anaconda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 380, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/work/anaconda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 278, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/core/fp16/decorators.py", line 75, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/models/detectors/base.py", line 147, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/models/detectors/two_stage.py", line 218, in forward_train
    cls_score, bbox_pred, TSD_cls_score, TSD_bbox_pred, delta_c, delta_r = self.bbox_head(bbox_feats, x[:self.bbox_roi_extractor.num_inputs], rois)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/models/bbox_heads/tsd_bbox_head.py", line 245, in forward
    tsd_feats_cls = self.align_pooling_pc[i](feats[i], rois, delta_c)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool.py", line 341, in forward
    self.trans_std)
  File "/cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool.py", line 50, in forward
    ctx.part_size, ctx.sample_per_part, ctx.trans_std)
RuntimeError: expected scalar type Half but found Float (data at /home/work/anaconda/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:1386)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f532d3d4441 in /home/work/anaconda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f532d3d3d7a in /home/work/anaconda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::Half* at::Tensor::data<c10::Half>() const + 0xcf (0x7f524f2b135f in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x1aed4 (0x7f524f2aced4 in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1b681 (0x7f524f2ad681 in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #5: DeformablePSROIPoolForward(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, int, int, int, int, int, int, int, float, int, int, int, int, int, float) + 0x1aa (0x7f524f2ad944 in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #6: deform_psroi_pooling_cuda_forward(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, int, float, int, int, int, int, int, float) + 0x202 (0x7f524f29e892 in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1952d (0x7f524f2ab52d in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0x197ee (0x7f524f2ab7ee in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x160a5 (0x7f524f2a80a5 in /cache/user-job-dir/codes/TSD-master/mmdet/ops/dcn/deform_pool_cuda.cpython-36m-x86_64-linux-gnu.so)

frame #16: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7f532dbb2481 in /home/work/anaconda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Dragonsson commented 4 years ago

When I disable fp16, training runs normally.
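
For anyone unsure what "disabling fp16" looks like in practice: in mmdetection-style configs, mixed precision is usually switched on by a top-level `fp16` entry such as `fp16 = dict(loss_scale=512.)`. Assuming TSD inherits that convention (the `mmdet/core/fp16` path in the traceback suggests it does, but that is an assumption), removing or commenting out the entry falls back to fp32. The sketch below shows the idea on a plain dict; `without_fp16` and `example_cfg` are hypothetical names for illustration, not part of TSD or mmdetection.

```python
from copy import deepcopy


def without_fp16(cfg: dict) -> dict:
    """Return a copy of a config dict with the top-level `fp16` entry removed.

    Hypothetical helper: it only illustrates that dropping the `fp16` key
    is what turns mixed-precision training off in an mmdetection-style config.
    """
    cfg = deepcopy(cfg)       # leave the caller's config untouched
    cfg.pop("fp16", None)     # no error if the key is already absent
    return cfg


# Toy stand-in for a parsed config file; the values are illustrative only.
example_cfg = dict(total_epochs=12, fp16=dict(loss_scale=512.0))
fp32_cfg = without_fp16(example_cfg)

print(sorted(fp32_cfg))  # ['total_epochs']
```

In a real config file the equivalent step is simply deleting or commenting out the `fp16 = ...` line.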

songguanglu commented 4 years ago

The current version of TSD supports only fp32 training. We will add fp16 training support in the next update.
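
The `RuntimeError` above comes from a custom CUDA op receiving tensors of mixed precision (one `Half`, another `Float`). Until fp16 is supported, one generic way to keep such an op usable inside an otherwise half-precision model is to cast its floating-point inputs to fp32 and cast the result back. The following is a minimal sketch of that pattern, not the fix used in TSD; `call_in_fp32` and `toy_op` are hypothetical names, and `toy_op` merely stands in for the real pooling op.

```python
import torch


def call_in_fp32(op, *args, **kwargs):
    """Call `op` with every floating-point tensor argument cast to float32,
    then cast the result back to the dtype of the first tensor argument.

    Hypothetical helper; assumes `op` returns a single tensor.
    """
    def to_fp32(x):
        if torch.is_tensor(x) and x.is_floating_point():
            return x.float()
        return x  # ints, floats, and integer tensors pass through unchanged

    tensor_args = [a for a in args if torch.is_tensor(a)]
    out_dtype = tensor_args[0].dtype if tensor_args else torch.float32

    out = op(*[to_fp32(a) for a in args],
             **{k: to_fp32(v) for k, v in kwargs.items()})
    return out.to(out_dtype)


seen_dtypes = []


def toy_op(feats, offsets, scale=1.0):
    """Stand-in for a custom op that requires matching input dtypes."""
    seen_dtypes.append((feats.dtype, offsets.dtype))
    return (feats + offsets) * scale


feats = torch.ones(2, 3, dtype=torch.float16)     # half-precision features
offsets = torch.ones(2, 3, dtype=torch.float32)   # fp32 offsets (the mismatch)
out = call_in_fp32(toy_op, feats, offsets, scale=2.0)
```

The op sees only fp32 inputs, while the caller still gets a half-precision result. The cost is extra memory and casts around that one op.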

Dragonsson commented 4 years ago

Looking forward to it!