amazon-science / bigdetection

BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training
Apache License 2.0

Error when training HTC-CBV2 #11

Closed: liming-ai closed this issue 2 years ago

liming-ai commented 2 years ago

Hi @bryanyzhu @cailk

Thanks for your contribution. I created an environment following the README and tried to train the HTC-CBV2 config.

However, an error was raised:

Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    meta=meta)
  File "/home/tiger/code/bigdetection/mmdet/apis/train.py", line 189, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/tiger/code/bigdetection/mmdet/utils/optimizer.py", line 26, in after_train_iter
    scaled_loss.backward()
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [1, 256, 68, 92]], which is output 0 of ReluBackward0, is at version 4; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3739365) of binary: /home/tiger/miniconda3/envs/cbv2/bin/python

After adding torch.autograd.set_detect_anomaly(True), it shows:

  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    meta=meta)
  File "/home/tiger/code/bigdetection/mmdet/apis/train.py", line 189, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 53, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/tiger/code/bigdetection/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/detectors/base.py", line 171, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/detectors/two_stage.py", line 266, in forward_train
    **kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/roi_heads/htc_roi_head.py", line 244, in forward_train
    semantic_pred, semantic_feat = self.semantic_head(x)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/roi_heads/mask_heads/fused_semantic_head.py", line 86, in forward
    x = self.lateral_convs[self.fusion_level](feats[self.fusion_level])
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/cnn/bricks/conv_module.py", line 202, in forward
    x = self.activate(x)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/functional.py", line 1299, in relu
    result = torch.relu(input)
 (function _print_stack)

bryanyzhu commented 2 years ago

@cailk is investigating this; we will update here soon.

cailk commented 2 years ago

Hi, sorry for the late reply. This config runs in our environment without errors. Could you tell me which versions of MMCV and MMDet you are using?

liming-ai commented 2 years ago

Hi, thanks for your reply. I have fixed this issue.
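
For readers who hit the same RuntimeError: the thread does not say what the fix was, but the message itself ("is at version 4; expected version 0") can be read as a version-counter check. The sketch below is a toy model in plain Python, not PyTorch code; ToyTensor, SavedForBackward, relu and run are hypothetical names used only for illustration. It assumes, as the message suggests, that a ReLU output was saved for the backward pass and then written to in place four times before backward ran.

```python
# Toy model of an autograd-style version-counter check.
# NOT PyTorch code: every class and function here is hypothetical.


class ToyTensor:
    def __init__(self, values):
        self.values = list(values)
        self.version = 0  # bumped by every in-place write

    def add_(self, amount):
        """In-place add: mutates the data and bumps the version counter."""
        self.values = [v + amount for v in self.values]
        self.version += 1
        return self


class SavedForBackward:
    """Remembers a tensor and the version it had when forward saved it."""

    def __init__(self, tensor):
        self.tensor = tensor
        self.expected_version = tensor.version

    def unpack(self):
        # Backward refuses to use data that changed after it was saved.
        if self.tensor.version != self.expected_version:
            raise RuntimeError(
                f"saved tensor is at version {self.tensor.version}; "
                f"expected version {self.expected_version} instead"
            )
        return self.tensor.values


def relu(tensor):
    """Out-of-place ReLU that saves its output for the backward pass."""
    out = ToyTensor(max(v, 0.0) for v in tensor.values)
    return out, SavedForBackward(out)


def run(modify_in_place):
    out, saved = relu(ToyTensor([-1.0, 2.0]))
    if modify_in_place:
        for _ in range(4):
            out.add_(1.0)  # four in-place writes: version goes 0 -> 4
    else:
        out = ToyTensor(v + 4.0 for v in out.values)  # new tensor, saved one untouched
    try:
        saved.unpack()
        return "ok"
    except RuntimeError as err:
        return str(err)


print(run(modify_in_place=False))
print(run(modify_in_place=True))
```

In the real traceback, torch.autograd.set_detect_anomaly(True) reports the forward call whose output was later modified (here the ReLU inside the lateral conv of fused_semantic_head.py), not the in-place operation itself. The offending write therefore has to be found among the later consumers of that feature map.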