aim-uofa / AdelaiDet

AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
https://git.io/AdelaiDet

Error backprop Solov2 #552

Closed GiteZz closed 2 years ago

GiteZz commented 2 years ago

I was trying to run the Solov2 on my own dataset and I get this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8, 128, 200, 200]], which is output 0 of ReluBackward0, is at version 3; expected version 0 instead.

This was the traceback associated with the error:

Traceback (most recent call last):
  File "train_solo.py", line 116, in <module>
    main()
  File "/workspaces/mono/python/train_utils/train_utils/__init__.py", line 38, in inner
    func(config)
  File "train_solo.py", line 103, in main
    train(config)
  File "/workspaces/mono/python/train_utils/train_utils/__init__.py", line 155, in inner
    func(AddDcitNoDefault(config))
  File "train_solo.py", line 61, in train
    trainer.train()
  File "/workspaces/mono/python/instance_segmentation/instance_segmentation/train_net.py", line 102, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "/workspaces/mono/python/instance_segmentation/instance_segmentation/train_net.py", line 91, in train_loop
    self.run_step()
  File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(

PyTorch advised me to call torch.autograd.set_detect_anomaly(True) in the code to find more clues. With this change I got the following extra output:

[W python_anomaly_mode.cpp:104] Warning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
  File "train_solo.py", line 116, in <module>
    main()
  File "/workspaces/mono/python/train_utils/train_utils/__init__.py", line 38, in inner
    func(config)
  File "train_solo.py", line 103, in main
    train(config)
  File "/workspaces/mono/python/train_utils/train_utils/__init__.py", line 155, in inner
    func(AddDcitNoDefault(config))
  File "train_solo.py", line 61, in train
    trainer.train()
  File "/workspaces/mono/python/instance_segmentation/instance_segmentation/train_net.py", line 102, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "/workspaces/mono/python/instance_segmentation/instance_segmentation/train_net.py", line 91, in train_loop
    self.run_step()
  File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspaces/mono/python/AdelaiDet/adet/modeling/solov2/solov2.py", line 128, in forward
    mask_pred = self.mask_head(mask_features)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspaces/mono/python/AdelaiDet/adet/modeling/solov2/solov2.py", line 729, in forward
    feature_add_all_level = self.convs_all_levels[0](features[0])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1299, in relu
    result = torch.relu(input)

So apparently something is not backpropagating correctly through a ReLU operation. I tried changing every ReLU to inplace=False, but that gave the same result.

Any idea what might cause this?

    detectron_config.DATASETS.TEST = ds_by_subset["val"]
    detectron_config.DATASETS.TRAIN = ds_by_subset["train"]

    detectron_config.SOLVER.IMS_PER_BATCH = cfg.model.batch_size
    detectron_config.SOLVER.MAX_ITER = cfg.model.max_iter
    detectron_config.SOLVER.CHECKPOINT_PERIOD = cfg.model.checkpoint_period
    detectron_config.TEST.EVAL_PERIOD = cfg.model.eval_period
    detectron_config.MODEL.ROI_HEADS.NUM_CLASSES = len(cfg.classes)

These are the only changes I make to the config compared with the one in train_net.

605436079 commented 2 years ago

I also encountered this problem. Has it been solved?

KingAlejandro commented 2 years ago

I am having the same issue.

hmchuong commented 2 years ago

I changed this line to

feature_add_all_level = feature_add_all_level + self.convs_all_levels[i](mask_feat)

and that fixed it.
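
A possible reading of why this helps, sketched as a toy model in plain Python rather than PyTorch's actual implementation: autograd keeps the ReLU output for the backward pass and remembers which version of it was saved, and every in-place write to that tensor bumps its version. The sketch assumes that the original line accumulated in place (for example with `+=`; the thread does not show it) and that the mask head sums four feature levels, which would match "is at version 3; expected version 0". `ToyTensor`, `ToyReluBackward`, `toy_relu` and `accumulate` are hypothetical names used only for illustration.

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor: a list of values plus a version counter."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 0  # bumped by every in-place write

    def add_(self, other):
        # In-place add (like `a += b`): same object, version goes up.
        self.values = [a + b for a, b in zip(self.values, other.values)]
        self.version += 1
        return self

    def add(self, other):
        # Out-of-place add (like `a = a + b`): new object, original untouched.
        return ToyTensor(a + b for a, b in zip(self.values, other.values))


class ToyReluBackward:
    """Remembers the ReLU output and the version it had when it was saved."""

    def __init__(self, output):
        self.saved = output
        self.expected_version = output.version

    def check(self):
        # Toy version of the check that runs at backward time.
        if self.saved.version != self.expected_version:
            raise RuntimeError(
                f"saved tensor is at version {self.saved.version}; "
                f"expected version {self.expected_version} instead"
            )


def toy_relu(values):
    out = ToyTensor(max(v, 0.0) for v in values)
    return out, ToyReluBackward(out)


def accumulate(levels, inplace):
    """Sum per-level ReLU outputs, then run the backward-time version checks."""
    total, first_node = toy_relu(levels[0])
    nodes = [first_node]
    for level in levels[1:]:
        out, node = toy_relu(level)
        nodes.append(node)
        total = total.add_(out) if inplace else total.add(out)
    for node in nodes:
        node.check()
    return total.values


levels = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0], [2.0, 2.0]]

print(accumulate(levels, inplace=False))  # out-of-place: all checks pass

try:
    accumulate(levels, inplace=True)      # in-place: first ReLU output was overwritten
except RuntimeError as err:
    print(err)
```

In this model the ReLU itself is never in-place, which may be why setting inplace=False on the ReLU modules made no difference: the write that invalidates the saved output would be the accumulation that happens after the ReLU.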

aghand0ur commented 2 years ago

Check this pull request: https://github.com/aim-uofa/AdelaiDet/pull/568