Tsinghua-MARS-Lab / futr3d

Code for the paper "FUTR3D: A Unified Sensor Fusion Framework for 3D Detection"
Apache License 2.0

RuntimeError: Expected to mark a variable ready only once. #16

Closed · drilistbox closed this issue 1 year ago

drilistbox commented 2 years ago

To run plugin/futr3d/configs/lidar_cam/res101_01voxel_step_3e.py, I replaced FPNV2 with FPN and first commented out load_from = 'pretrained/res101_01voxel_pretrained.pth'.
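For reference, the two changes look roughly like this in the config (the FPN settings below are illustrative assumptions, not values copied from the repo):

```python
# Sketch of the two edits above, assuming a standard mmdetection FPN neck;
# the channel settings are assumptions, not taken from res101_01voxel_step_3e.py.
img_neck=dict(
    type='FPN',                          # replaces 'FPNV2'
    in_channels=[256, 512, 1024, 2048],  # ResNet-101 stage outputs
    out_channels=256,
    start_level=1,
    add_extra_convs='on_output',
    num_outs=4)
# load_from = 'pretrained/res101_01voxel_pretrained.pth'  # commented out
```

Even with these changes, I still hit the error below: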

```
Traceback (most recent call last):
  File "tools/train.py", line 263, in <module>
    main()
  File "tools/train.py", line 252, in main
    train_model(
  File "/home/projects/bev/futr3d/mmdetection3d/mmdet3d/apis/train.py", line 28, in train_model
    train_detector(
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmdet/apis/train.py", line 174, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 112, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
```
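The last frames of the trace point at torch/utils/checkpoint.py, i.e. at gradient checkpointing interacting with DDP. Below is a hypothetical minimal example (not from the futr3d codebase) of the failure mode the error message describes: the same parameters taking part in more than one reentrant checkpoint backward under DDP.

```python
# Hypothetical minimal reproduction: under DDP, running the reentrant
# torch.utils.checkpoint over the same parameters twice in one forward makes
# the reducer mark those parameters ready twice during backward.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(8, 8)

    def forward(self, x):
        # Two reentrant checkpoint segments over the same nn.Linear: its
        # weight and bias are replayed in two nested backward passes.
        # (On PyTorch >= 1.11, pass use_reentrant=True explicitly to keep
        # the reentrant behavior the error message refers to.)
        x = checkpoint(self.shared, x)
        return checkpoint(self.shared, x)


if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(Net())
    x = torch.randn(2, 8, requires_grad=True)
    # Raises: RuntimeError: Expected to mark a variable ready only once.
    model(x).sum().backward()
```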

zen-star commented 2 years ago

Have you solved the problem yet? I hit the same error at the beginning of distributed training. If I set with_cp=False it runs fine, but it consumes too much GPU memory.
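For reference, with_cp is the per-block gradient-checkpointing switch on the mmdetection ResNet backbone. A sketch of the relevant config block follows; the surrounding values are assumptions, not copied from res101_01voxel_step_3e.py:

```python
# Sketch of the with_cp knob, assuming the standard mmdetection ResNet config.
model = dict(
    # ... other model settings ...
    img_backbone=dict(
        type='ResNet',
        depth=101,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        with_cp=False,  # True enables per-block gradient checkpointing
        style='caffe'))
```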

Chiang97912 commented 1 year ago

> Have you solved the problem yet? I hit the same error at the beginning of distributed training. If I set with_cp=False it runs fine, but it consumes too much GPU memory.

Thanks, this works for me! Turning off checkpointing (setting with_cp=False) solves the problem.
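For later readers who need the memory savings that with_cp=True provides: two alternatives are commonly reported for this exact error in mmdetection-based code. Both are sketched below under stated assumptions; neither is verified against this repo's pinned mmcv/PyTorch versions.

```python
# Option 1: keep with_cp=True but disable DDP's unused-parameter search.
# find_unused_parameters=True together with reentrant checkpointing is a
# known trigger for "Expected to mark a variable ready only once". This is
# a top-level key in mmdetection-style config files:
find_unused_parameters = False

# Option 2 (PyTorch >= 1.11): the non-reentrant checkpoint implementation
# is compatible with DDP, e.g.
#   torch.utils.checkpoint.checkpoint(block, x, use_reentrant=False)
# mmcv's with_cp code path would need patching to pass this flag, so this
# is only a sketch of the underlying mechanism, not a drop-in config change.
```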