Have you solved the problem yet? I meet the same error at the beginning of distributed training. If I set with_cp=False, it works, but it consumes too much GPU memory.
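For context, with_cp is the gradient-checkpointing flag on the backbone in MMDetection-style configs. Below is a minimal sketch of the relevant config fragment, assuming the usual ResNet-101 image backbone; the surrounding field names are assumptions and may differ in the actual FUTR3D config:

```python
# Config fragment (sketch): disabling gradient checkpointing on the backbone.
# with_cp=True wraps each ResNet stage in torch.utils.checkpoint to save GPU
# memory; with_cp=False avoids the DDP error discussed in this thread, at the
# cost of higher memory usage.
model = dict(
    img_backbone=dict(
        type='ResNet',
        depth=101,
        with_cp=False,  # set to True to trade DDP compatibility for memory savings
        # ... other backbone settings unchanged
    ),
)
```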
Thanks, this works for me! Turning off checkpointing solves the problem.
To run plugin/futr3d/configs/lidar_cam/res101_01voxel_step_3e.py, I first replaced FPNV2 with FPN and commented out load_from = 'pretrained/res101_01voxel_pretrained.pth'. But I still meet the error below:
```
Traceback (most recent call last):
  File "tools/train.py", line 263, in <module>
    main()
  File "tools/train.py", line 252, in main
    train_model(
  File "/home/projects/bev/futr3d/mmdetection3d/mmdet3d/apis/train.py", line 28, in train_model
    train_detector(
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmdet/apis/train.py", line 174, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 112, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/workspace/miniconda3/envs/futr3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
```
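The message points at reason 2: with with_cp=True, the checkpointed backbone stages run reentrant backward passes, and under DDP the reducer can see the same parameter marked ready more than once (commonly when the same module is checkpointed repeatedly, or when find_unused_parameters=True is set in the config). A minimal, self-contained sketch of the failure mode the error describes, using a hypothetical toy module rather than FUTR3D code:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Shared(nn.Module):
    """Toy module whose parameters pass through checkpoint twice per forward."""

    def __init__(self):
        super().__init__()
        self.block = nn.Linear(8, 8)

    def forward(self, x):
        # The same self.block is wrapped by checkpoint twice, so its weight
        # participates in two reentrant backward passes; under DDP the reducer
        # then marks that gradient "ready" twice and raises the RuntimeError above.
        x = checkpoint(self.block, x)
        x = checkpoint(self.block, x)
        return x
```

Besides setting with_cp=False, two possible alternatives (both assume a newer PyTorch than the 1.8-era install shown in the traceback): non-reentrant checkpointing via checkpoint(..., use_reentrant=False), available from roughly PyTorch 1.11, sidesteps the reentrant-backward restriction; and making sure find_unused_parameters is not enabled in the config removes one common trigger of this error.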